omrisap/Qwen2.5-Math-1.5B-TreeRPO

Text Generation

reinforcement-learning

text-generation-inference

omrisap committed on Jul 20

Commit f94103e · verified · 1 Parent(s): 76767fe

Update README.md

Files changed (1)

README.md +1 -1

@@ -22,7 +22,7 @@ model_name: TreeRPO-Qwen2.5-Math-1.5B
 A 1.5B parameter math reasoning model fine-tuned with **TreeRPO**, a hierarchical extension of GRPO that assigns rewards to “thought” nodes (not just full completions). Achieves higher GSM8K accuracy with just ~10K supervised + RL examples and **no reward model**.
 🔎 **Full write-up (method, math, analysis):**
-[TreeRPO: Hierarchical Credit Assignment for Data-Efficient Math Reasoning](https://omrisapir.substack.com/publish/post/167273414)
+[TreeRPO: Hierarchical Credit Assignment for Reasoning in Language Models](https://omrisapir.substack.com/publish/post/167273414)
 ---