omrisap/Qwen2.5-Math-1.5B-TreeRPO

omrisap committed on Jul 20

Commit 3aff42c (verified) · 1 parent: f94103e

Update README.md

Files changed (1)

README.md +2 -2

@@ -28,7 +28,7 @@ A 1.5B parameter math reasoning model fine-tuned with **TreeRPO**, a hierarchica
 ## Model Details
 - **Base model:** [`Qwen/Qwen2.5-Math-1.5B`](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B)
-- **Method:** TreeRPO (tree-structured GRPO; up to depth 7; branching by entropy & length)
+- **Method:** TreeRPO (tree-structured GRPO;)
 - **Reward signal:** Deterministic exact-match checker (binary). Interior node rewards = mean descendant leaf rewards.
 - **Domain:** Grade-school and intermediate math word problems (GSM8K style)
@@ -45,7 +45,7 @@ Open-ended or unsafe dialog, general factual QA, or high-stakes applications.
 | Model                          | Greedy (%) | Maj@8 (%) | Notes                                |
 |---------------------------------|------------|-----------|--------------------------------------|
 | Qwen2.5-Math-1.5B-Instruct      | 84.8       | 89.5      | Reported settings                    |
-| **TreeRPO-Qwen2.5-Math-1.5B**   | **86.4**   | **89.6**  | Same decoding (temp 0 / (0.7, 0.8))  |
+| **Qwen2.5-Math-1.5B-TreeRPO**   | **86.4**   | **89.6**  | Same decoding (temp 0 / (0.7, 0.8))  |
 - **Greedy:** temperature = 0 (deterministic)
 - **Maj@8:** 8 completions (temperature 0.7, top-p 0.8); majority vote on final boxed answer
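
The README lines in this diff describe two small computations: interior node rewards as the mean of descendant leaf rewards, and Maj@8 as a majority vote over the final boxed answer. The following is a minimal sketch of both, not the author's implementation. The `Node` class, the function names, and the `\boxed{...}` regex are hypothetical and introduced only for illustration. It assumes "mean descendant leaf rewards" means an unweighted mean over all leaves under a node, and that boxed answers contain no nested braces.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; leaves carry a binary reward from the checker."""
    leaf_reward: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the rewards of every leaf at or below `node`."""
    if not node.children:
        return [node.leaf_reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def node_reward(node: Node) -> float:
    """Assumed reading: unweighted mean over all descendant leaf rewards."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


# Simplification: does not handle nested braces such as \boxed{\frac{1}{2}}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def final_boxed_answer(completion: str) -> Optional[str]:
    """Return the last \\boxed{...} payload in a completion, or None."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def majority_vote(completions: List[str]) -> Optional[str]:
    """Most common final boxed answer; completions with no boxed answer are skipped."""
    answers = [a for a in map(final_boxed_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

With this reading, a node with one correct leaf and a subtree of two incorrect leaves gets reward 1/3 rather than 1/2; averaging child means instead would give the latter, and the README does not say which is intended.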