Update README.md
README.md
@@ -28,7 +28,7 @@ A 1.5B parameter math reasoning model fine-tuned with **TreeRPO**, a hierarchica
 
 ## Model Details
 - **Base model:** [`Qwen/Qwen2.5-Math-1.5B`](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B)
-- **Method:** TreeRPO (tree-structured GRPO;
+- **Method:** TreeRPO (tree-structured GRPO;)
 - **Reward signal:** Deterministic exact-match checker (binary). Interior node rewards = mean descendant leaf rewards.
 - **Domain:** Grade-school and intermediate math word problems (GSM8K style)
 
@@ -45,7 +45,7 @@ Open-ended or unsafe dialog, general factual QA, or high-stakes applications.
 | Model | Greedy (%) | Maj@8 (%) | Notes |
 |---------------------------------|------------|-----------|--------------------------------------|
 | Qwen2.5-Math-1.5B-Instruct | 84.8 | 89.5 | Reported settings |
-| **
+| **Qwen2.5-Math-1.5B-TreeRPO** | **86.4** | **89.6** | Same decoding (temp 0 / (0.7, 0.8)) |
 
 - **Greedy:** temperature = 0 (deterministic)
 - **Maj@8:** 8 completions (temperature 0.7, top-p 0.8); majority vote on final boxed answer
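The README lines in this diff describe two mechanisms that a short example may make more concrete: interior node rewards computed as the mean of descendant leaf rewards, and Maj@8 scoring by majority vote on the final boxed answer. The following is a minimal sketch, not the model's actual training or evaluation code. `Node`, `node_reward`, `extract_boxed`, and `majority_vote` are hypothetical names introduced here. It assumes "exact match" means string equality after stripping whitespace, and that ties in the vote go to the answer seen first; the README specifies neither.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a leaf carries a final answer, an interior node carries children."""
    answer: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node, gold: str) -> List[float]:
    """Collect the binary exact-match rewards of every leaf under `node`."""
    if not node.children:
        # Assumption: exact match is string equality after stripping whitespace.
        matched = node.answer is not None and node.answer.strip() == gold.strip()
        return [1.0 if matched else 0.0]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child, gold))
    return rewards


def node_reward(node: Node, gold: str) -> float:
    """Reward of a node = mean reward over all of its descendant leaves."""
    rewards = leaf_rewards(node, gold)
    return sum(rewards) / len(rewards)


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars: List[str] = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def majority_vote(completions: List[str]) -> Optional[str]:
    """Most common boxed answer across completions; completions without one are ignored."""
    answers = [a for a in map(extract_boxed, completions) if a is not None]
    if not answers:
        return None
    # Assumption: ties go to the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]
```

For a tree whose root has one interior child with leaves answering `42` and `41`, plus a second leaf child answering `42`, `node_reward(root, "42")` averages the three leaf rewards (1, 0, 1), while the interior child alone averages its two leaves. For Maj@8, `majority_vote` would be called on the 8 sampled completions and its result compared against the reference answer.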