Commit 1ab3b9f · Parent: efde5e9
Update v2 results

README.md CHANGED
@@ -57,7 +57,8 @@ Table 1: Performance (pass@1) comparison for benchmarks across Math domain.
 | DeepSeek-R1-Distill-Qwen-1.5B | 28.54 | 22.71 | 62.58 | 82.90 | 26.38 | 43.58 | 44.45 |
 | DeepScaleR-1.5B | 40.21 | 31.46 | 73.04 | 89.36 | 41.57 | 51.63 | 54.54 |
 | *DeepSeek-R1-Distill-Qwen-7B* | 53.54 | 40.83 | 82.83 | 93.68 | 50.60 | 57.66 | 63.19 |
-| **Nemotron-Research-Reasoning-Qwen-1.5B** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B** | 48.13 | 33.33 | 79.29 | 91.89 | 47.98 | 60.22 | 60.14 |
+| **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | **49.58** | **36.04** | **82.53** | **92.49** | **49.03** | **60.44** | **61.69** |

 Table 2: Performance (pass@1) comparison across benchmarks for Code. We abbreviate benchmark names for codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
 | Model | apps | cc | cf | taco | human | LCB | Avg |
@@ -65,19 +66,21 @@ Table 2: Performance (pass@1) comparison across benchmarks for Code. We abbreviate benchmark names for codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
 | DeepSeek-R1-Distill-Qwen-1.5B | 20.95 | 16.79 | 14.13 | 8.03 | 61.77 | 16.80 | 23.08 |
 | DeepCoder-1.5B | 30.37 | 23.76 | 21.70 | 13.76 | 73.40 | 22.76 | 30.96 |
 | *DeepSeek-R1-Distill-Qwen-7B* | 42.08 | 32.76 | 33.08 | 19.08 | 83.32 | 38.04 | 41.39 |
-| **Nemotron-Research-Reasoning-Qwen-1.5B** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B** | 41.99 | 31.80 | 34.50 | 20.81 | 72.05 | 23.81 | 37.49 |
+| **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | **46.39** | **35.59** | **40.75** | **22.89** | 72.89 | **27.69** | **41.03** |

 Table 3: Performance comparison on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym) tasks. We also present results on OOD tasks: acre, boxnet, and game_of_life_halting (game).
 | Model | GPQA | IFEval | Reasoning | acre | boxnet | game |
 |-------------------------------|--------|--------|-----------|--------|--------|--------|
 | DeepSeek-R1-Distill-Qwen-1.5B | 15.86 | 44.05 | 4.24 | 5.99 | 0.00 | 3.49 |
 | *DeepSeek-R1-Distill-Qwen-7B* | 35.44 | 58.01 | 28.55 | 20.21 | 1.71 | 12.94 |
-| **Nemotron-Research-Reasoning-Qwen-1.5B** | **41.78** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B** | **41.78** | 66.02 | 59.06 | **58.57** | **7.91** | **52.29** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | 41.32 | **70.85** | **62.49** | - | - | - |


 ## Nemotron-Research-Reasoning-Qwen-1.5B-v2

-In the wake of the release of Nemotron-Research-Reasoning-Qwen-1.5B, we
+In the wake of the release of Nemotron-Research-Reasoning-Qwen-1.5B, we scaled the training steps from 2000 to 3000, resulting in Nemotron-Research-Reasoning-Qwen-1.5B-v2.
 Nemotron-Research-Reasoning-Qwen-1.5B-v2 builds on top of the REINFORCE++-baseline with dynamic sampling and clip-higher, and introduces several critical enhancements, such as periodically refreshing the reference model with the current best checkpoint and imposing the length penalty only in scheduled cycles.
 Together, these techniques allow model performance to continually improve with more RL training steps and expand LLMs' reasoning boundaries.
 Our latest checkpoint, Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained for 3000 steps, sets a new state-of-the-art (SOTA) among 1.5B reasoning models.
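For readers unfamiliar with the terms in the added paragraph: clip-higher and dynamic sampling come from recent DAPO/GRPO-style RL recipes. The sketch below is a minimal illustration, not the project's released training code; the epsilon values and the group-filtering rule are assumptions.

```python
import torch

def clip_higher_loss(logp, old_logp, adv, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate loss with asymmetric (clip-higher) bounds.

    Raising the upper bound (eps_high > eps_low) lets low-probability
    tokens gain probability mass faster, which helps sustain exploration
    over long RL runs. The epsilon values here are illustrative only.
    """
    ratio = torch.exp(logp - old_logp)          # per-token importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # Standard pessimistic bound: take the smaller objective per token.
    return -torch.minimum(unclipped, clipped).mean()

def keep_group(rewards):
    """Dynamic sampling (assumed DAPO-style): drop prompts whose sampled
    rollout group is all-correct or all-wrong, since such groups yield
    zero advantage and hence no gradient signal."""
    mean = sum(rewards) / len(rewards)
    return 0.0 < mean < 1.0
```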
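The two v2-specific enhancements (reference-model refresh and a cyclic length penalty) can likewise be sketched as training-loop hooks. Everything here is hypothetical: `REFRESH_INTERVAL`, `PENALTY_PERIOD`, `alpha`, and the scoring logic are stand-ins, since the commit describes the ideas but not the implementation.

```python
import copy
import torch

REFRESH_INTERVAL = 200   # assumed: check for a reference refresh every N steps
PENALTY_PERIOD = 100     # assumed: length penalty toggles on/off each period

def scheduled_length_penalty(step: int, response_len: int,
                             max_len: int = 8192, alpha: float = 0.1) -> float:
    """Negative reward-shaping term active only during 'on' cycles, so the
    model is not pressured to shorten responses at every training step."""
    if (step // PENALTY_PERIOD) % 2 == 0:   # 'off' cycle: no penalty
        return 0.0
    return -alpha * min(1.0, response_len / max_len)

def maybe_refresh_reference(step: int, policy: torch.nn.Module,
                            ref: torch.nn.Module,
                            cur_score: float, best_score: float) -> float:
    """Periodically reset the KL reference model to the current best
    checkpoint, so the KL penalty anchors training near recent progress
    rather than the original distilled weights."""
    if step % REFRESH_INTERVAL == 0 and cur_score > best_score:
        ref.load_state_dict(copy.deepcopy(policy.state_dict()))
        return cur_score
    return best_score
```

Both hooks would typically run once per optimizer step, with `cur_score` coming from a held-out validation pass.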
|