shizhediao2 committed
Commit 1ab3b9f · Parent: efde5e9

Update v2 results

Files changed (1): README.md (+7, -4)
README.md CHANGED
@@ -57,7 +57,8 @@ Table 1: Performance (pass@1) comparison for benchmarks across Math domain.
 | DeepSeek-R1-Distill-Qwen-1.5B | 28.54 | 22.71 | 62.58 | 82.90 | 26.38 | 43.58 | 44.45 |
 | DeepScaleR-1.5B | 40.21 | 31.46 | 73.04 | 89.36 | 41.57 | 51.63 | 54.54 |
 | *DeepSeek-R1-Distill-Qwen-7B* | 53.54 | 40.83 | 82.83 | 93.68 | 50.60 | 57.66 | 63.19 |
-| **Nemotron-Research-Reasoning-Qwen-1.5B** | **48.13** | **33.33** | **79.29** | **91.89** | **47.98** | **60.22** | **60.14** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B** | 48.13 | 33.33 | 79.29 | 91.89 | 47.98 | 60.22 | 60.14 |
+| **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | **49.58** | **36.04** | **82.53** | **92.49** | **49.03** | **60.44** | **61.69** |
 
 Table 2: Performance (pass@1) comparison across benchmarks for Code. We abbreviate benchmark names for codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
 | Model | apps | cc | cf | taco | human | LCB | Avg |
@@ -65,19 +66,21 @@ Table 2: Performance (pass@1) comparison across benchmarks for Code. We abbreviate benchmark names for codecontests (cc), codeforces (cf), humanevalplus (human), and livecodebench (LCB).
 | DeepSeek-R1-Distill-Qwen-1.5B | 20.95 | 16.79 | 14.13 | 8.03 | 61.77 | 16.80 | 23.08 |
 | DeepCoder-1.5B | 30.37 | 23.76 | 21.70 | 13.76 | 73.40 | 22.76 | 30.96 |
 | *DeepSeek-R1-Distill-Qwen-7B* | 42.08 | 32.76 | 33.08 | 19.08 | 83.32 | 38.04 | 41.39 |
-| **Nemotron-Research-Reasoning-Qwen-1.5B** | **41.99** | **31.80** | **34.50** | **20.81** | 72.05 | **23.81** | **37.49** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B** | 41.99 | 31.80 | 34.50 | 20.81 | 72.05 | 23.81 | 37.49 |
+| **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | **46.39** | **35.59** | **40.75** | **22.89** | 72.89 | **27.69** | **41.03** |
 
 Table 3: Performance comparison on STEM reasoning (GPQA Diamond), instruction following (IFEval), and logic puzzles (Reasoning Gym) tasks. We also present results on OOD tasks: acre, boxnet, and game_of_life_halting (game).
 | Model | GPQA | IFEval | Reasoning | acre | boxnet | game |
 |-------------------------------|--------|--------|-----------|--------|--------|--------|
 | DeepSeek-R1-Distill-Qwen-1.5B | 15.86 | 44.05 | 4.24 | 5.99 | 0.00 | 3.49 |
 | *DeepSeek-R1-Distill-Qwen-7B* | 35.44 | 58.01 | 28.55 | 20.21 | 1.71 | 12.94 |
-| **Nemotron-Research-Reasoning-Qwen-1.5B** | **41.78** | **66.02** | **59.06** | **58.57** | **7.91** | **52.29** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B** | **41.78** | 66.02 | 59.06 | **58.57** | **7.91** | **52.29** |
+| **Nemotron-Research-Reasoning-Qwen-1.5B-v2** | 41.32 | **70.85** | **62.49** | - | - | - |
 
 
 ## Nemotron-Research-Reasoning-Qwen-1.5B-v2
 
-In the wake of the release of Nemotron-Research-Reasoning-Qwen-1.5B, we did not halt training but continued for an additional 1000 steps, resulting in Nemotron-Research-Reasoning-Qwen-1.5B-v2.
+Following the release of Nemotron-Research-Reasoning-Qwen-1.5B, we scaled RL training from 2000 to 3000 steps, resulting in Nemotron-Research-Reasoning-Qwen-1.5B-v2.
 Nemotron-Research-Reasoning-Qwen-1.5B-v2 builds on top of REINFORCE++-baseline with dynamic sampling and clip-higher, and introduces several critical enhancements, such as periodically refreshing the reference model with the current best checkpoint and imposing the length penalty only in scheduled cycles.
 Together, these techniques allow model performance to continually improve with more RL training steps and expand LLMs' reasoning boundaries.
 Our latest checkpoint, Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained for 3000 steps, sets a new state-of-the-art (SOTA) among 1.5B reasoning models.
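
To make the method description above concrete, here is a minimal PyTorch-style sketch of how these pieces typically fit together. It is an illustration only, not the released training code: the function names, the epsilon values, the penalty cycle, and the refresh hook are assumptions made for exposition.

```python
import torch

def group_baseline_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE++-baseline flavor: `rewards` is [prompts, samples_per_prompt];
    subtracting each prompt group's mean reward serves as the baseline."""
    return rewards - rewards.mean(dim=1, keepdim=True)

def dynamic_sampling_mask(rewards: torch.Tensor) -> torch.Tensor:
    """Dynamic sampling: keep only prompts whose sampled group is neither
    all-correct nor all-wrong; uniform rewards yield zero advantage and
    therefore no gradient signal."""
    return rewards.std(dim=1) > 0

def clip_higher_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with an asymmetric upper bound ("clip-higher"):
    eps_high > eps_low lets low-probability tokens gain mass faster, which
    helps stave off entropy collapse in long RL runs. Epsilon values here
    are illustrative, not the recipe's actual hyperparameters."""
    ratio = torch.exp(logp_new - logp_old)          # pi_new / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.min(ratio * adv, clipped * adv).mean()

def length_penalty_active(step: int, cycle: int = 500, on_fraction: float = 0.5) -> bool:
    """Scheduled length penalty: apply the overlong-response penalty only for
    part of each cycle rather than continuously (cycle length is an assumption)."""
    return (step % cycle) < on_fraction * cycle

# Schematic hook for the remaining enhancement, refreshing the KL reference:
#   if step % refresh_every == 0:                                  # hypothetical
#       reference_model.load_state_dict(best_checkpoint.state_dict())
```

Intuitively, refreshing the reference model re-centers the KL constraint on the current best policy, which is what allows optimization to keep making progress out to 3000 steps instead of stalling against a stale anchor.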