Commit c0f05b5 · verified · committed by TNGHK · 1 parent: 2827249

Update README.md

  1. README.md +1 -0
README.md CHANGED
@@ -67,6 +67,7 @@ We report measured benchmark results for our R1T2, R1T models and published benc
  | AIME-25 | 70.0 | 58.3 | 49.6 | 70.0 | 87.5 | | V3-0324 AIME-25 measured by us |
  | GPQA-Diamond | 77.9 | 72.0 | 68.4 | 71.5 | 81.0 | | |
  | Aider Polyglot | 64.4 | 48.4 | 44.9 | 52.0 | 71.6 | R1T2 beats two of its parents, V3-0324 and R1, and was measured to be about 2.2 times more token efficient, i.e. faster, than its third parent, R1-0528 | R1T2 source: Aider discord, t=0.75 |
+ | MMLU-Pro Computer Science | 83.7-85.6 | 82.9-84.6 | 81.5-82.4 | 85.1-85.3 | 84.6-86.1 | | |
  | EQ-Bench Longform Creative Writing | 76.4 | ./. | 78.1 | 74.6 | 78.9 | EQ Bench version before August 8th, 2025 | see [EQ Bench](https://eqbench.com/creative_writing_longform.html) |
  | Vectara Hallucination Rate | 5.5 | ./. | 8.0 | 14.3 | 7.7 | lower hallucination rates are better, R1T2 is better than all its three parents | see [Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard) |