Update README.md
README.md
CHANGED
@@ -67,6 +67,7 @@ We report measured benchmark results for our R1T2, R1T models and published benc
 | AIME-25 | 70.0 | 58.3 | 49.6 | 70.0 | 87.5 | | V3-0324 AIME-25 measured by us |
 | GPQA-Diamond | 77.9 | 72.0 | 68.4 | 71.5 | 81.0 | | |
 | Aider Polyglot | 64.4 | 48.4 | 44.9 | 52.0 | 71.6 | R1T2 beats two of its parents, V3-0324 and R1, and was measured to be about 2.2 times more token efficient, i.e. faster, than its third parent, R1-0528 | R1T2 source: Aider discord, t=0.75 |
+| MMLU-Pro Computer Science | 83.7-85.6 | 82.9-84.6 | 81.5-82.4 | 85.1-85.3 | 84.6-86.1 | | |
 | EQ-Bench Longform Creative Writing | 76.4 | ./. | 78.1 | 74.6 | 78.9 | EQ Bench version before August 8th, 2025 | see [EQ Bench](https://eqbench.com/creative_writing_longform.html) |
 | Vectara Hallucination Rate | 5.5 | ./. | 8.0 | 14.3 | 7.7 | lower hallucination rates are better, R1T2 is better than all its three parents | see [Hallucination Leaderboard](https://github.com/vectara/hallucination-leaderboard) |
 