Trouble Reproducing gemma-3-270m-it IFEval Score
I'm trying to verify my setup by reproducing the IFEval benchmark score for gemma-3-270m-it. The official score is 51.2%, but my accuracy only comes out between 20% and 27% across multiple runs.
I am using the following settings:
temperature=1.0
top_p=0.95
top_k=64
min_p=0.0
Am I missing something? I suspect there's a misconfiguration somewhere in my setup.
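For concreteness, here is roughly how those sampling settings map onto a plain transformers generation call (just a sketch to show the parameters; my actual numbers come from an eval harness, and the prompt here is only an example):

```python
# Sketch: the sampling settings above applied via Hugging Face transformers
# (illustrative only, not the exact evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Write exactly three sentences about cats."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.0,  # settings from my run above
    top_p=0.95,
    top_k=64,
    min_p=0.0,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```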
+1. I can't reproduce the IFEval score either; my evaluation results come in around 26%.
I'm running with temperature=0.2 and it's somewhat better.
With
temperature=0.2
top_p=0.95
top_k=64
min_p=0.0
That got 27.9% on IFEval. It's a slight improvement, but there's still a large gap to the reported 51.2%.
I honestly don’t know—maybe try 0.0 or 0.1 😅. Good luck.
Try this?
temperature = 0.1 // less random token picking
top_p = 0.95
top_k = 64
min_p = 0.25 // raise the minimum-probability cutoff to filter out unlikely tokens
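For what it's worth, min_p is scaled by the probability of the most likely token, so raising it to 0.25 actually tightens the filter rather than loosening it. A tiny standalone illustration (plain Python, not library code):

```python
# Toy illustration of min_p filtering with min_p = 0.25: tokens whose
# probability falls below 0.25 * (probability of the most likely token)
# are dropped before sampling.
probs = {"the": 0.40, "a": 0.25, "cat": 0.12, "zebra": 0.03}
min_p = 0.25
cutoff = min_p * max(probs.values())  # 0.25 * 0.40 = 0.10
kept = {tok: p for tok, p in probs.items() if p >= cutoff}
print(kept)  # {'the': 0.4, 'a': 0.25, 'cat': 0.12} -- 'zebra' is filtered out
```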
I'm really curious about the IFEval score.
By the way, I’m using llama.cpp. I forgot to mention that last time.
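In case it helps anyone compare, the same sampling setup through the llama-cpp-python bindings looks roughly like this (a sketch; the GGUF filename is a placeholder and exact kwargs can differ between versions):

```python
# Sketch: the temperature=0.2 setup via llama-cpp-python
# ("gemma-3-270m-it-Q8_0.gguf" is a placeholder path, not a specific file).
from llama_cpp import Llama

llm = Llama(model_path="gemma-3-270m-it-Q8_0.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write exactly three sentences about cats."}],
    max_tokens=256,
    temperature=0.2,  # the value that nudged my IFEval result up slightly
    top_p=0.95,
    top_k=64,
    min_p=0.0,
)
print(out["choices"][0]["message"]["content"])
```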
I am also having trouble replicating the reported results. I am using the standard lm_eval harness (roughly the kind of call sketched below the table). I get the following results, and the biggest gap is in IFEval (on the inst_level_loose_acc metric).
Gemma 3 270M IT - Actual Results vs Google's Reported Baseline
| Benchmark | n-shot | Actual Results | Google Reported | Delta | Match Status |
|---|---|---|---|---|---|
| HellaSwag | 0-shot | 33.5% | 37.7% | -4.2% | ❌ Lower |
| PIQA | 0-shot | 65.6% | 66.2% | -0.6% | ✅ Close |
| ARC-c | 0-shot | 24.5% | 28.2% | -3.7% | ❌ Lower |
| WinoGrande | 0-shot | 53.2% | 52.3% | +0.9% | ✅ Close |
| BIG-Bench Hard | 3-shot | 26.8% | 26.7% | +0.1% | ✅ Match |
| IFEval (inst_level) | 0-shot | 37.7% | 51.2% | -13.5% | ⚠️ Gap |
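For reference, the kind of run I mean looks roughly like this through lm-eval's Python API (a sketch; argument names such as apply_chat_template can differ between lm-eval versions, so check against the release you have installed):

```python
# Sketch of an IFEval run via the lm-evaluation-harness Python API
# (argument names follow recent lm-eval releases; verify against your version).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-3-270m-it,dtype=bfloat16",
    tasks=["ifeval"],
    num_fewshot=0,
    batch_size=8,
    apply_chat_template=True,  # IFEval should be scored on the chat-formatted model
)
print(results["results"]["ifeval"])
```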