Trouble Reproducing gemma-3-270m-it IFEval Score

#4 opened by fongya

I'm trying to verify my setup by reproducing the IFEval benchmark score for gemma-3-270m-it. The official score is 51.2%, but my accuracy only lands between 20% and 27% across multiple runs.

I am using the following settings:

  • temperature=1.0
  • top_p=0.95
  • top_k=64
  • min_p=0.0

Am I missing something? I suspect there's a misconfiguration somewhere in my setup.
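
For concreteness, these settings map onto a `transformers` `generate()` call roughly like this. This is a minimal sketch, assuming a transformers version new enough to support Gemma 3 and the `min_p` sampling parameter; the prompt is just a placeholder:

```python
# Sketch: sampling settings from the list above applied via transformers.
# Assumes transformers with Gemma 3 and min_p support; prompt is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Answer in exactly two sentences: why is the sky blue?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

out = model.generate(
    inputs,
    do_sample=True,   # required; without it temperature/top_p/top_k/min_p are ignored
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    min_p=0.0,
    max_new_tokens=256,
)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that `do_sample=True` is what activates all four sampling knobs; under greedy decoding they have no effect.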

+1. I can't reproduce the IFEval score either; my evaluation results hover around 26%.

I'm running with temperature=0.2 and getting better results.

With

  • temperature=0.2
  • top_p=0.95
  • top_k=64
  • min_p=0.0

it got 27.9% on IFEval. That's a slight improvement, but still far from the reported 51.2%.

I honestly don’t know—maybe try 0.0 or 0.1 😅. Good luck.

Try this?

temperature = 0.1 // less random token picking
top_p = 0.95  
top_k = 64  
min_p = 0.25 // higher minimum-probability floor, prunes low-likelihood tokens

I'm really curious about the IFEval score.

By the way, I’m using llama.cpp. I forgot to mention that last time.
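
For anyone on the same stack, here is a minimal sketch of the settings suggested above going through llama-cpp-python (the Python bindings for llama.cpp). The GGUF filename and the prompt are hypothetical placeholders:

```python
# Sketch: suggested sampling settings via llama-cpp-python.
# The model filename and prompt are placeholders, not the poster's actual setup.
from llama_cpp import Llama

llm = Llama(model_path="gemma-3-270m-it-Q8_0.gguf")  # hypothetical GGUF file

out = llm(
    "List exactly three facts about the moon.",
    temperature=0.1,  # less random token picking
    top_p=0.95,
    top_k=64,
    min_p=0.25,       # higher probability floor: drops low-likelihood tokens
    max_tokens=128,
)
print(out["choices"][0]["text"])
```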

I am also having trouble replicating the reported results. I am using the standard lm_eval harness. I get the results below, and the biggest gap is in IFEval (using the inst_level_loose_acc metric); a sketch of the invocation follows the table.

Gemma 3 270M IT - Actual Results vs Google's Reported Baseline

| Benchmark | n-shot | Actual Results | Google Reported | Delta | Match Status |
|---|---|---|---|---|---|
| HellaSwag | 0-shot | 33.5% | 37.7% | -4.2% | ❌ Lower |
| PIQA | 0-shot | 65.6% | 66.2% | -0.6% | ✅ Close |
| ARC-c | 0-shot | 24.5% | 28.2% | -3.7% | ❌ Lower |
| WinoGrande | 0-shot | 53.2% | 52.3% | +0.9% | ✅ Close |
| BIG-Bench Hard | 3-shot | 26.8% | 26.7% | +0.1% | ✅ Match |
| IFEval (inst_level) | 0-shot | 37.7% | 51.2% | -13.5% | ⚠️ Gap |
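
A minimal sketch of how such a run can be launched through the harness's Python API, assuming a recent lm-evaluation-harness (0.4.x) where `simple_evaluate` accepts `apply_chat_template`. The chat-template flag is an assumption on my part, but it's worth checking, since whether the chat template gets applied can move instruct-model scores significantly:

```python
# Sketch: IFEval run via the lm-evaluation-harness Python API.
# Assumes lm_eval 0.4.x with apply_chat_template support.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-3-270m-it",
    tasks=["ifeval"],
    apply_chat_template=True,  # instruct models are sensitive to this
)

# Prints all IFEval metrics, including the inst_level_loose_acc used above.
print(results["results"]["ifeval"])
```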
