Trouble Reproducing gemma-3-270m-it IFEval Score
I'm trying to verify my setup by reproducing the IFEval benchmark score for gemma-3-270m-it. The official score is 51.2%, but my accuracy only comes out between 20% and 27% across multiple runs.
I am using the following settings:
temperature=1.0
top_p=0.95
top_k=64
min_p=0.0
Am I missing something? I suspect there's a misconfiguration somewhere in my setup.
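For concreteness, here is roughly how those sampling settings map onto a plain transformers generation call (just a sketch to show the parameters; my actual numbers come from an eval harness, and the prompt here is only an example):

```python
# Sketch: the sampling settings above applied via Hugging Face transformers
# (illustrative only, not the exact evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "Write exactly three sentences about cats."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=1.0,  # settings from my run above
    top_p=0.95,
    top_k=64,
    min_p=0.0,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```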
+1. I can't reproduce the IFEval score either; my evaluation results come in around 26%.
I'm running with temperature=0.2 and it's somewhat better.
With
temperature=0.2
top_p=0.95
top_k=64
min_p=0.0
That got 27.9% on IFEval. It's a slight improvement, but there's still a large gap to the reported 51.2%.
I honestly don’t know—maybe try 0.0 or 0.1 😅. Good luck.
Try this?
temperature = 0.1 // less random token picking
top_p = 0.95
top_k = 64
min_p = 0.25 // raise the minimum-probability cutoff to filter out unlikely tokens
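For what it's worth, min_p is scaled by the probability of the most likely token, so raising it to 0.25 actually tightens the filter rather than loosening it. A tiny standalone illustration (plain Python, not library code):

```python
# Toy illustration of min_p filtering with min_p = 0.25: tokens whose
# probability falls below 0.25 * (probability of the most likely token)
# are dropped before sampling.
probs = {"the": 0.40, "a": 0.25, "cat": 0.12, "zebra": 0.03}
min_p = 0.25
cutoff = min_p * max(probs.values())  # 0.25 * 0.40 = 0.10
kept = {tok: p for tok, p in probs.items() if p >= cutoff}
print(kept)  # {'the': 0.4, 'a': 0.25, 'cat': 0.12} -- 'zebra' is filtered out
```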
I'm really curious about the IFEval score.
By the way, I’m using llama.cpp. I forgot to mention that last time.
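In case it helps anyone compare, the same sampling setup through the llama-cpp-python bindings looks roughly like this (a sketch; the GGUF filename is a placeholder and exact kwargs can differ between versions):

```python
# Sketch: the temperature=0.2 setup via llama-cpp-python
# ("gemma-3-270m-it-Q8_0.gguf" is a placeholder path, not a specific file).
from llama_cpp import Llama

llm = Llama(model_path="gemma-3-270m-it-Q8_0.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write exactly three sentences about cats."}],
    max_tokens=256,
    temperature=0.2,  # the value that nudged my IFEval result up slightly
    top_p=0.95,
    top_k=64,
    min_p=0.0,
)
print(out["choices"][0]["message"]["content"])
```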
I am also having trouble replicating the reported results. I am using the standard lm_eval harness (roughly the kind of call sketched below the table). I get the following results, and the biggest gap is in IFEval (on the inst_level_loose_acc metric).
Gemma 3 270M IT - Actual Results vs Google's Reported Baseline
| Benchmark | n-shot | Actual Results | Google Reported | Delta | Match Status |
|---|---|---|---|---|---|
| HellaSwag | 0-shot | 33.5% | 37.7% | -4.2% | ❌ Lower |
| PIQA | 0-shot | 65.6% | 66.2% | -0.6% | ✅ Close |
| ARC-c | 0-shot | 24.5% | 28.2% | -3.7% | ❌ Lower |
| WinoGrande | 0-shot | 53.2% | 52.3% | +0.9% | ✅ Close |
| BIG-Bench Hard | 3-shot | 26.8% | 26.7% | +0.1% | ✅ Match |
| IFEval (inst_level) | 0-shot | 37.7% | 51.2% | -13.5% | ⚠️ Gap |
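For reference, the kind of run I mean looks roughly like this through lm-eval's Python API (a sketch; argument names such as apply_chat_template can differ between lm-eval versions, so check against the release you have installed):

```python
# Sketch of an IFEval run via the lm-evaluation-harness Python API
# (argument names follow recent lm-eval releases; verify against your version).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-3-270m-it,dtype=bfloat16",
    tasks=["ifeval"],
    num_fewshot=0,
    batch_size=8,
    apply_chat_template=True,  # IFEval should be scored on the chat-formatted model
)
print(results["results"]["ifeval"])
```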