AIME 25 Accuracy Discrepancy for GPT-OSS-20B (Reasoning Effort=High)
Thank you very much for open-sourcing such a powerful large language model.
I’ve noticed that the community has had difficulty reproducing the results reported in your paper. On AIME 25, the paper states that GPT-OSS-20B (no tools, reasoning effort set to “high”) achieves 91.7% accuracy, whereas our reproduction with vLLM (set up roughly as in the sketch below) only reaches 85.8%. Do you have any suggestions to help us replicate your evaluation results?
Reference link: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use
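For context, this is roughly how we queried the served model. It's a minimal sketch, not the official harness: the sampling parameters and the `reasoning_effort` field are our assumptions, so adjust them if your setup differs.

```python
# Minimal sketch of our reproduction setup; parameter values here are our
# assumptions, not the official eval configuration. The model is served with
# `vllm serve openai/gpt-oss-20b` and queried via the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def solve(problem: str) -> str:
    """Sample one answer to an AIME problem at high reasoning effort."""
    response = client.chat.completions.create(
        model="openai/gpt-oss-20b",
        messages=[{"role": "user", "content": problem}],
        temperature=1.0,   # assumed sampling settings
        max_tokens=65536,  # generous budget so long reasoning isn't truncated
        # We assume the server forwards this field to select the "high"
        # reasoning-effort mode; this may differ across vLLM versions.
        extra_body={"reasoning_effort": "high"},
    )
    return response.choices[0].message.content
```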
Hey :) We published all of our eval code here https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals
Hi @dkundel-openai,
Thank you for providing the eval scripts. I noticed in the published paper (https://arxiv.org/pdf/2508.10925v1) that the reported generation length for the AIME result of gpt-oss-20b at high reasoning effort is around 20k tokens.
But when I followed your eval scripts to run the AIME inference and computed statistics on the generated outputs, I got a mean of only 782 tokens across the 30 × 8 = 240 samples (roughly as in the sketch below).
I don't know what is causing this large discrepancy. Is there anything I missed?
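For reference, here is roughly how I computed that statistic. The file name and field are my assumptions about how the per-sample token counts were logged, not the eval scripts' actual output format:

```python
# Sketch of the mean-length computation above. We assume each sample's
# completion token count (e.g., usage.completion_tokens from the API response)
# was dumped to a JSONL file; the file name and field are illustrative.
import json

lengths = []
with open("aime25_samples.jsonl") as f:  # hypothetical dump of the 240 samples
    for line in f:
        sample = json.loads(line)
        lengths.append(sample["completion_tokens"])

assert len(lengths) == 30 * 8  # 30 problems x 8 samples each
print(f"mean generation length: {sum(lengths) / len(lengths):.1f} tokens")
```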