AutoBench Third Run: Revolutionizing LLM Evaluation with Record-Breaking Scale, Accuracy, and a New Home at autobench.org
Posted on August 20, 2025
The third public run of AutoBench has landed, setting a new benchmark for Large Language Model (LLM) evaluation with unmatched scale and precision. Ranking 33 models with over 300,000 individual ranks, it achieves correlations of up to 92% with leading benchmarks and delivers granular insights into model quality, cost, and speed – all fully automated and open-source. And as a surprise, we’ve launched autobench.org, your new hub for transparent AI benchmarking.
Building on our first run and second run, this release pushes the limits of LLM evaluation. Here, we’ll unpack the methodology, dive into the third run’s massive stats, highlight top performers and efficiency insights, unveil the new website, acknowledge our partners, and connect to the broader ecosystem, including Bot Scanner. Let’s get started!
The LLM Evaluation Crisis and Why AutoBench Matters
With thousands of LLMs flooding the AI landscape, choosing the right model is daunting. Traditional benchmarks are static, gameable, and often too broad to reveal domain-specific strengths. Human evaluations are slow, costly, and subjective, limiting scalability. AutoBench tackles these issues with its Collective-LLM-as-a-Judge methodology, using LLMs to dynamically generate questions, provide answers, and rank outputs. This creates an ungameable, scalable, and objective evaluation system, measuring performance against the AI ecosystem’s consensus.
The result? Accurate insights into LLM quality, real-world costs, and production speed, empowering developers, enterprises, and researchers to make informed choices and avoid costly errors in AI agent workflows.
How AutoBench Works: A Quick Methodology Recap
AutoBench’s automated, iterative workflow ensures robustness and statistical significance:
- Dynamic Question Generation: Randomly select a topic (e.g., Math, Coding, History) and difficulty. An LLM generates a unique question.
- Quality Control: Other LLMs rank the question for clarity, relevance, and difficulty, requiring a high threshold (e.g., 4.3/5 average) to proceed.
- Parallel Answer Generation: All models generate answers simultaneously.
- Collective Ranking: Every answer is scored (1-5) by all judge LLMs for correctness, clarity, and relevance, yielding thousands of evaluations.
- Weighted Aggregation: Ranks are combined, with consistent judges gaining more influence for reliable final scores.
This cycle repeats hundreds of times, producing aggregate and domain-specific ranks, plus efficiency metrics like cost per answer, average duration, and P99 latency. It’s all open-source and customizable – explore the details on autobench.org or our Hugging Face Space.
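To make the loop concrete, here is a minimal Python sketch of a single iteration. It is illustrative only, not the production code: `call_llm(model, prompt)` is an assumed placeholder for whatever chat-completion API you use, the judge weights are assumed to come from a prior consistency estimate, and the 4.3/5 question gate mirrors the example threshold above.

```python
# Illustrative sketch of one AutoBench-style iteration (not the official implementation).
# `call_llm(model, prompt) -> str` is an assumed placeholder for any chat-completion API.
import random
import re
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

TOPICS = ["Math", "Coding", "History", "Logic"]
DIFFICULTIES = ["basic", "intermediate", "advanced"]
QUESTION_THRESHOLD = 4.3  # example quality gate from the post (average judge score)

def parse_score(text: str) -> float:
    """Pull the first number out of a judge's reply; fall back to the midpoint."""
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 3.0

def run_iteration(models, judges, judge_weights, call_llm):
    # 1. Dynamic question generation: random topic + difficulty, one LLM writes the question.
    topic, difficulty = random.choice(TOPICS), random.choice(DIFFICULTIES)
    question = call_llm(random.choice(judges),
                        f"Write one {difficulty}, self-contained {topic} question.")

    # 2. Quality control: judges score the question 1-5; discard it if below the threshold.
    q_scores = [parse_score(call_llm(j, "Rate this question 1-5 for clarity, "
                                        f"relevance and difficulty:\n{question}"))
                for j in judges]
    if mean(q_scores) < QUESTION_THRESHOLD:
        return None  # question rejected; the next iteration tries a fresh one

    # 3. Parallel answer generation: every ranked model answers the same question.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = dict(zip(models, pool.map(lambda m: call_llm(m, question), models)))

    # 4. Collective ranking: every judge scores every answer 1-5.
    raw_scores = {m: {j: parse_score(call_llm(j, "Score this answer 1-5 for correctness, "
                                                 f"clarity and relevance.\nQ: {question}\nA: {a}"))
                      for j in judges}
                  for m, a in answers.items()}

    # 5. Weighted aggregation: more consistent judges carry more weight in the final score.
    total_w = sum(judge_weights.values())
    return {m: sum(judge_weights[j] * s for j, s in per_judge.items()) / total_w
            for m, per_judge in raw_scores.items()}
```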
Third Run: Unmatched Scale and Insights
The third run, completed in August 2025, is our most ambitious yet:
- Models Ranked: 33, spanning top providers like OpenAI, Google, Anthropic, and Alibaba.
- LLM Rankers (Judges): 24, ensuring diverse and robust judgments.
- Iterations (Generated Questions): 410, each a novel challenge across domains.
- Unique Answers Generated: ~13,000.
- Individual Ranks Collected: ~300,000.
- Tokens Processed: ~200 million output tokens and ~700 million input tokens.
This massive dataset, processed automatically, underscores AutoBench’s ability to handle the LLM explosion with precision and efficiency.
Validation: Industry-Leading Correlations
Chart: AutoBench correlations with leading benchmarks.
Our third run’s correlations with established benchmarks confirm AutoBench’s reliability:
- Artificial Analysis Intelligence Index (AAII): 92.17% – Near-perfect alignment.
- LMSYS Chatbot Arena (Human Preference): 86.85% – Strong agreement with ELO scores.
- MMLU-Plus: 75.44% – Robust for knowledge-intensive tasks.
Correlations at this level are the gold standard, proving our methodology captures true model capabilities without the pitfalls of static datasets or human bias.
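For readers who want to sanity-check this kind of comparison themselves, the snippet below shows one way a correlation between AutoBench ranks and an external benchmark could be computed. It assumes a simple Pearson correlation and uses made-up placeholder numbers for illustration – the real per-model data is in the Hugging Face repository.

```python
# Illustrative correlation check between two benchmark score lists.
# The values below are hypothetical placeholders, not actual third-run data.
import numpy as np

autobench = np.array([4.51, 4.42, 4.38, 4.21, 4.05])   # hypothetical AutoBench average ranks
external  = np.array([68.0, 66.5, 64.0, 60.2, 55.1])   # hypothetical external benchmark scores

pearson = np.corrcoef(autobench, external)[0, 1]
print(f"Pearson correlation: {pearson:.2%}")
```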
Leaderboard Highlights: Top Models and Surprises
The results showcase fierce competition and unexpected standouts:
- OpenAI dominates the top three spots – GPT-5 leads with a 4.5116 average rank, followed by GPT-5-mini and GPT-OSS-120B.
- Gemini 2.5 Pro (Google) – 4.4169, excelling in creative and nuanced tasks.
- Qwen 3 235B A22B Thinking 2507 (Alibaba) – Strong contender, especially in reasoning-heavy domains.
Outlier Alert: Open-source GPT-OSS-120B reaches state-of-the-art levels, democratizing high performance.
Domain-specific insights reveal more:
- Logic: Open-source models like GPT-OSS-120B outperform heavier counterparts while remaining far more efficient.
- Math: GPT-5 leads, leveraging advanced reasoning modes.
- Coding: Kimi K2 is a strong challenger to GPT-5’s dominance.
Explore these nuances on our interactive leaderboard – filter by domain, sort by metrics, and compare runs.
Chart: AutoBench average ranks for the most common models.
Efficiency Exposed: Cost and Speed Matter
AutoBench goes beyond performance to deliver real-world metrics:
- Cost per Answer: Full API call costs, not just tokens. Open-source models offer top value, balancing quality and affordability.
- Latency: Average and P99 durations for production use. Lighter models clock sub-second responses.
- Trade-Offs: Log-scale graphs visualize quality vs. cost/latency, helping enterprises save 20%+ by choosing task-specific models.
These insights are critical for AI agents, where even minor inefficiencies can cascade into failures.
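As an illustration of how such metrics are derived, the sketch below computes cost per answer (full input plus output token cost) and average / P99 latency from a handful of per-call records. The prices and timings are hypothetical assumptions, not values from the run.

```python
# Sketch of the efficiency metrics reported per model: cost per answer and latency.
# All prices and call records below are hypothetical placeholders.
import numpy as np

# per-call records: (input_tokens, output_tokens, duration_seconds)
calls = [(1200, 650, 3.1), (980, 720, 2.8), (1500, 900, 6.4), (1100, 500, 2.2)]

PRICE_IN_PER_M = 1.25    # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 10.00  # assumed $ per 1M output tokens

costs = [(i * PRICE_IN_PER_M + o * PRICE_OUT_PER_M) / 1_000_000 for i, o, _ in calls]
durations = np.array([d for _, _, d in calls])

print(f"cost per answer: ${np.mean(costs):.4f}")
print(f"average latency: {durations.mean():.2f}s")
print(f"P99 latency:     {np.percentile(durations, 99):.2f}s")
```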
Scatter plot: Models by cost vs. rank, highlighting value leaders.
The Big Reveal: autobench.org is Live!
The third run’s scale was anticipated, but here’s the surprise: autobench.org is now live! This is your hub for:
- Interactive leaderboards with sortable metrics and domain filters.
- Methodology breakdowns and customizable benchmark guides.
- Enterprise services for tailored evals (e.g., medical diagnostics, legal analysis).
- Updates, blogs, and contact forms for consultations.
Run AutoBench yourself or reach out for custom solutions. This launch makes our mission of transparent AI evaluation more accessible than ever.
Screenshot: Hero section of the new site.
Gratitude to Our Partners and Ecosystem
This run was powered by an incredible network:
- Translated and Marco Trombetti: Compute credits, insights, and website development.
- DIAG, University of Rome La Sapienza: Team led by Fabrizio Silvestri, for rigorous scientific validation.
- eZecute: Our parent company, with 30+ investments and global expertise in AI and agrifood-tech. Learn more at eZecute.com.
AutoBench ties into our ecosystem, including Bot Scanner (botscanner.ai, @BotScanner_AI). Dubbed the "Skyscanner of LLM responses," it ranks answers from 40+ models in real-time using AutoBench’s methodology. This run leveraged Bot Scanner’s API for efficiency – try it with $3 free credits.
Open-Source and Community Power
AutoBench is proudly open-source. Contributors are already shaping its future – join them! Full data, samples, and methodology are on Hugging Face. Fork the third run, run your own benchmarks, or add models.
More info and data in our Hugging Face Repository.
Engage with us on Hugging Face, X (@pwk), or the new website. Let’s build the future of AI evaluation together.
Conclusion: Join the Paradigm Shift
The third run cements AutoBench as a leader in LLM evaluation: 33 models, 300k ranks, 92%+ correlations, and actionable insights into performance and efficiency. In a chaotic AI landscape, we provide clarity for developers, enterprises, and researchers.
What’s your take? Comment below!
Visit autobench.org, explore the leaderboard, or reach out on our website contact form for custom evals. Let’s make AI transparent together.
#AutoBench #LLM #AIBenchmarking #OpenSourceAI #HuggingFace #Leaderboards @Benchmarks