OpenEvals (community)
AI & ML interests: LLM evaluation

A small overview of our research collabs through the years:
- GAIA: a benchmark for General AI Assistants • Paper 2311.12983 • 242 upvotes
- Zephyr: Direct Distillation of LM Alignment • Paper 2310.16944 • 123 upvotes
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model • Paper 2502.02737 • 252 upvotes
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation • Paper 2412.03304 • 21 upvotes

The original Open LLM Leaderboard evaluated 7K LLMs from Apr 2023 to Jun 2024 on ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.
Spaces:
- Find a leaderboard 🔍 (126 likes) • Explore and discover all leaderboards from the HF community
- YourBench 🚀 (44 likes) • Generate custom evaluations from your data easily!
- Example Leaderboard Template 🥇 (16 likes) • Duplicate this leaderboard to initialize your own!
- Run your LLM evaluations on the hub 🐢 (2 likes) • Generate a command to run model evaluations; a sketch of such a run follows below
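As a rough illustration of the kind of evaluation such a tool kicks off (not necessarily the exact command the space emits), here is a minimal sketch using EleutherAI's lm-evaluation-harness Python API; the model id and task list are placeholders chosen for this example:

```python
# A minimal sketch, assuming lm-evaluation-harness is installed (pip install lm-eval).
# The model id and task names below are illustrative placeholders, not the
# space's actual output.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct",
    tasks=["ifeval", "gsm8k"],  # any task names registered in the harness
    batch_size=8,
)

# simple_evaluate returns a dict; per-task metrics sit under "results"
for task, metrics in results["results"].items():
    print(task, metrics)
```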
The current Open LLM Leaderboard has been evaluating LLMs since Jun 2024 on IFEval, MuSR, GPQA, MATH, BBH, and MMLU-Pro:
- Open-LLM performances are plateauing, let's make the leaderboard steep again 🏔 (125 likes) • Explore and compare advanced language models on a new leaderboard
- Open LLM Leaderboard 🏆 (13.7k likes) • Track, rank and evaluate open LLMs and chatbots
Datasets:
- open-llm-leaderboard/contents • Viewer • Updated • 4.58k • 10.4k • 21
- open-llm-leaderboard/results • Preview • Updated • 21.5k • 16
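Both of these are ordinary Hub dataset repos, so the leaderboard numbers can be pulled down and inspected locally. A minimal sketch with the datasets library, assuming the repo exposes a default train split; the schema is best discovered at runtime rather than hard-coded:

```python
# A minimal sketch, assuming the repo exposes a default "train" split.
from datasets import load_dataset

contents = load_dataset("open-llm-leaderboard/contents", split="train")

print(contents.column_names)  # discover the actual schema first
print(contents[0])            # one row of the leaderboard table
```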