Multilingual LLM Evaluation
Multilingual Evaluation Benchmarks
CohereLabs/Global-MMLU
Global-MMLU 🌍 is a multilingual evaluation set of exam-style questions spanning 42 languages, including English. It greatly improves the multilingual coverage and quality of the English MMLU through professional translations and crowd-sourced post-edits. It also includes cultural sensitivity annotations, classifying samples as Culturally Sensitive (CS) 🗽 or Culturally Agnostic (CA) ⚖️.
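All of the datasets in this collection can be pulled with the 🤗 `datasets` library. Below is a minimal sketch of loading Global-MMLU and splitting it by its cultural sensitivity annotation; the config name ("en"), split ("test"), and label column ("cultural_sensitivity_label") are assumptions based on the description above, so check the dataset card for the exact names.

```python
from datasets import load_dataset

# Minimal sketch: load one language config of Global-MMLU and split samples by
# their cultural sensitivity annotation. The config name ("en"), split ("test"),
# and label column ("cultural_sensitivity_label") are assumptions; verify them
# against the dataset card before running.
ds = load_dataset("CohereLabs/Global-MMLU", "en", split="test")

culturally_sensitive = ds.filter(lambda ex: ex["cultural_sensitivity_label"] == "CS")
culturally_agnostic = ds.filter(lambda ex: ex["cultural_sensitivity_label"] == "CA")

print(len(culturally_sensitive), "CS samples,", len(culturally_agnostic), "CA samples")
```

The same `load_dataset` call works for the other entries below by swapping in their repository names.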
CohereLabs/Global-MMLU-Lite
Global-MMLU-Lite is a multilingual evaluation set spanning 15 languages, including English. It is a "lite" version of the original Global-MMLU dataset 🌍: its samples correspond to the languages that are fully human-translated or post-edited in the original Global-MMLU dataset.
CohereLabs/m-ArenaHard
The m-ArenaHard dataset is an extremely challenging multilingual LLM evaluation set for measuring the quality of open-ended generations. It was created by translating the prompts from the originally English-only LMArena (formerly LMSYS) arena-hard-auto-v0.1 test dataset into 22 languages using the Google Translate API (v3). For each language, there are 500 challenging user queries sourced from Chatbot Arena.
CohereLabs/include-base-44
INCLUDE is a comprehensive collection of in-language exams across 44 languages that evaluates multilingual LLMs in the actual language environments where they would be deployed. It contains 22,637 four-option multiple-choice questions (MCQs) extracted from academic and professional exams, covering 57 topics, including regional knowledge.
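Since INCLUDE consists of four-option MCQs, a typical evaluation loop formats each question with its options, asks the model for a letter, and compares it against the gold answer. The sketch below is schema-agnostic: the field names ("question", "choices", "answer") and the `ask_model` callable are illustrative assumptions, not the dataset's confirmed column layout.

```python
from typing import Callable

LETTERS = ["A", "B", "C", "D"]

def format_prompt(example: dict) -> str:
    # Render a four-option MCQ as a single prompt ending with an answer instruction.
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, example["choices"])
    )
    return f"{example['question']}\n{options}\nAnswer with a single letter."

def mcq_accuracy(examples: list[dict], ask_model: Callable[[str], str]) -> float:
    # Fraction of questions where the model's first predicted letter matches the gold answer.
    correct = 0
    for example in examples:
        prediction = ask_model(format_prompt(example)).strip().upper()[:1]
        correct += prediction == example["answer"]
    return correct / len(examples)

# Usage with a trivial stand-in "model" that always answers A:
demo = [{"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": "A"}]
print(mcq_accuracy(demo, lambda prompt: "A"))  # -> 1.0
```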
CohereLabs/include-lite-44
INCLUDE is a comprehensive knowledge- and reasoning-centric benchmark across 44 languages that evaluates multilingual LLMs in the actual language environments where they would be deployed. For a quicker evaluation, you can use include-lite-44, a subset of include-base-44 covering the same 44 languages.
CohereLabs/aya_redteaming
The Aya Red-teaming dataset is a human-annotated multilingual red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm, with explicit labels for "global" and "local" harm.
CohereLabs/aya_evaluation_suite
The Aya Evaluation Suite contains open-ended, conversation-style prompts for evaluating multilingual open-ended generation quality. To strike a balance between language coverage and the quality that comes with human curation, the suite covers 101 languages for evaluating the conversational abilities of language models.
CohereLabsCommunity/multilingual-reward-bench
M-RewardBench is a reward-model benchmark covering 23 typologically diverse languages. It contains prompt-chosen-rejected preference triples obtained by curating and translating chat, safety, and reasoning instances.
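Preference-triple benchmarks like M-RewardBench are usually scored by checking how often a reward model rates the chosen response above the rejected one. The sketch below assumes illustrative field names ("prompt", "chosen", "rejected") and an externally supplied `score` function; neither is the benchmark's confirmed schema.

```python
from typing import Callable

def preference_accuracy(
    triples: list[dict], score: Callable[[str, str], float]
) -> float:
    # A triple counts as a win when the reward model scores "chosen" above "rejected".
    wins = sum(
        score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
        for t in triples
    )
    return wins / len(triples)

# Usage with a toy length-based "reward model":
demo = [{"prompt": "Say hi", "chosen": "Hello there! How can I help?", "rejected": "hi"}]
print(preference_accuracy(demo, lambda prompt, response: len(response)))  # -> 1.0
```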