Multilingual LLM Evaluation
Multilingual Evaluation Benchmarks
CohereLabs/Global-MMLU
Global-MMLU 🌍 is a multilingual evaluation set of exam-style questions spanning 42 languages, including English. It greatly improves the multilingual coverage and quality of the English MMLU through professional translations and crowd-sourced post-edits. It also includes cultural sensitivity annotations, classifying samples as Culturally Sensitive (CS) 🗽 or Culturally Agnostic (CA) ⚖️.
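All of the datasets in this collection can be pulled with the 🤗 `datasets` library. Below is a minimal sketch of loading Global-MMLU and splitting it by its cultural sensitivity annotation; the config name ("en"), split ("test"), and label column ("cultural_sensitivity_label") are assumptions based on the description above, so check the dataset card for the exact names.

```python
from datasets import load_dataset

# Minimal sketch: load one language config of Global-MMLU and split samples by
# their cultural sensitivity annotation. The config name ("en"), split ("test"),
# and label column ("cultural_sensitivity_label") are assumptions; verify them
# against the dataset card before running.
ds = load_dataset("CohereLabs/Global-MMLU", "en", split="test")

culturally_sensitive = ds.filter(lambda ex: ex["cultural_sensitivity_label"] == "CS")
culturally_agnostic = ds.filter(lambda ex: ex["cultural_sensitivity_label"] == "CA")

print(len(culturally_sensitive), "CS samples,", len(culturally_agnostic), "CA samples")
```

The same `load_dataset` call works for the other entries below by swapping in their repository names.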
CohereLabs/Global-MMLU-Lite
Global-MMLU-Lite is a multilingual evaluation set spanning 15 languages, including English. It is a "lite" version of the original Global-MMLU dataset 🌍: its samples correspond to the languages that are fully human-translated or post-edited in the original Global-MMLU dataset.
CohereLabs/m-ArenaHard
The m-ArenaHard dataset is an extremely challenging multilingual LLM evaluation set for measuring the quality of open-ended generations. It was created by translating the prompts from the originally English-only LMArena (formerly LMSYS) arena-hard-auto-v0.1 test dataset into 22 languages using the Google Translate API (v3). For each language, there are 500 challenging user queries sourced from Chatbot Arena.
CohereLabs/include-base-44
INCLUDE is a comprehensive collection of in-language exams across 44 languages that evaluates multilingual LLMs in the actual language environments where they would be deployed. It contains 22,637 four-option multiple-choice questions (MCQs) extracted from academic and professional exams, covering 57 topics, including regional knowledge.
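Since INCLUDE consists of four-option MCQs, a typical evaluation loop formats each question with its options, asks the model for a letter, and compares it against the gold answer. The sketch below is schema-agnostic: the field names ("question", "choices", "answer") and the `ask_model` callable are illustrative assumptions, not the dataset's confirmed column layout.

```python
from typing import Callable

LETTERS = ["A", "B", "C", "D"]

def format_prompt(example: dict) -> str:
    # Render a four-option MCQ as a single prompt ending with an answer instruction.
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, example["choices"])
    )
    return f"{example['question']}\n{options}\nAnswer with a single letter."

def mcq_accuracy(examples: list[dict], ask_model: Callable[[str], str]) -> float:
    # Fraction of questions where the model's first predicted letter matches the gold answer.
    correct = 0
    for example in examples:
        prediction = ask_model(format_prompt(example)).strip().upper()[:1]
        correct += prediction == example["answer"]
    return correct / len(examples)

# Usage with a trivial stand-in "model" that always answers A:
demo = [{"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": "A"}]
print(mcq_accuracy(demo, lambda prompt: "A"))  # -> 1.0
```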
CohereLabs/include-lite-44
INCLUDE is a comprehensive knowledge- and reasoning-centric benchmark across 44 languages that evaluates multilingual LLMs in the actual language environments where they would be deployed. For a quicker evaluation, you can use include-lite-44, a subset of include-base-44 covering the same 44 languages.
CohereLabs/aya_redteaming
The Aya Red-teaming dataset is a human-annotated multilingual red-teaming dataset consisting of harmful prompts in 8 languages across 9 different categories of harm, with explicit labels for "global" and "local" harm.
CohereLabs/aya_evaluation_suite
The Aya Evaluation Suite contains open-ended, conversation-style prompts for evaluating multilingual open-ended generation quality. To strike a balance between language coverage and the quality that comes with human curation, the suite covers 101 languages for evaluating the conversational abilities of language models.
CohereLabsCommunity/multilingual-reward-bench
M-RewardBench is a reward-model benchmark covering 23 typologically diverse languages. It contains prompt-chosen-rejected preference triples obtained by curating and translating chat, safety, and reasoning instances.
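Preference-triple benchmarks like M-RewardBench are usually scored by checking how often a reward model rates the chosen response above the rejected one. The sketch below assumes illustrative field names ("prompt", "chosen", "rejected") and an externally supplied `score` function; neither is the benchmark's confirmed schema.

```python
from typing import Callable

def preference_accuracy(
    triples: list[dict], score: Callable[[str, str], float]
) -> float:
    # A triple counts as a win when the reward model scores "chosen" above "rejected".
    wins = sum(
        score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
        for t in triples
    )
    return wins / len(triples)

# Usage with a toy length-based "reward model":
demo = [{"prompt": "Say hi", "chosen": "Hello there! How can I help?", "rejected": "hi"}]
print(preference_accuracy(demo, lambda prompt, response: len(response)))  # -> 1.0
```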