Multilingual Benchmarks Overview
- Multilingual benchmarks are evaluation suites that measure AI model capabilities across a range of language tasks with parallel, culturally calibrated test sets.
- They employ rigorous translation, localization, and quality control methods to expose performance gaps and challenges in low-resource languages.
- Empirical studies reveal significant cross-lingual disparities, underscoring the need for equitable and robust AI systems.
A multilingual benchmark is a systematically designed evaluation suite that measures the capabilities of models—most often LLMs, but also code generators, agentic systems, or machine translation engines—across multiple natural (and sometimes programming) languages, typically spanning fundamental to advanced tasks. The objectives of such benchmarks are to rigorously assess cross-lingual generalization, reveal disparities in performance across languages, highlight challenges in low-resource settings, and foster the development of equitable, robust, and culturally grounded AI systems.
1. Scope and Motivation
The recent proliferation of LLMs and generative AI has catalyzed the creation of multilingual benchmarks as the primary mechanism for assessing progress in cross-lingual transfer, functional competence, reasoning, translation, and agentic behavior across a spectrum of languages. Early benchmarks focused on monolingual (usually English) tasks, but state-of-the-art suites now include hundreds of languages (e.g., AI Language Proficiency Monitor’s 200, MuBench’s 61, MultiLoKo’s 31) and a diverse task mix spanning translation, QA, code generation, math, reasoning, multi-step agentic workflows, and security (Han et al., 24 Jun 2025, Pomerenke et al., 11 Jul 2025, Hupkes et al., 14 Apr 2025, Zhang et al., 2024, Wang et al., 21 May 2025). The central rationale is twofold: (i) LLMs are increasingly deployed globally, demanding fair and reliable evaluation in all major languages; (ii) monolingual or naively translated benchmarks obscure genuine linguistic and cultural subtleties, thereby overstating models’ real-world readiness (Almeida et al., 17 Sep 2025).
2. Benchmark Design Principles and Task Coverage
Leading multilingual benchmarks are governed by principles of language diversity, parallel task structure, rigorous translation/localization, and multi-granular evaluation. Design choices include:
- Language Selection: Inclusion is guided by speaker population, typological diversity, resource availability, and sometimes explicit coverage goals (e.g., MuBench’s 61 languages by native speakers and token share, X-WebAgentBench’s 14 languages by XNLI criterion to maximize typological and regional spread) (Han et al., 24 Jun 2025, Wang et al., 21 May 2025).
- Task and Domain Coverage: Benchmarks typically map to the following categories:
- Core NLP: Machine translation (FLORES, FLORES+), QA (XQuAD, MMLU, MMMLU, GlobalMMLU, MultiLoKo), natural language inference (XNLI, SNLI, MultiNLI).
- Reasoning and Math: Mathematical reasoning (PolyMath, MGSM, CL-GSM Symbolic), logical reasoning (MLogiQA), chain-of-thought.
- Coding and Program Synthesis: Code generation/completion (mHumanEval, McEval, HumanEval-XL) in both multilingual prompt and programming language settings (Raihan et al., 2024, Chai et al., 2024).
- Agentic Evaluation: Interactive, function-calling, or web-based tasks (Ticket-Bench for regionally grounded agent workflows (Almeida et al., 17 Sep 2025); X-WebAgentBench for interactive web shopping (Wang et al., 21 May 2025); MAPS for agent security and tool-use (Hofman et al., 21 May 2025)).
- Functional and Instruction Following: Cross-lingual symbolic math and instruction-following tasks (CL-GSM Symbolic, CL-IFEval) (Ojewale et al., 25 Jun 2025).
- Low-resource and Minoritized Languages: CreoleVal (28 Creoles) (Lent et al., 2023), domain-specific codeswitched and local knowledge tests (MultiLoKo) (Hupkes et al., 14 Apr 2025).
Most modern benchmarks enforce strict parallelism—identically structured instances across all languages—facilitating direct performance and consistency comparison (Zhang et al., 2024, Han et al., 24 Jun 2025, Pomerenke et al., 11 Jul 2025).
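Strict parallelism can be checked mechanically; the sketch below (a hypothetical item layout, not any benchmark's actual schema) verifies that every language split contains exactly the same set of instance IDs:

```python
# Verify that a benchmark's language splits are strictly parallel:
# every language must cover the same set of instance IDs.
# The item layout here is illustrative, not any benchmark's real schema.

def check_parallelism(splits: dict[str, list[dict]]) -> list[str]:
    """Return languages whose instance IDs diverge from the first language."""
    langs = sorted(splits)
    reference = {item["id"] for item in splits[langs[0]]}
    divergent = []
    for lang in langs[1:]:
        ids = {item["id"] for item in splits[lang]}
        if ids != reference:
            divergent.append(lang)
    return divergent

splits = {
    "en": [{"id": 1, "q": "2+2?"}, {"id": 2, "q": "Capital of France?"}],
    "de": [{"id": 1, "q": "2+2?"}, {"id": 2, "q": "Hauptstadt Frankreichs?"}],
    "sw": [{"id": 1, "q": "2+2?"}],  # missing item 2 -> not parallel
}
print(check_parallelism(splits))  # ['sw']
```

A check like this is what makes instance-level consistency metrics meaningful: without identical IDs across languages, answers cannot be aligned for comparison.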
3. Dataset Construction and Cultural Localization
Construction methodologies differentiate high-fidelity benchmarks:
- Parallel Translation and Calibration: Machine translation is often used to generate initial drafts (via DeepL, ChatGPT, GPT-4o, NLLB, Google Translate), followed by professional or expert bilingual review for semantic accuracy, fluency, and cultural/idiomatic fidelity (Han et al., 24 Jun 2025, Zhang et al., 2024).
- Cultural and Domain Adaptation: Region-specific entities replace generic ones (e.g., Ticket-Bench’s localized soccer leagues and team names for each language; MultiLoKo’s knowledge questions authored from the top-visited Wikipedia pages of each language, filtered for local relevance) (Almeida et al., 17 Sep 2025, Hupkes et al., 14 Apr 2025).
- Quality Control Procedures: Systematic error classification (minor/major/critical), back-translation validation, and rating for cultural appropriateness are standard (Taguchi et al., 28 Aug 2025, Barth et al., 18 Feb 2025).
- Pragmatic Splits: Many benchmarks provide both dev (high-frequency, familiar topics) and out-of-distribution test splits (low-frequency or blind-tail topics) to probe robust generalization (Hupkes et al., 14 Apr 2025).
A recurring insight is that locally sourced and human-authored data more accurately reflects real-world language behavior than translated English test sets or synthetic items (Wu et al., 22 Apr 2025, Hupkes et al., 14 Apr 2025).
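The back-translation validation step mentioned above can be sketched as follows. This is a minimal illustration, not any benchmark's actual pipeline: `translate` is a hypothetical stand-in for a real MT system (DeepL, NLLB, etc.), and token-overlap F1 is a deliberately crude proxy for semantic drift:

```python
# Hedged sketch of back-translation validation: translate each localized
# item back to the source language and flag items whose back-translation
# drifts too far from the original. `translate` is a hypothetical stand-in
# for a real MT system; token-overlap F1 is a crude drift proxy.

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    p, r = common / len(tb), common / len(ta)
    return 2 * p * r / (p + r)

def flag_drifted_items(items, translate, threshold=0.5):
    """Return (id, score) pairs for items whose back-translation scores low."""
    flagged = []
    for item in items:
        back = translate(item["translated"], target="en")
        score = token_f1(item["source"], back)
        if score < threshold:
            flagged.append((item["id"], round(score, 2)))
    return flagged
```

In a real pipeline, items flagged this way would be routed to the bilingual expert review and error-severity classification described above rather than rejected automatically.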
4. Evaluation Metrics and Analysis
Multilingual benchmarks deploy a range of task-appropriate metrics:
- Classification and QA: Accuracy, F₁ (for span or entity tasks), and Exact Match (EM).
- Translation: BLEU, ChrF++, and specialized variants (SpBLEU, TQS, TQS_MQM for FLORES+ (Taguchi et al., 28 Aug 2025, Pomerenke et al., 11 Jul 2025)).
- Code Generation: Pass@k, typically with Pass@1 as the primary metric (Raihan et al., 2024, Chai et al., 2024).
- Agentic/Planning: TaskScore (fraction of tasks completed correctly), step-efficiency, cross-lingual disparity as standard deviation, consistency (e.g., Ticket-Bench's pass³), and execution accuracy for program analysis (Almeida et al., 17 Sep 2025, Wang et al., 21 May 2025, Pham et al., 29 Sep 2025).
- Novel Cross-Lingual Measures: Multilingual Consistency (MLC), quantifying the fraction of identical answers across aligned items in different languages, and Mother-Tongue Effect (MTE), capturing the difference between asking in a local language versus English (Han et al., 24 Jun 2025, Hupkes et al., 14 Apr 2025).
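The Pass@k metric above is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): draw k samples from n generated programs, c of which pass the tests, and report the probability that at least one passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations (c of which
    are correct), passes the tests."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a correct program
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=0, k=1))  # 0.0
print(pass_at_k(n=10, c=5, k=1))  # 0.5
```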
Special analyses quantify robustness (performance gap between best/worst languages), functional drops (performance loss from static to functional evaluation), and translation-induced artifacts (Ojewale et al., 25 Jun 2025, Wu et al., 22 Apr 2025). Comprehensive leaderboards (e.g., AI Language Proficiency Monitor (Pomerenke et al., 11 Jul 2025)) and diagnostic heatmaps facilitate comparative and longitudinal analysis.
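Two of these measures are easy to make concrete. The sketch below counts items answered identically across all languages (one plausible reading of MLC; MuBench's exact pairing and averaging details may differ) and computes the best-versus-worst robustness gap:

```python
# Sketch of two cross-lingual measures. The exact MLC definition used by
# MuBench may differ in pairing/averaging details; this version counts
# items answered identically across ALL languages.

def multilingual_consistency(answers: dict[str, list[str]]) -> float:
    """Fraction of aligned items with the same answer in every language."""
    langs = list(answers)
    n_items = len(answers[langs[0]])
    same = sum(
        1 for i in range(n_items)
        if len({answers[lang][i] for lang in langs}) == 1
    )
    return same / n_items

def robustness_gap(scores: dict[str, float]) -> float:
    """Performance gap between the best- and worst-scoring languages."""
    return max(scores.values()) - min(scores.values())

# Toy data: three aligned items, three languages.
answers = {"en": ["A", "B", "C"], "de": ["A", "B", "D"], "fr": ["A", "B", "C"]}
print(round(multilingual_consistency(answers), 3))  # 0.667
print(round(robustness_gap({"en": 0.82, "de": 0.74, "sw": 0.51}), 2))  # 0.31
```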
5. Empirical Findings and Cross-Lingual Disparities
Empirical evidence reveals that:
- Large, Reasoning-Optimized LLMs consistently outperform smaller or non-reasoning models in both accuracy and cross-language consistency but still exhibit notable cross-lingual gaps (e.g., up to 5–20 points absolute difference across languages, even for top-tier GPT-5 or Qwen3-235B) (Almeida et al., 17 Sep 2025, Zhang et al., 2024, Han et al., 24 Jun 2025).
- Low-Resource Languages Languish: High-resource languages (e.g., English, Chinese, Spanish, German) show 10–40 points higher accuracy than low-resource or non-European languages. Disparities are more pronounced in generation tasks, complex reasoning, and agentic workflows (Han et al., 24 Jun 2025, Hupkes et al., 14 Apr 2025, Pomerenke et al., 11 Jul 2025, Zhang et al., 2024).
- Functional Evaluations Reveal Hidden Weaknesses: Functional and instruction-following tasks (CL-GSM Symbolic, CL-IFEval) induce drops of 15–30 percentage points in cross-lingual fidelity compared to static benchmarks, amplifying gaps masked by multiple-choice accuracy (Ojewale et al., 25 Jun 2025).
- Cultural Fidelity Is Crucial: Translated benchmarks—especially those that are English-centric or reliant on named entities—yield both over- and under-estimation of true cross-lingual performance. Locally-authored test sets and adaptation to regional context are necessary to surface authentic linguistic challenges (Wu et al., 22 Apr 2025, Taguchi et al., 28 Aug 2025).
6. Challenges: Contamination, Data Quality, and Validity
Unintended “contamination” of test sets in LLM pretraining is now nearly universal for prominent benchmarks. The Black-Box Test methodology shows 45/49 model–benchmark pairs exhibit contamination, which can dramatically inflate scores and mask weaknesses in zero-shot cross-lingual transfer (Ahuja et al., 2024). Other recurring issues include:
- Translationese and Bias: Artifacts from machine translation, anglocentric source data, and lack of regional adaptation (cultural bias) can lead to either artificial performance gains or unfair penalties (Barth et al., 18 Feb 2025, Taguchi et al., 28 Aug 2025).
- Data Quality and Drift: Weakness in translation protocols, lack of post-editing, and insufficient expert review permit lexical, syntactic, and register errors; these are particularly problematic in technical, mathematical, or programmatic benchmarks (Taguchi et al., 28 Aug 2025, Wang et al., 25 Apr 2025).
- Insufficient Coverage: Most benchmarks overrepresent English and languages tied to a handful of “G5” countries (China, India, Germany, the UK, and the USA); only recent efforts integrate truly global and minoritized languages (Wu et al., 22 Apr 2025, Lent et al., 2023).
- Practical Relevance: Many benchmarks, especially those synthesized from Wikipedia or news, are less reflective of actual user needs or low-level system integration (e.g., interactive agents, secure tool use) (Wu et al., 22 Apr 2025, Hofman et al., 21 May 2025).
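The permutation-test logic behind black-box contamination audits can be sketched in a few lines. This is an illustrative framework, not the cited methodology's exact procedure: `score_sequence` is a hypothetical callable returning a model log-likelihood for an ordering of benchmark items, and the test asks whether the canonical ordering is an outlier among random shuffles (an exchangeability argument: an uncontaminated model should have no preference for the published order):

```python
import random

def contamination_p_value(items, score_sequence, n_perms=1000, seed=0):
    """Permutation-test sketch for ordering-based contamination audits.
    If the model scores the benchmark's canonical ordering markedly
    higher than random shuffles, that ordering was plausibly memorized
    during pretraining. `score_sequence` is a hypothetical callable
    returning a (log-likelihood-style) score for an item ordering."""
    rng = random.Random(seed)
    canonical = score_sequence(items)
    at_least_as_high = 1  # count the canonical ordering itself
    for _ in range(n_perms):
        shuffled = items[:]
        rng.shuffle(shuffled)
        if score_sequence(shuffled) >= canonical:
            at_least_as_high += 1
    return at_least_as_high / (n_perms + 1)

# Toy demonstration with a scorer that "remembers" ascending order:
toy_items = list(range(20))
prefers_canonical = lambda xs: -sum(abs(i - x) for i, x in enumerate(xs))
print(contamination_p_value(toy_items, prefers_canonical, n_perms=200))
# small p-value: the canonical ordering is an extreme outlier
```

A low p-value flags the model–benchmark pair for exclusion from headline scores, as recommended below.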
7. Best Practices and Research Directions
Best practice recommendations and future priorities across leading literature include:
- Rigorous Translation and Cultural Calibration: Use layered pipelines—automated machine translation, followed by professional/crowdsourced post-editing, cultural sensitivity annotation, and iterative MT quality ranking (COMET/MQM scoring) (Barth et al., 18 Feb 2025, Taguchi et al., 28 Aug 2025).
- Benchmark Design: Prioritize locally authored content for new languages wherever feasible. Provide parallel partitions (local, human-translated, machine-translated) and report all variants (Hupkes et al., 14 Apr 2025).
- Contamination Auditing: Employ Black-Box permutation tests to publish contamination p-values and exclude highly contaminated benchmark–model pairs from leaderboards or headline scores (Ahuja et al., 2024).
- Rich, Multi-Dimensional Evaluation: Combine traditional accuracy/F₁/EM with cross-lingual consistency, robustness (max–min across languages), and error/failure mode audits. Release code and data to enable extension and reproducibility (Han et al., 24 Jun 2025, Alshehhi et al., 25 Jul 2025).
- Beyond English: Strongly encourage expansion in minoritized, low-resource, and oral languages; foster regional collaborations and human-aligned evaluation for long-term inclusivity (Wu et al., 22 Apr 2025, Lent et al., 2023).
- Benchmark Maintenance: Regularly update datasets and leaderboards to cope with model evolution, contamination, and obsolescence (per AI Language Proficiency Monitor and McEval’s auto-updating frameworks) (Pomerenke et al., 11 Jul 2025, Chai et al., 2024).
- Cross-Domain and Modality Integration: Future directions call for broadening task coverage to generation, conversational and multimodal contexts, and deeply integrating functional, agentic, and security-critical tests (Wu et al., 22 Apr 2025, Hofman et al., 21 May 2025, Wang et al., 21 May 2025).
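The MQM-style scoring referenced in the translation-calibration recommendation above reduces to weighted error penalties normalized by segment length. The sketch below uses one common weighting convention (minor=1, major=5, critical=10) and a per-word normalization; actual pipelines (e.g., FLORES+'s TQS_MQM) may use different weights and normalization:

```python
# Hedged sketch of MQM-style translation quality scoring: weighted error
# penalties normalized against source length. Weights and normalization
# here follow one common convention and are assumptions, not the exact
# scheme of any cited benchmark.

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(errors: list[str], n_source_words: int) -> float:
    """Return 100 minus penalty points scaled by source length (floored at 0)."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for sev in errors)
    return max(0.0, 100.0 - 100.0 * penalty / n_source_words)

# Two minor errors and one major error over a 50-word segment:
print(mqm_score(["minor", "minor", "major"], n_source_words=50))  # 86.0
```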
In sum, multilingual benchmarks have evolved into highly structured, culturally aware, and technically robust instruments that play a pivotal role in diagnosing, comparing, and ultimately improving the cross-lingual capabilities of modern LLMs and agentic systems. Their continued advancement is essential for equitable AI deployment and for closing the global digital-language divide.