Global MMLU Lite

Updated 8 February 2026

Global MMLU Lite is a reduced benchmark that compresses full multilingual evaluations via stratified sampling to maintain linguistic balance and cultural sensitivity.
It employs a multi-stage translation pipeline and expert quality control to ensure accuracy, consistency, and reliable cross-lingual performance assessments.
The framework enables rapid, cost-effective evaluation of LLMs across diverse resource tiers, informing targeted model improvements and ablation studies.

Global MMLU Lite refers to both a methodological approach and a set of concrete benchmarks designed to facilitate efficient, representative, and culturally balanced evaluations of LLMs on multilingual and cross-lingual reasoning. It operationalizes stratified sampling and rigorous quality control to compress the comprehensive Global MMLU benchmark into a smaller, cost-effective format while retaining essential statistical and cultural properties of the broader evaluation suite (Singh et al., 2024, Xuan et al., 13 Mar 2025).

1. Motivation and Framework

Multilingual evaluation benchmarks historically suffer from two critical limitations: linguistic resource imbalance and pervasive cultural bias. The Global MMLU project demonstrated that benchmarks such as MMLU, when directly translated, systematically overrepresent Western-centric cultural and regional knowledge—approximately 28% of questions in the original English set require culturally sensitive knowledge, with 84.9% of geography-related items focusing on North American or European contexts (Singh et al., 2024). Moreover, translation artifacts can distort model rankings and obscure genuine cross-linguistic capabilities. The Global MMLU Lite paradigm addresses these issues by enforcing controlled subsampling across language resource tiers, cultural-sensitivity tags, and academic subjects, delivering a reduced yet statistically robust testbed for rapid iteration, characterization of low-resource language performance deficits, and model ablation studies (Xuan et al., 13 Mar 2025).

2. Construction and Stratified Sampling

Global MMLU Lite is constructed via coordinated stratified sampling from the full Global MMLU or MMLU-ProX pools. The procedure targets three principal axes: language (balancing across high-, mid-, and low-resource categories), subject domain (ensuring broad academic coverage), and cultural sensitivity (maintaining original CS/CA class ratios).

A canonical instantiation involves the following configuration:

Axis	Target	Sampling Ratio/Formulation
Languages ( $K$ )	14 (e.g., 5 high, 5 mid, 4 low)	$\approx$ 120 questions/language, balanced tiers
Subjects ( $C$ )	STEM, Humanities, Social Sciences	$\approx$ 20 questions/language/domain
Cultural sensitivity	$p_{CS}$ ≈ 28%, $p_{CA}$ ≈ 72%	controlled by stratification

Sample sizes are controlled such that $N_{Lite} \approx 10\,000$ (Global MMLU) or $658$ (MMLU-ProX Lite) total questions, subject to the constraints

$\sum_{\ell,c} n_{\ell,c} = N_{Lite}, \quad n_{\ell,c} \propto \frac{1}{|L_{Lite}| \times |C|} \times \begin{cases} 0.28, & \text{CS} \ 0.72, & \text{CA} \ \end{cases}$

Languages are sampled to ensure no single resource tier dominates, closely following the operationalizations of Joshi et al. (2019) and the Aya Initiative. Minimal thresholds guarantee that every language and domain constitutes at least 1% of the subset, ensuring coverage (Singh et al., 2024).

3. Quality Control and Translation Pipeline

Translation and question alignment employ a multi-stage, model-augmented expert workflow:

Source Question Preprocessing: Remove typos, unify mathematical/technical notation.
Primary Translation: Use high-performance LLMs (e.g., Claude 3.7 Sonnet) for initial translation.
Self-reflection Pass: LLM flags ambiguous or domain-specific terms for further scrutiny.
Secondary Review: GPT-4o reviews and refines terminology/phrasing, comparing outputs.
Consensus Resolution: Inconsistencies are adjudicated by an additional multilingual model (e.g., Llama3).
Human Verification: Language-domain experts rate randomly sampled items (≥ 20 per language-domain pair) on a 5-point Likert scale for accuracy, fluency, and completeness. Items with sub-threshold ratings (< 4) are corrected or retranslated (Xuan et al., 13 Mar 2025).

Terminological consistency is enforced via a master glossary across all languages, with targeted prompt engineering to maintain domain-appropriate rendering (e.g., consistent translation of "activation energy" across STEM domains). For questions with inherently culture-dependent content, context notes or local analogues are introduced with input from regional experts. Round-trip semantic equivalence tests (English $\to$ $L_1$ $\to$ English) are performed on 10% samples.

4. Dataset Format, Access, and Structure

Global MMLU Lite datasets are distributed in both JSONL and CSV formats, facilitating seamless ingestion with standard data science tools.

Example schema:

{
  "id": "lite_en_00123",
  "language": "EN",
  "subject": "Mathematics",
  "difficulty": "Hard",
  "question": "If f(x)=… what is ∫… dx?",
  "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
  "answer": "C"
}

Loading is supported via native Python (json) or pandas, and datasets can be retrieved from open repositories such as GitHub or Hugging Face ("mmlu_prox_lite"), under a CC BY-NC-4.0 license. Each question records language, subject, difficulty, the stem, multiple-choice options ("A"-"D"), and answer key, enabling integration with lm-evaluation-harness and similar toolkits (Xuan et al., 13 Mar 2025).

A practical Global MMLU Lite configuration is:

Parameter	Value
Total Questions	10,000 (Global); 658/language (ProX Lite)
Languages	14 (stratified)
Domains	~6 broad fields
CS items	28%
CA items	72%
Difficulty	Spanning easy–expert

5. Evaluation Protocols and Metrics

Global MMLU Lite deployments primarily utilize two prompting protocols for LLM assessment:

Zero-shot:

1
2
3

Question: {question}
Choices: {A}, {B}, {C}, {D}
Answer:

5-shot Chain-of-Thought (CoT): Five annotated English examples precede the target language question, promoting cross-lingual transfer.

Parameter settings generally fix temperature at 0.0 (greedy), max tokens at 256, with response evaluation bounded by first valid answer marker ("A"/"B"/"C"/"D") (Xuan et al., 13 Mar 2025).

Evaluation metrics include:

Accuracy per language:

$\text{Acc}_\ell = \frac{1}{N_\ell}\sum_{i=1}^{N_\ell} 1(\hat{y}_i = y_i)$

Overall accuracy:

$\text{Acc} = \frac{1}{\sum_\ell N_\ell} \sum_\ell \sum_{i=1}^{N_\ell} 1(\hat{y}_i = y_i)$

Macro-average accuracy:

$\text{Acc}_\text{macro} = \frac{1}{L} \sum_{\ell=1}^L \text{Acc}_\ell$

Cross-lingual gap:

$\Delta = \max_\ell (\text{Acc}_\text{EN} - \text{Acc}_\ell)$

No formal p-values are reported, but a 95% confidence interval for accuracy can be computed as

$\mathrm{CI}_{95} = \hat{p} \pm 1.96 \sqrt{ \frac{\hat{p}(1-\hat{p})}{N} }$

(Singh et al., 2024, Xuan et al., 13 Mar 2025).

6. Empirical Results and Diagnostic Use Cases

Published evaluations reveal that high-resource languages consistently achieve 60–80% accuracy on Global MMLU Lite subsets with leading models (e.g., GPT-4: EN 78%; FR 72%; ZH 74%). In contrast, low-resource languages such as Bengali and Swahili observe pronounced drops (BN 52%, SW 45%), with performance gaps (Δ) exceeding 20 percentage points in challenging cases (Xuan et al., 13 Mar 2025).

Practical applications include:

Rapid comparative benchmarking of models across resource tiers, controlling for reasoning domain and question difficulty.
Diagnosis of domain-language deficits by analyzing per-domain accuracy in each language subset.
Ablation and transfer studies via in-domain CoT exemplars and prompt augmentation.
Targeted improvement cycles: Identified underperformance prompts corpus enrichment, vocabulary expansion, or finetuning with domain-aligned exemplars.

Accurate reporting requires per-language and macro-averaged accuracy, with interpretive emphasis placed on the causes behind low-resource drops—typically attributable to insufficient pretraining exposure and inconsistent term alignment. Augmentation or targeted finetuning is recommended as actionable remediation.

7. Statistical Balance, Diversity Indices, and Limitations

Statistical robustness is maintained by enforcing Shannon entropy over region and culture tags in each Lite subset: $H_\textrm{region} \ge 0.7 \times H_\textrm{full}$ and $H_\textrm{culture} \ge 0.7 \times H_\textrm{full}$ . This guarantees that Lite variants neither overfit to nor erase the original’s cultural and linguistic representativity (Singh et al., 2024).

Limitations include potential residual cultural bias due to the finite coverage of tags and the practical challenges in completely localizing domain-specific content across severely low-resource languages. Reported macro-average statistics are sensitive to the precise distribution of question types—a plausible implication is that continued refinement of tagging and validation pipelines is necessary to further reduce cultural and linguistic bias for future Global MMLU Lite releases.

References:

"Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation" (Singh et al., 2024)
"MMLU-ProX: A Multilingual Benchmark for Advanced LLM Evaluation" (Xuan et al., 13 Mar 2025)

Markdown Report Issue Upgrade to Chat

References (2)

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (2024)

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Global MMLU Lite.