Multilingual QA Benchmarks
- Multilingual QA Benchmarks are standardized tasks that assess question answering systems across diverse languages and modalities.
- They utilize rigorous dataset construction, translation pipelines, and evaluation metrics like EM, F1, and BLEU for cross-lingual comparability.
- Emerging research focuses on overcoming cultural, low-resource, and modality-specific challenges to improve global QA performance.
Multilingual Question Answering Benchmarks provide standardized tasks and corpora for evaluating question answering (QA) systems across multiple languages and modalities. These benchmarks address the need for rigorous, comparative assessment of cross-lingual transfer, cultural understanding, and domain-specific reasoning, going beyond English-centric benchmarks to cover underrepresented scripts, low-resource languages, and multiple modalities (text, tables, images, audio, speech). The following sections detail the design principles, dataset construction methodologies, evaluation protocols, cross-lingual challenges, representative benchmarks, and emerging research directions.
1. Design Principles and Dataset Construction
Multilingual QA benchmarks are characterized by careful language selection, answer normalization, and annotation pipelines designed to guarantee cross-lingual comparability:
- Parallel Data and Language Coverage: Benchmarks such as MKQA (Longpre et al., 2020) and MLQA (Lewis et al., 2019) construct thousands of question–answer pairs with parallel alignment across typologically diverse languages. MKQA covers 26 languages from 14 branches, while MLQA offers 4-way-parallel extractive QA across seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese.
- Domain and Task Diversity: Benchmarks target open-domain QA (MKQA, MLQA, XQuAD), domain-specific contexts (EXAMS (Hardalov et al., 2020), L3Cube-IndicQuest (Rohera et al., 2024), DZEN (Hosain et al., 24 May 2025) for science/education), table reasoning (M3TQA (Shu et al., 22 Aug 2025), MULTITAT (Zhang et al., 24 Feb 2025)), chart understanding (PolyChartQA (Xu et al., 16 Jul 2025)), cultural VQA (CVQA (Romero et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)), audio-visual QA (AVQA (Phukan et al., 2024)), and locale-sensitive open-domain QA (XLQA (Roh et al., 22 Aug 2025)).
- Translation and Quality Control: Many benchmarks employ high-quality translation pipelines with human experts or state-of-the-art LLMs, followed by back-translation and manual validation to minimize semantic drift (see M3TQA (Shu et al., 22 Aug 2025), PolyChartQA (Xu et al., 16 Jul 2025), XLQA (Roh et al., 22 Aug 2025), QALD-9-plus (Perevalov et al., 2022)). Quality thresholds are often set via BLEU (e.g., median BLEU=60.19 for M3TQA) or METEOR.
- Cultural and Regional Grounding: CVQA (Romero et al., 2024), L3Cube-IndicQuest (Rohera et al., 2024), and Afri-MCQA (Tonja et al., 9 Jan 2026) emphasize local expertise, cultural diversity, and representation of both global and regional knowledge, in contrast to translation-only extensions.
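The back-translation quality gate described under Translation and Quality Control can be sketched as follows. This is a minimal illustration, not any benchmark's actual pipeline: `sentence_bleu` here is a simplified sentence-level variant (uniform 1–4-gram weights, add-one smoothing), and the threshold of 60.0 only loosely mirrors the M3TQA-style cutoff.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU (0-100): uniform 1..max_n-gram weights,
    add-one smoothing, standard brevity penalty. Whitespace tokenization."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngram_counts(ref, n), ngram_counts(hyp, n)
        overlap = sum((ref_ng & hyp_ng).values())       # clipped n-gram matches
        total = max(len(hyp) - n + 1, 0)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages short hypotheses
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100.0 * bp * math.exp(log_prec / max_n)

def passes_backtranslation_qc(source_en, back_translated_en, threshold=60.0):
    """Keep a translated item only if its back-translation stays close to the source."""
    return sentence_bleu(source_en, back_translated_en) >= threshold
```

In a real pipeline this automatic gate is typically followed by manual validation of the items that fall near the threshold.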
2. Evaluation Frameworks and Metrics
Rigorous evaluation protocols in multilingual QA benchmarks ensure reproducibility and allow comparison across systems and languages. Core metrics include:
| Metric | Definition | Typical Use |
|---|---|---|
| Exact Match (EM) | Binary: normalized prediction string equals a reference answer | Extractive/span QA |
| Token-level F1 | Precision, recall, and F1 over overlapping answer tokens (see MLQA/MKQA) | Partial credit for near-miss spans |
| ROUGE-L | Longest-common-subsequence recall and precision | Generative/abstractive QA |
| BLEU | n-gram overlap (esp. for generative answer evaluation) | Text and speech QA |
| Relaxed Numeric Acc. | Numeric answers accepted within a small tolerance of the reference | Chart/table reasoning |
| Accuracy | Fraction of correct predictions | Multiple-choice, VQA, KGQA |
| MRR, P@k | Mean reciprocal rank and precision-at-k over ranked candidates | Cross-market, retrieval QA |
Benchmarks may employ both reference-based (EM, F1, ROUGE, BLEU) and “judge LLM” metrics (L3Cube-IndicQuest (Rohera et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)). For open-ended, speech, or culturally sensitive tasks, human-in-the-loop or LLM-judge scoring may complement automatic measures.
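The span-level metrics above can be made concrete. The sketch below follows the widely used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the relaxed numeric matcher and its 5% tolerance are illustrative defaults, not any specific benchmark's setting.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation, articles, extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Token-overlap F1: partial credit for near-miss spans."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())  # clipped token overlap
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def relaxed_numeric_match(prediction, reference, tolerance=0.05):
    """Accept numeric answers within a relative tolerance (illustrative 5% default);
    fall back to exact match for non-numeric strings."""
    try:
        p, r = float(prediction), float(reference)
    except ValueError:
        return exact_match(prediction, reference)
    if r == 0:
        return float(p == 0)
    return float(abs(p - r) / abs(r) <= tolerance)
```

For example, `token_f1("paris france", "paris")` yields 2/3: precision 0.5, recall 1.0, rewarding the overlapping token while penalizing the extra one.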
3. Key Cross-Lingual and Cross-Cultural Challenges
- Language Resource Imbalance: Benchmarks such as MKQA (Longpre et al., 2020), PolyChartQA (Xu et al., 16 Jul 2025), and Indic QA (Singh et al., 2024) demonstrate that performance on high-resource languages far exceeds that on low-resource or non-Latin-script languages, often by 10–50 F1 points.
- Modality-Specific Gaps: Multimodal (visual, audio, table, chart) QA benchmarks (PolyChartQA (Xu et al., 16 Jul 2025), CVQA (Romero et al., 2024), MTVQA (Tang et al., 2024), AVQA (Phukan et al., 2024), M3TQA (Shu et al., 22 Aug 2025)) reveal that existing models are not robust to script variance and complex visual-linguistic alignment. For instance, PolyChartQA reports sharp accuracy drops on Bengali and Urdu (non-Latin, low-resource), and MTVQA observes accuracy near 30% on non-English languages versus 80%+ on English.
- Locale Awareness and Cultural Sensitivity: XLQA (Roh et al., 22 Aug 2025) explicitly annotates and benchmarks locale-sensitive vs. locale-invariant questions, with LLMs showing 10–30 F1 point drops on locale-sensitive types. CVQA and Afri-MCQA further demonstrate that models often fail on culturally grounded questions or images.
- Impact of Translation Quality: Back-translation filtering, semantic consistency checks, and automatic/human verification are standard protocols. Despite rigorous pipelines, translation-based benchmarks still report degradation, especially for morphologically complex or low-resource languages.
4. Representative Benchmarks
| Benchmark | Modalities | # Languages | Scale | Task Types | Key Features |
|---|---|---|---|---|---|
| MKQA (Longpre et al., 2020) | Text | 26 | 10k × 26 = 260k | Open-domain QA, retrieval free | Wikidata entity linking; parallel queries |
| MLQA (Lewis et al., 2019) | Text | 7 | 12.7k EN, ~5k/other | Extractive QA on Wiki contexts | 4-way parallel, reference-aligned |
| PolyChartQA (Xu et al., 16 Jul 2025) | Chart images | 10 | 22,606 charts, 26k QAs | Chart-based VQA, 16 chart types | Decoupled translation+render, METEOR QC |
| M3TQA (Shu et al., 22 Aug 2025) | Tables | 97 | 2,916 QA pairs | Numerical, extraction, verification | 12 families, 6-step LLM + BLEU QC |
| MULTITAT (Zhang et al., 24 Feb 2025) | Table+Text | 11 | 250 parallel inst. | Span, arithmetic, count | Prompt-based baseline, error taxonomy |
| CVQA (Romero et al., 2024) | Images | 31 | 4,560 imgs, 9,044 Q | MC-VQA, open-ended VQA | Culturally authored, high script diversity |
| Afri-MCQA (Tonja et al., 9 Jan 2026) | Vision, audio | 15 (Africa) | 7,500 MCQ + audio | MC-VQA, open/speech-based | Text, speech, cultural focus; English + native languages |
| AVQA (Phukan et al., 2024) | Video + audio | 8 | 45–57k QA pairs/lng | Existential, location, temporal | Frozen encoder fusion, MT + human QC |
| EXAMS (Hardalov et al., 2020) | Text | 16 | 24,143 MCQs | Multi-subject, cross-lingual MCQ | High school exams, 8 families, 24 subjects |
| XLQA (Roh et al., 22 Aug 2025) | Text | 8 | 24,000 QAs | Open-domain, locale-sensitive/inv. | LLM-based translation + locale annotation |
| MTVQA (Tang et al., 2024) | Doc/scene img | 9 | 6,778 QAs/test | Text recovery, reasoning (TEC-VQA) | Fully manual align, visual-text focus |
These benchmarks span text, multimodal, and cross-domain QA, and include specializations for product QA (MCPQA (Yuan et al., 2024)), knowledge-graph QA (QALD-9-plus (Perevalov et al., 2022)), factual/abstractive QA (Indic QA (Singh et al., 2024)), and resource/actionable gaps for extremely low-resource settings (DZEN (Hosain et al., 24 May 2025)).
5. Empirical Results and Model Comparisons
- Performance Gaps: Across nearly all benchmarks, state-of-the-art LLMs (e.g., GPT-4o, Gemini, Qwen-series) excel in English but lag 10–50 F1 points on low-resource and non-Latin-script languages (Longpre et al., 2020, Xu et al., 16 Jul 2025, Phukan et al., 2024, Rohera et al., 2024, Singh et al., 2024, Tonja et al., 9 Jan 2026).
- Zero-Shot vs. Instruction/Few-Shot: Instruction tuning and few-shot paradigms provide moderate gains, while translation-augmented pipelines yield relative boosts, but do not eliminate gaps (Singh et al., 2024, Hosain et al., 24 May 2025, Zhang et al., 24 Feb 2025).
- Prompt Engineering and Locale Injection: In XLQA (Roh et al., 22 Aug 2025), explicit locale cues in prompts can improve performance by up to +25 F1 for Japanese, but risk stereotype amplification.
- Cultural and Region-Specific Gaps: L3Cube-IndicQuest (Rohera et al., 2024) and Afri-MCQA (Tonja et al., 9 Jan 2026) show that regional and cultural questions expose wider gaps, particularly in geography and history, for Indic and African low-resource languages.
- Modality-Specific Patterns: For chart and multimodal QA, vision–LLMs (Gemini, GPT-4o) outperform open-weight models, but both underperform on charts and images with local scripts, region-specific icons, or unfamiliar scene structure (Xu et al., 16 Jul 2025, Romero et al., 2024).
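The locale-injection idea can be sketched as a thin prompt wrapper; the cue wording below is hypothetical, not XLQA's actual template.

```python
def build_prompt(question, locale=None):
    """Prepend an explicit locale cue to an open-domain QA prompt.

    The cue phrasing is illustrative; benchmarks such as XLQA probe whether
    cues like this shift answers on locale-sensitive questions
    (e.g. emergency phone numbers, national holidays).
    """
    cue = f"Answer for a user located in {locale}. " if locale else ""
    return f"{cue}Question: {question}\nAnswer:"
```

The same wrapper supports A/B comparison of locale-cued vs. uncued prompts on locale-sensitive question sets, which is how prompt-level gains (and stereotype-amplification risks) can be measured.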
6. Methodological and Practical Implications
- Benchmarking Protocols: Multi-level QC—incorporating both automatic and human checks, back-translation, and error taxonomy—is now standard in high-quality benchmarks (M3TQA (Shu et al., 22 Aug 2025), PolyChartQA (Xu et al., 16 Jul 2025), XLQA (Roh et al., 22 Aug 2025), QALD-9-plus (Perevalov et al., 2022)).
- Task Diversity: Beyond extractive QA, current benchmarks probe aggregation (arithmetic), open-ended generation, factual verification, locale/culture specificity, and cross-market/domain transfer (MCPQA (Yuan et al., 2024)).
- Assessment Paradigms: Dual scoring using reference (EM, F1, ROUGE) and LLM-based or human-judge ratings (e.g., for factuality, conciseness, cultural adherence) allows granular error analysis (L3Cube-IndicQuest (Rohera et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)).
- Cross-Lingual Model Development: Analysis in M3TQA (Shu et al., 22 Aug 2025) and MULTITAT (Zhang et al., 24 Feb 2025) demonstrates that synthetic, LLM-generated multilingual data and instruction tuning can improve zero-shot performance, especially for low-resource scripts. However, cross-lingual linking and modality alignment remain core bottlenecks.
7. Open Challenges and Future Directions
- Equitable Expansion: Most recent benchmarks aim to close geolinguistic imbalance by dramatically expanding language and script coverage (M3TQA: 97 languages, 12 families (Shu et al., 22 Aug 2025)). Coverage of endangered, regional, and African languages has become more common (Afri-MCQA (Tonja et al., 9 Jan 2026), CVQA (Romero et al., 2024)).
- Cultural Robustness: Systematic frameworks (XLQA (Roh et al., 22 Aug 2025), CVQA (Romero et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)) probe sensitivity to regional entities, implicit culture, and stereotype risk.
- Multimodal and Multitask Probing: The shift to complex modalities (table, chart, cross-modal retrieval, document/scene images, speech) calls for specialized architectures, culturally aware pretraining, and evaluation metrics tolerant of OCR/tokenizer errors and regional data drift (Xu et al., 16 Jul 2025, Zhang et al., 24 Feb 2025, Tang et al., 2024).
- Evaluation Metrics Extension: Beyond EM/F1/ROUGE, proposed metrics include relaxed numeric accuracy, Jaccard, macro- vs. micro-averaging, and potential cultural consistency or region-aware rewards (Roh et al., 22 Aug 2025).
- Human-in-the-Loop Verification: For high-stakes or locale-sensitive QA, benchmarks advocate for manual checks and LLM-as-judge validation (Roh et al., 22 Aug 2025, Rohera et al., 2024, Tonja et al., 9 Jan 2026).
- Cross-market and Cross-resource Transfer: Benchmarks such as MCPQA (Yuan et al., 2024) experimentally demonstrate that cross-market retrieval drastically boosts performance for low-resource language markets.
- Speech and Audio: Multimodal speech–vision–language QA benchmarks (AVQA (Phukan et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)) identify speech recognition (ASR/WER), language identification (LID), and cultural grounding as primary bottlenecks.
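The macro- vs. micro-averaging distinction raised under Evaluation Metrics Extension matters whenever per-language sample counts are unbalanced. A minimal sketch with hypothetical per-language F1 scores:

```python
def macro_average(per_language):
    """Mean of per-language means: every language weighs equally."""
    scores = [sum(s) / len(s) for s in per_language.values()]
    return sum(scores) / len(scores)

def micro_average(per_language):
    """Mean over all examples: languages with more items dominate."""
    all_scores = [x for s in per_language.values() for x in s]
    return sum(all_scores) / len(all_scores)

# Hypothetical scores: many high-scoring English items, few low-scoring Yoruba items.
results = {
    "en": [0.9] * 90,   # 90 examples
    "yo": [0.3] * 10,   # 10 examples
}
```

With this split, micro-averaging reports 0.84 while macro-averaging reports 0.60; macro-averaging therefore surfaces the low-resource gap that a single pooled score hides.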
A plausible implication is that future research in multilingual QA should prioritize active data collection by native speakers, expansion of synthetic yet human-verified corpora, modality-specific adaptation, and robust cross-lingual/cross-modal evaluation protocols. The field continues to move toward benchmarks that simultaneously assess equity, cultural competence, and technical rigor across the global language spectrum.