
Multilingual QA Benchmarks

Updated 28 January 2026
  • Multilingual QA Benchmarks are standardized tasks that assess question answering systems across diverse languages and modalities.
  • They utilize rigorous dataset construction, translation pipelines, and evaluation metrics like EM, F1, and BLEU for cross-lingual comparability.
  • Emerging research focuses on overcoming cultural, low-resource, and modality-specific challenges to improve global QA performance.

Multilingual Question Answering Benchmarks provide standardized tasks and corpora for evaluating question answering (QA) systems across multiple languages and modalities. These benchmarks address the need for rigorous, comparative assessment of cross-lingual transfer, cultural understanding, and domain-specific reasoning, going beyond English-centric benchmarks to cover underrepresented scripts, low-resource languages, and multiple modalities (text, tables, images, audio, speech). The following sections detail the design principles, dataset construction methodologies, evaluation protocols, cross-lingual challenges, representative benchmarks, and emerging research directions.

1. Design Principles and Dataset Construction

Multilingual QA benchmarks are characterized by careful language selection, answer normalization, and annotation pipelines designed to ensure cross-lingual comparability.

2. Evaluation Frameworks and Metrics

Rigorous evaluation protocols in multilingual QA benchmarks ensure reproducibility and allow comparison across systems and languages. Core metrics include:

| Metric | Definition | Typical Use |
|---|---|---|
| Exact Match (EM) | $\text{EM} = \tfrac{1}{N} \sum_{i=1}^N \mathbf{1}[\hat a_i = a_i]$ | Span/match questions |
| Token-level F1 | Precision, recall, and F1 on answer tokens (see MLQA/MKQA) | Partial credit + fuzziness |
| ROUGE-L | LCS-based F-measure: $\text{ROUGE-L} = \frac{(1+\beta^2)\, P_{\mathrm{lcs}}\, R_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^2 P_{\mathrm{lcs}}}$ | Generative/abstractive QA |
| BLEU | n-gram overlap (esp. for generative answer evaluation) | Text and speech QA |
| Relaxed Numeric Acc. | $\text{Acc}_{\text{relaxed}} = \tfrac{1}{N} \sum_{i=1}^N \mathbf{1}[\lvert\hat{y}_i - y_i\rvert \leq 0.05\, y_i]$ | Chart/table reasoning |
| Accuracy | Fraction of correct predictions (multiple-choice, classification) | Multiple-choice, VQA, KGQA |
| MRR, P@k | Ranking metrics (product QA, question retrieval) | Cross-market, retrieval QA |

Benchmarks may employ both reference-based (EM, F1, ROUGE, BLEU) and “judge LLM” metrics (L3Cube-IndicQuest (Rohera et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)). For open-ended, speech, or culturally sensitive tasks, human-in-the-loop or LLM-judge scoring may complement automatic measures.
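As a concrete illustration, the span-level metrics above (EM, token-level F1, relaxed numeric accuracy) can be sketched in a few lines of Python. The SQuAD-style answer normalization shown here is one common convention, not prescribed by any single benchmark; details such as the article list and how the tolerance treats negative values vary across datasets.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and
    English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    """EM: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Token-level F1: partial credit for overlapping answer tokens."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def relaxed_numeric_acc(preds, golds, tol=0.05):
    """Relaxed numeric accuracy: a prediction counts as correct if it
    lies within a relative tolerance (5% by default) of the gold value
    (abs() guards against negative gold values)."""
    hits = sum(abs(p - g) <= tol * abs(g) for p, g in zip(preds, golds))
    return hits / len(golds)
```

Per-example scores are then averaged over the evaluation set, typically per language, to produce the benchmark numbers reported below.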

3. Key Cross-Lingual and Cross-Cultural Challenges

  • Language Resource Imbalance: Benchmarks such as MKQA (Longpre et al., 2020), PolyChartQA (Xu et al., 16 Jul 2025), and Indic QA (Singh et al., 2024) demonstrate that performance on high-resource languages far exceeds that on low-resource or non-Latin-script languages, often by 10–50 F1 points.
  • Modality-Specific Gaps: Multimodal (visual, audio, table, chart) QA benchmarks (PolyChartQA (Xu et al., 16 Jul 2025), CVQA (Romero et al., 2024), MTVQA (Tang et al., 2024), AVQA (Phukan et al., 2024), M3TQA (Shu et al., 22 Aug 2025)) reveal that existing models are not robust to script variance and complex visual-linguistic alignment. For instance, PolyChartQA reports sharp accuracy drops on Bengali and Urdu (non-Latin, low-resource), and MTVQA observes accuracies around 30% on non-English languages versus 80%+ in English-only evaluation.
  • Locale Awareness and Cultural Sensitivity: XLQA (Roh et al., 22 Aug 2025) explicitly annotates and benchmarks locale-sensitive vs. locale-invariant questions, with LLMs showing 10–30 F1 point drops on locale-sensitive types. CVQA and Afri-MCQA further demonstrate that models often fail on culturally grounded questions or images.
  • Impact of Translation Quality: Back-translation filtering, semantic consistency checks, and automatic/human verification are standard protocols. Despite rigorous pipelines, translation-based benchmarks still report degradation, especially for morphologically complex or low-resource languages.
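The round-trip filtering idea in the last bullet can be sketched as follows. This is a minimal illustration with pluggable `translate`/`back_translate` callables and a token-overlap similarity as a stand-in for semantic consistency; production pipelines would plug in real MT systems and embedding- or model-based similarity, and route flagged items to human verification.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity: a cheap proxy for the semantic
    consistency checks used in real pipelines."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa or sb else 1.0

def back_translation_filter(questions, translate, back_translate, threshold=0.6):
    """Keep questions whose round trip (source -> target -> source)
    stays close to the original; flag the rest for human review."""
    kept, flagged = [], []
    for q in questions:
        round_trip = back_translate(translate(q))
        (kept if jaccard(q, round_trip) >= threshold else flagged).append(q)
    return kept, flagged
```

The threshold is a dataset-specific choice; morphologically rich languages generally need looser similarity criteria or lemmatized comparison.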

4. Representative Benchmarks

| Benchmark | Modalities | # Languages | Scale | Task Types | Key Features |
|---|---|---|---|---|---|
| MKQA (Longpre et al., 2020) | Text | 26 | 10k × 26 = 260k | Open-domain QA, retrieval-free | Wikidata entity linking; parallel queries |
| MLQA (Lewis et al., 2019) | Text | 7 | 12.7k EN, ~5k/other | Extractive QA on Wiki contexts | 4-way parallel, reference-aligned |
| PolyChartQA (Xu et al., 16 Jul 2025) | Chart images | 10 | 22,606 charts, 26k QAs | Chart-based VQA, 16 chart types | Decoupled translation+render, METEOR QC |
| M3TQA (Shu et al., 22 Aug 2025) | Tables | 97 | 2,916 QA pairs | Numerical, extraction, verification | 12 families, 6-step LLM + BLEU QC |
| MULTITAT (Zhang et al., 24 Feb 2025) | Table+Text | 11 | 250 parallel inst. | Span, arithmetic, count | Prompt-based baseline, error taxonomy |
| CVQA (Romero et al., 2024) | Images | 31 | 4,560 imgs, 9,044 Q | MC-VQA, open-ended VQA | Culturally authored, high script diversity |
| Afri-MCQA (Tonja et al., 9 Jan 2026) | Vision, audio | 15 (Africa) | 7,500 MCQ + audio | MC-VQA, open/speech-based | Text, speech, cultural focus, English/nat. |
| AVQA (Phukan et al., 2024) | Video + audio | 8 | 45–57k QA pairs/lang | Existential, location, temporal | Frozen encoder fusion, MT + human QC |
| EXAMS (Hardalov et al., 2020) | Text | 16 | 24,143 MCQs | Multi-subject, cross-lingual MCQ | High school exams, 8 families, 24 subjects |
| XLQA (Roh et al., 22 Aug 2025) | Text | 8 | 24,000 QAs | Open-domain, locale-sensitive/inv. | LLM-based translation + locale annotation |
| MTVQA (Tang et al., 2024) | Doc/scene img | 9 | 6,778 QAs/test | Text recovery, reasoning (TEC-VQA) | Fully manual align, visual-text focus |

These benchmarks span text, multimodal, and cross-domain QA, and include specializations for product QA (MCPQA (Yuan et al., 2024)), knowledge-graph QA (QALD-9-plus (Perevalov et al., 2022)), factual/abstractive QA (Indic QA (Singh et al., 2024)), and benchmarks that expose resource gaps in extremely low-resource settings (DZEN (Hosain et al., 24 May 2025)).

5. Empirical Results and Model Comparisons

6. Methodological and Practical Implications

  • Benchmarking Protocols: Multi-level QC—incorporating both automatic and human checks, back-translation, and error taxonomy—is now standard in high-quality benchmarks (M3TQA (Shu et al., 22 Aug 2025), PolyChartQA (Xu et al., 16 Jul 2025), XLQA (Roh et al., 22 Aug 2025), QALD-9-plus (Perevalov et al., 2022)).
  • Task Diversity: Beyond extractive QA, current benchmarks probe aggregation (arithmetic), open-ended generation, factual verification, locale/culture specificity, and cross-market/domain transfer (MCPQA (Yuan et al., 2024)).
  • Assessment Paradigms: Dual scoring using reference (EM, F1, ROUGE) and LLM-based or human-judge ratings (e.g., for factuality, conciseness, cultural adherence) allows granular error analysis (L3Cube-IndicQuest (Rohera et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)).
  • Cross-Lingual Model Development: Analysis in M3TQA (Shu et al., 22 Aug 2025) and MULTITAT (Zhang et al., 24 Feb 2025) demonstrates that synthetic, LLM-generated multilingual data and instruction tuning can improve zero-shot performance, especially for low-resource scripts. However, cross-lingual linking and modality alignment remain core bottlenecks.
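The dual-scoring paradigm above can be sketched as follows. `ref_metric` and `judge` are hypothetical hooks standing in for a reference-based metric (EM/F1/ROUGE) and an LLM- or human-judge rubric; the disagreement report is one way to drive the granular error analysis these benchmarks describe.

```python
def dual_score(pred, gold, ref_metric, judge):
    """Score one prediction under both paradigms: a reference-based
    metric and a rubric rating from a judge (LLM or human)."""
    return {"reference": ref_metric(pred, gold), "judge": judge(pred, gold)}

def disagreements(examples, ref_metric, judge, ref_cut=0.5, judge_cut=3):
    """Surface items where the two paradigms disagree: typically
    paraphrased-but-correct answers (reference low, judge high) or
    fluent-but-wrong answers (judge fooled, reference low too)."""
    out = []
    for pred, gold in examples:
        s = dual_score(pred, gold, ref_metric, judge)
        if (s["reference"] >= ref_cut) != (s["judge"] >= judge_cut):
            out.append((pred, gold, s))
    return out
```

In practice the judge would be prompted with a rubric (factuality, conciseness, cultural adherence) and its ratings calibrated against a small human-annotated sample.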

7. Open Challenges and Future Directions

  • Equitable Expansion: Most recent benchmarks aim to close geolinguistic imbalance by dramatically expanding language and script coverage (M3TQA: 97 languages, 12 families (Shu et al., 22 Aug 2025)). Coverage of endangered, regional, and African languages has become more common (Afri-MCQA (Tonja et al., 9 Jan 2026), CVQA (Romero et al., 2024)).
  • Cultural Robustness: Systematic frameworks (XLQA (Roh et al., 22 Aug 2025), CVQA (Romero et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)) probe sensitivity to regional entities, implicit culture, and stereotype risk.
  • Multimodal and Multitask Probing: The shift to complex modalities (table, chart, cross-modal retrieval, document/scene images, speech) calls for specialized architectures, culturally aware pretraining, and evaluation metrics tolerant of OCR/tokenizer errors and regional data drift (Xu et al., 16 Jul 2025, Zhang et al., 24 Feb 2025, Tang et al., 2024).
  • Evaluation Metrics Extension: Beyond EM/F1/ROUGE, proposed metrics include relaxed numeric accuracy, Jaccard, macro- vs. micro-averaging, and potential cultural consistency or region-aware rewards (Roh et al., 22 Aug 2025).
  • Human-in-the-Loop Verification: For high-stakes or locale-sensitive QA, benchmarks advocate for manual checks and LLM-as-judge validation (Roh et al., 22 Aug 2025, Rohera et al., 2024, Tonja et al., 9 Jan 2026).
  • Cross-market and Cross-resource Transfer: Benchmarks such as MCPQA (Yuan et al., 2024) experimentally demonstrate that cross-market retrieval drastically boosts performance for low-resource language markets.
  • Speech and Audio: Multimodal speech–vision–language QA benchmarks (AVQA (Phukan et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)) identify speech recognition (ASR/WER), language identification (LID), and cultural grounding as primary bottlenecks.
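The macro- vs. micro-averaging distinction mentioned above can be made concrete with a short sketch (illustrative function name, not drawn from any cited benchmark):

```python
def micro_macro(scores_by_lang):
    """Micro average pools all examples, so it is dominated by languages
    with more data; macro averages per-language means, so every language
    counts equally -- the equity-oriented view."""
    pooled = [s for scores in scores_by_lang.values() for s in scores]
    micro = sum(pooled) / len(pooled)
    per_lang = [sum(v) / len(v) for v in scores_by_lang.values()]
    macro = sum(per_lang) / len(per_lang)
    return micro, macro
```

A large gap between the two numbers is itself a diagnostic: it signals that headline (micro) scores are propped up by high-resource languages.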

A plausible implication is that future research in multilingual QA should prioritize active data collection by native speakers, expansion of synthetic yet human-verified corpora, modality-specific adaptation, and robust cross-lingual/cross-modal evaluation protocols. The field continues to move toward benchmarks that simultaneously assess equity, cultural competence, and technical rigor across the global language spectrum.
