Multilingual QA Benchmarks
- Multilingual QA Benchmarks are standardized tasks that assess question answering systems across diverse languages and modalities.
- They utilize rigorous dataset construction, translation pipelines, and evaluation metrics like EM, F1, and BLEU for cross-lingual comparability.
- Emerging research focuses on overcoming cultural, low-resource, and modality-specific challenges to improve global QA performance.
Multilingual Question Answering Benchmarks provide standardized tasks and corpora for evaluating question answering (QA) systems across multiple languages and modalities. These benchmarks address the need for rigorous, comparative assessment of cross-lingual transfer, cultural understanding, and domain-specific reasoning, going beyond English-centric benchmarks to cover underrepresented scripts, low-resource languages, and multiple modalities (text, tables, images, audio, speech). The following sections detail the design principles, dataset construction methodologies, evaluation protocols, cross-lingual challenges, representative benchmarks, and emerging research directions.
1. Design Principles and Dataset Construction
Multilingual QA benchmarks are characterized by careful language selection, answer normalization, and annotation pipelines designed to guarantee cross-lingual comparability:
- Parallel Data and Language Coverage: Benchmarks such as MKQA (Longpre et al., 2020) and MLQA (Lewis et al., 2019) construct thousands of question–answer pairs with parallel alignment across typologically diverse languages. MKQA covers 26 languages from 14 branches, while MLQA offers 4-way-parallel extractive QA across seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Chinese.
- Domain and Task Diversity: Benchmarks target open-domain QA (MKQA, MLQA, XQuAD), domain-specific contexts (EXAMS (Hardalov et al., 2020), L3Cube-IndicQuest (Rohera et al., 2024), DZEN (Hosain et al., 24 May 2025) for science/education), table reasoning (M3TQA (Shu et al., 22 Aug 2025), MULTITAT (Zhang et al., 24 Feb 2025)), chart understanding (PolyChartQA (Xu et al., 16 Jul 2025)), cultural VQA (CVQA (Romero et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)), audio-visual QA (AVQA (Phukan et al., 2024)), and locale-sensitive open-domain QA (XLQA (Roh et al., 22 Aug 2025)).
- Translation and Quality Control: Many benchmarks employ high-quality translation pipelines with human experts or state-of-the-art LLMs, followed by back-translation and manual validation to minimize semantic drift (see M3TQA (Shu et al., 22 Aug 2025), PolyChartQA (Xu et al., 16 Jul 2025), XLQA (Roh et al., 22 Aug 2025), QALD-9-plus (Perevalov et al., 2022)). Quality thresholds are often set via BLEU (e.g., median BLEU=60.19 for M3TQA) or METEOR.
- Cultural and Regional Grounding: CVQA (Romero et al., 2024), L3Cube-IndicQuest (Rohera et al., 2024), and Afri-MCQA (Tonja et al., 9 Jan 2026) emphasize local expertise, cultural diversity, and representation of both global and regional knowledge, in contrast to translation-only extensions.
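The back-translation quality gate described under Translation and Quality Control can be sketched as follows. This is a minimal illustration, not any benchmark's actual pipeline: `sentence_bleu` here is a simplified sentence-level variant (uniform 1–4-gram weights, add-one smoothing), and the threshold of 60.0 only loosely mirrors the M3TQA-style cutoff.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified sentence-level BLEU (0-100): uniform 1..max_n-gram weights,
    add-one smoothing, standard brevity penalty. Whitespace tokenization."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ng, hyp_ng = ngram_counts(ref, n), ngram_counts(hyp, n)
        overlap = sum((ref_ng & hyp_ng).values())       # clipped n-gram matches
        total = max(len(hyp) - n + 1, 0)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages short hypotheses
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100.0 * bp * math.exp(log_prec / max_n)

def passes_backtranslation_qc(source_en, back_translated_en, threshold=60.0):
    """Keep a translated item only if its back-translation stays close to the source."""
    return sentence_bleu(source_en, back_translated_en) >= threshold
```

In a real pipeline this automatic gate is typically followed by manual validation of the items that fall near the threshold.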
2. Evaluation Frameworks and Metrics
Rigorous evaluation protocols in multilingual QA benchmarks ensure reproducibility and allow comparison across systems and languages. Core metrics include:
| Metric | Definition | Typical Use |
|---|---|---|
| Exact Match (EM) | Binary: normalized prediction string equals a reference answer | Extractive/span QA |
| Token-level F1 | Precision, recall, and F1 over overlapping answer tokens (see MLQA/MKQA) | Partial credit for near-miss spans |
| ROUGE-L | Longest-common-subsequence recall and precision | Generative/abstractive QA |
| BLEU | n-gram overlap (esp. for generative answer evaluation) | Text and speech QA |
| Relaxed Numeric Acc. | Numeric answers accepted within a small tolerance of the reference | Chart/table reasoning |
| Accuracy | Fraction of correct predictions | Multiple-choice, VQA, KGQA |
| MRR, P@k | Mean reciprocal rank and precision-at-k over ranked candidates | Cross-market, retrieval QA |
Benchmarks may employ both reference-based (EM, F1, ROUGE, BLEU) and “judge LLM” metrics (L3Cube-IndicQuest (Rohera et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)). For open-ended, speech, or culturally sensitive tasks, human-in-the-loop or LLM-judge scoring may complement automatic measures.
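The span-level metrics above can be made concrete. The sketch below follows the widely used SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the relaxed numeric matcher and its 5% tolerance are illustrative defaults, not any specific benchmark's setting.

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation, articles, extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Token-overlap F1: partial credit for near-miss spans."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())  # clipped token overlap
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

def relaxed_numeric_match(prediction, reference, tolerance=0.05):
    """Accept numeric answers within a relative tolerance (illustrative 5% default);
    fall back to exact match for non-numeric strings."""
    try:
        p, r = float(prediction), float(reference)
    except ValueError:
        return exact_match(prediction, reference)
    if r == 0:
        return float(p == 0)
    return float(abs(p - r) / abs(r) <= tolerance)
```

For example, `token_f1("paris france", "paris")` yields 2/3: precision 0.5, recall 1.0, rewarding the overlapping token while penalizing the extra one.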
3. Key Cross-Lingual and Cross-Cultural Challenges
- Language Resource Imbalance: Benchmarks such as MKQA (Longpre et al., 2020), PolyChartQA (Xu et al., 16 Jul 2025), and Indic QA (Singh et al., 2024) demonstrate that performance on high-resource languages far exceeds that on low-resource or non-Latin-script languages, often by 10–50 F1 points.
- Modality-Specific Gaps: Multimodal (visual, audio, table, chart) QA benchmarks (PolyChartQA (Xu et al., 16 Jul 2025), CVQA (Romero et al., 2024), MTVQA (Tang et al., 2024), AVQA (Phukan et al., 2024), M3TQA (Shu et al., 22 Aug 2025)) reveal that existing models are not robust to script variance and complex visual-linguistic alignment. For instance, PolyChartQA reports sharp accuracy drops on Bengali and Urdu (non-Latin, low-resource), and MTVQA observes accuracy near 30% on non-English languages versus 80%+ on English.
- Locale Awareness and Cultural Sensitivity: XLQA (Roh et al., 22 Aug 2025) explicitly annotates and benchmarks locale-sensitive vs. locale-invariant questions, with LLMs showing 10–30 F1 point drops on locale-sensitive types. CVQA and Afri-MCQA further demonstrate that models often fail on culturally grounded questions or images.
- Impact of Translation Quality: Back-translation filtering, semantic consistency checks, and automatic/human verification are standard protocols. Despite rigorous pipelines, translation-based benchmarks still report degradation, especially for morphologically complex or low-resource languages.
4. Representative Benchmarks
| Benchmark | Modalities | # Languages | Scale | Task Types | Key Features |
|---|---|---|---|---|---|
| MKQA (Longpre et al., 2020) | Text | 26 | 10k × 26 = 260k | Open-domain QA, retrieval free | Wikidata entity linking; parallel queries |
| MLQA (Lewis et al., 2019) | Text | 7 | 12.7k EN, ~5k/other | Extractive QA on Wiki contexts | 4-way parallel, reference-aligned |
| PolyChartQA (Xu et al., 16 Jul 2025) | Chart images | 10 | 22,606 charts, 26k QAs | Chart-based VQA, 16 chart types | Decoupled translation+render, METEOR QC |
| M3TQA (Shu et al., 22 Aug 2025) | Tables | 97 | 2,916 QA pairs | Numerical, extraction, verification | 12 families, 6-step LLM + BLEU QC |
| MULTITAT (Zhang et al., 24 Feb 2025) | Table+Text | 11 | 250 parallel inst. | Span, arithmetic, count | Prompt-based baseline, error taxonomy |
| CVQA (Romero et al., 2024) | Images | 31 | 4,560 imgs, 9,044 Q | MC-VQA, open-ended VQA | Culturally authored, high script diversity |
| Afri-MCQA (Tonja et al., 9 Jan 2026) | Vision, audio | 15 (Africa) | 7,500 MCQ + audio | MC-VQA, open/speech-based | Text, speech, cultural focus; English + native languages |
| AVQA (Phukan et al., 2024) | Video + audio | 8 | 45–57k QA pairs/lng | Existential, location, temporal | Frozen encoder fusion, MT + human QC |
| EXAMS (Hardalov et al., 2020) | Text | 16 | 24,143 MCQs | Multi-subject, cross-lingual MCQ | High school exams, 8 families, 24 subjects |
| XLQA (Roh et al., 22 Aug 2025) | Text | 8 | 24,000 QAs | Open-domain, locale-sensitive/inv. | LLM-based translation + locale annotation |
| MTVQA (Tang et al., 2024) | Doc/scene img | 9 | 6,778 QAs/test | Text recovery, reasoning (TEC-VQA) | Fully manual align, visual-text focus |
These benchmarks span text, multimodal, and cross-domain QA, and include specializations for product QA (MCPQA (Yuan et al., 2024)), knowledge-graph QA (QALD-9-plus (Perevalov et al., 2022)), factual/abstractive QA (Indic QA (Singh et al., 2024)), and resource/actionable gaps for extremely low-resource settings (DZEN (Hosain et al., 24 May 2025)).
5. Empirical Results and Model Comparisons
- Performance Gaps: Across nearly all benchmarks, state-of-the-art LLMs (e.g., GPT-4o, Gemini, Qwen-series) excel in English but lag 10–50 F1 points on low-resource and non-Latin-script languages (Longpre et al., 2020, Xu et al., 16 Jul 2025, Phukan et al., 2024, Rohera et al., 2024, Singh et al., 2024, Tonja et al., 9 Jan 2026).
- Zero-Shot vs. Instruction/Few-Shot: Instruction tuning and few-shot paradigms provide moderate gains, while translation-augmented pipelines yield relative boosts, but do not eliminate gaps (Singh et al., 2024, Hosain et al., 24 May 2025, Zhang et al., 24 Feb 2025).
- Prompt Engineering and Locale Injection: In XLQA (Roh et al., 22 Aug 2025), explicit locale cues in prompts can improve performance by up to +25 F1 for Japanese, but risk stereotype amplification.
- Cultural and Region-Specific Gaps: L3Cube-IndicQuest (Rohera et al., 2024) and Afri-MCQA (Tonja et al., 9 Jan 2026) show that regional and cultural questions expose wider gaps, particularly in geography and history, for Indic and African low-resource languages.
- Modality-Specific Patterns: For chart and multimodal QA, vision–LLMs (Gemini, GPT-4o) outperform open-weight models, but both underperform on charts and images with local scripts, region-specific icons, or unfamiliar scene structure (Xu et al., 16 Jul 2025, Romero et al., 2024).
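The locale-injection idea can be sketched as a thin prompt wrapper; the cue wording below is hypothetical, not XLQA's actual template.

```python
def build_prompt(question, locale=None):
    """Prepend an explicit locale cue to an open-domain QA prompt.

    The cue phrasing is illustrative; benchmarks such as XLQA probe whether
    cues like this shift answers on locale-sensitive questions
    (e.g. emergency phone numbers, national holidays).
    """
    cue = f"Answer for a user located in {locale}. " if locale else ""
    return f"{cue}Question: {question}\nAnswer:"
```

The same wrapper supports A/B comparison of locale-cued vs. uncued prompts on locale-sensitive question sets, which is how prompt-level gains (and stereotype-amplification risks) can be measured.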
6. Methodological and Practical Implications
- Benchmarking Protocols: Multi-level QC—incorporating both automatic and human checks, back-translation, and error taxonomy—is now standard in high-quality benchmarks (M3TQA (Shu et al., 22 Aug 2025), PolyChartQA (Xu et al., 16 Jul 2025), XLQA (Roh et al., 22 Aug 2025), QALD-9-plus (Perevalov et al., 2022)).
- Task Diversity: Beyond extractive QA, current benchmarks probe aggregation (arithmetic), open-ended generation, factual verification, locale/culture specificity, and cross-market/domain transfer (MCPQA (Yuan et al., 2024)).
- Assessment Paradigms: Dual scoring using reference (EM, F1, ROUGE) and LLM-based or human-judge ratings (e.g., for factuality, conciseness, cultural adherence) allows granular error analysis (L3Cube-IndicQuest (Rohera et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)).
- Cross-Lingual Model Development: Analysis in M3TQA (Shu et al., 22 Aug 2025) and MULTITAT (Zhang et al., 24 Feb 2025) demonstrates that synthetic, LLM-generated multilingual data and instruction tuning can improve zero-shot performance, especially for low-resource scripts. However, cross-lingual linking and modality alignment remain core bottlenecks.
7. Open Challenges and Future Directions
- Equitable Expansion: Most recent benchmarks aim to close geolinguistic imbalance by dramatically expanding language and script coverage (M3TQA: 97 languages, 12 families (Shu et al., 22 Aug 2025)). Coverage of endangered, regional, and African languages has become more common (Afri-MCQA (Tonja et al., 9 Jan 2026), CVQA (Romero et al., 2024)).
- Cultural Robustness: Systematic frameworks (XLQA (Roh et al., 22 Aug 2025), CVQA (Romero et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)) probe sensitivity to regional entities, implicit culture, and stereotype risk.
- Multimodal and Multitask Probing: The shift to complex modalities (table, chart, cross-modal retrieval, document/scene images, speech) calls for specialized architectures, culturally aware pretraining, and evaluation metrics tolerant of OCR/tokenizer errors and regional data drift (Xu et al., 16 Jul 2025, Zhang et al., 24 Feb 2025, Tang et al., 2024).
- Evaluation Metrics Extension: Beyond EM/F1/ROUGE, proposed metrics include relaxed numeric accuracy, Jaccard, macro- vs. micro-averaging, and potential cultural consistency or region-aware rewards (Roh et al., 22 Aug 2025).
- Human-in-the-Loop Verification: For high-stakes or locale-sensitive QA, benchmarks advocate for manual checks and LLM-as-judge validation (Roh et al., 22 Aug 2025, Rohera et al., 2024, Tonja et al., 9 Jan 2026).
- Cross-market and Cross-resource Transfer: Benchmarks such as MCPQA (Yuan et al., 2024) experimentally demonstrate that cross-market retrieval drastically boosts performance for low-resource language markets.
- Speech and Audio: Multimodal speech–vision–language QA benchmarks (AVQA (Phukan et al., 2024), Afri-MCQA (Tonja et al., 9 Jan 2026)) identify speech recognition (ASR/WER), language identification (LID), and cultural grounding as primary bottlenecks.
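The macro- vs. micro-averaging distinction raised under Evaluation Metrics Extension matters whenever per-language sample counts are unbalanced. A minimal sketch with hypothetical per-language F1 scores:

```python
def macro_average(per_language):
    """Mean of per-language means: every language weighs equally."""
    scores = [sum(s) / len(s) for s in per_language.values()]
    return sum(scores) / len(scores)

def micro_average(per_language):
    """Mean over all examples: languages with more items dominate."""
    all_scores = [x for s in per_language.values() for x in s]
    return sum(all_scores) / len(all_scores)

# Hypothetical scores: many high-scoring English items, few low-scoring Yoruba items.
results = {
    "en": [0.9] * 90,   # 90 examples
    "yo": [0.3] * 10,   # 10 examples
}
```

With this split, micro-averaging reports 0.84 while macro-averaging reports 0.60; macro-averaging therefore surfaces the low-resource gap that a single pooled score hides.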
A plausible implication is that future research in multilingual QA should prioritize active data collection by native speakers, expansion of synthetic yet human-verified corpora, modality-specific adaptation, and robust cross-lingual/cross-modal evaluation protocols. The field continues to move toward benchmarks that simultaneously assess equity, cultural competence, and technical rigor across the global language spectrum.