Mu-SHROOM at SemEval-2025
- The paper introduces Mu-SHROOM, a novel multilingual shared task that formulates hallucination detection as a precise span-labeling problem.
- It presents a comprehensive annotation scheme and robust evaluation metrics to assess fact-consistency across 14 typologically diverse languages.
- Methodological innovations, including retrieval-augmented generation and prompt-driven paradigms, demonstrate significant improvements in detection performance.
Mu-SHROOM is a multilingual shared task introduced at SemEval-2025, addressing the detection and localization of hallucinations and overgeneration mistakes in outputs from instruction-tuned LLMs. The challenge frames hallucination detection as a span-labeling problem at character or token level in fourteen typologically diverse languages. The formalization, comprehensive annotation scheme, and robust evaluation metrics make Mu-SHROOM the most ambitious factuality detection benchmark to date, encompassing 2,618 system submissions from 43 international teams (Vázquez et al., 16 Apr 2025).
1. Motivation and Definition
LLMs frequently generate fluent but factually incorrect statements, termed hallucinations, and may add unsupported details—overgeneration mistakes—that erode user trust. Mu-SHROOM defines hallucinations as content in an LLM answer that is unsupported by, or directly contradicts, an authoritative Wikipedia page associated with the reference question. Overgeneration is operationalized as content extending beyond the scope of the source/prompt or introducing irrelevant/excessive detail. By reframing detection as a precise span selection problem, Mu-SHROOM enables evaluation of span-level detection systems and supports research into fact-consistency across languages (Vázquez et al., 16 Apr 2025, Bala et al., 25 Mar 2025).
2. Dataset Design and Annotation Protocols
The Mu-SHROOM dataset was constructed by selecting 762 multilingual Wikipedia pages, generating knowledge-intensive closed questions per page, and answering them using outputs from 38 open-weight instruction-tuned LLMs with diverse hyperparameters. From these, one answer per question per language was manually selected for annotation.
Annotation protocols required human annotators to highlight the minimal spans (at character or token granularity) that would need to be edited or deleted for the answer to be fully supported by the reference Wikipedia page. Key points include:
- Gold annotation was limited to the provided Wikipedia page, with auxiliary cross-page consultation encouraged but tracked.
- Token-level majority voting determined hard-span gold labels; annotator-wise overlap produced soft per-character scores reflecting disagreement.
- Inter-annotator agreement was measured with IoU: for annotator a with marked character set S_a and gold character set S_gold, IoU_a = |S_a ∩ S_gold| / |S_a ∪ S_gold|, with aggregate agreement computed as the mean of IoU_a over annotators.
- Annotation quality varied cross-lingually (mean IoU: English ≈0.49, Spanish ≈0.58, Italian up to 0.87), reflecting linguistic complexity and pool size differences.
- The split includes 50 validation and 150 test items for ten standard languages and ≈100 test items for each surprise language (Vázquez et al., 16 Apr 2025).
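The character-level IoU used for agreement can be sketched as follows; the span tuples and toy annotations are illustrative, not from the task release:

```python
def char_set(spans):
    """Expand (start, end) character spans into a set of character indices."""
    return {i for start, end in spans for i in range(start, end)}

def span_iou(spans_a, spans_b):
    """Character-level intersection-over-union of two span annotations."""
    a, b = char_set(spans_a), char_set(spans_b)
    if not a and not b:
        return 1.0  # both annotators marked nothing: treat as perfect agreement
    return len(a & b) / len(a | b)

# Toy example: two annotators marking hallucinated characters in one answer.
ann_1 = [(10, 25)]                # 15 characters
ann_2 = [(12, 25), (40, 44)]      # 13 + 4 characters, 13 shared with ann_1
print(span_iou(ann_1, ann_2))     # 13 / 19 ≈ 0.684
```

Soft gold labels arise from the same per-character view: averaging annotators' character indicators rather than intersecting them.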
3. Task Setup and Evaluation Metrics
Systems receive as input a language-specific question, a reference Wikipedia page, and an LLM-generated answer. The core outputs are:
- Hard labels: binary span predictions (character-level) for hallucinated regions.
- Soft labels: per-character (or per-token) probabilities indicating hallucination likelihood, reflecting annotator consensus.
Official metrics are:
- Intersection-over-Union (IoU): between the predicted binary span mask Ŝ and the gold mask S (majority vote over annotators), IoU(Ŝ, S) = |Ŝ ∩ S| / |Ŝ ∪ S|.
- Character-Level Spearman Correlation (ρ): between the vector p̂ of model-predicted per-character hallucination probabilities and the vector p of empirical annotator probabilities, computed rank-wise over characters.
- Additional references include reporting of precision, recall, and F1 for span-level boundary detection (Vázquez et al., 16 Apr 2025, Bala et al., 25 Mar 2025).
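A minimal sketch of the two official metrics, assuming hard labels arrive as binary per-character vectors and soft labels as probability lists (plain Python stand-ins, not the task's actual submission format):

```python
from scipy.stats import spearmanr

def hard_iou(pred, gold):
    """IoU between two binary per-character hallucination masks."""
    inter = sum(p and g for p, g in zip(pred, gold))
    union = sum(p or g for p, g in zip(pred, gold))
    return 1.0 if union == 0 else inter / union

def soft_corr(pred_probs, gold_probs):
    """Character-level Spearman correlation between probability vectors."""
    rho, _ = spearmanr(pred_probs, gold_probs)
    return float(rho)

print(hard_iou([1, 1, 0, 0], [1, 0, 0, 0]))          # overlap 1, union 2 -> 0.5
print(soft_corr([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]))   # perfectly monotone -> 1.0
```

The all-empty case needs a convention; here agreement on "no hallucination" counts as IoU 1.0.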
4. Methodological Innovations
Mu-SHROOM’s open evaluation environment seeded a diverse ecosystem of system architectures, with two broad methodological axes dominating top results:
- Retrieval-Augmented Generation (RAG): Majority of submissions, including top systems such as UCSC (Huang et al., 5 May 2025), CCNU (Liu et al., 17 May 2025), HalluSearch (Abdallah et al., 14 Apr 2025), and TUM-MiKaNi (Anschütz et al., 1 Jul 2025), implement RAG pipelines. These approaches typically:
- Retrieve supporting context (e.g., Wikipedia, web) based on the question and answer.
- Decompose answers into “atomic” claims via LLM prompting or semantic role labeling.
- Validate claims against retrieved knowledge using prompt-engineered LLMs or NLI (Natural Language Inference) models.
- Localize hallucinations to character or token spans by mapping LLM-extracted discrepancies back to the source answer.
- Prompt-Driven and Self-Consistency Paradigms: Some systems (e.g., AILS-NTUA (Karkani et al., 4 Mar 2025), keepitsimple (Vemula et al., 23 May 2025), and NCL-UoR (Hong et al., 2 Mar 2025)) employ training-free pipelines where LLMs are prompted (often in few-shot settings) to reason over translations, hypotheses, or sampled alternative generations, with entropy or consensus serving as hallucination signals.
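The final RAG step, mapping verifier-flagged claims back to character offsets in the answer, can be sketched with exact substring search; the answer, claims, and `locate_spans` helper below are hypothetical, and real systems use fuzzier alignment:

```python
def locate_spans(answer, unsupported_claims):
    """Map each unsupported claim string back to (start, end) character
    offsets in the original answer via exact substring search."""
    spans = []
    for claim in unsupported_claims:
        start = answer.find(claim)
        if start != -1:
            spans.append((start, start + len(claim)))
    return sorted(spans)

answer = "Marie Curie won two Nobel Prizes and was born in Warsaw in 1901."
# Suppose the verifier flagged the birth year as unsupported:
print(locate_spans(answer, ["in 1901"]))
```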
Other approaches span supervised token classifiers (HausaNLP with ModernBERT (Bala et al., 25 Mar 2025)), ensemble LLM adjudication (MSA (Hikal et al., 27 May 2025)), knowledge graph verification, and minimal cost revision via LLMs (Huang et al., 5 May 2025). The methodology landscape illustrates a spectrum from pure prompting with translation and self-consistency checks to complex ensemble systems combining external evidence retrieval and parametric learning.
5. Results and Empirical Insights
Mu-SHROOM received 2,618 submissions from 43 teams, with diverse methodological foci reflected in overall system performance and per-language rankings (Vázquez et al., 16 Apr 2025). Salient findings include:
- RAG Superiority: Retrieval-augmented pipelines achieved significantly higher IoU and correlation metrics than non-RAG systems (p < 10^-59 and p < 10^-39, respectively), indicating the necessity of external grounding for reliable hallucination detection.
- Fine-Tuning vs. Prompting: While 36.9% of runs were prompt-only, these did not outperform models with supervised fine-tuning on IoU and trailed on soft-label correlation (p < 0.002), mirroring prior SHROOM findings.
- Cross-Lingual Variability: Top systems consistently beat the “mark-all” baseline (IoU ≈0.22) by 30–50 points in high-resource settings, yet overall performance still lags in morphologically complex or low-resource languages (e.g., Basque, Farsi, Chinese). Cross-system and cross-language performance exhibits heterogeneity, further exacerbated by annotation disagreement (Spearman ρ ≈0.2–0.35 between annotator IoU and system performance).
- Highlighted Top Systems: UCSC achieved the highest mean ranking across all languages (IoU and correlation), winning on multiple languages and leveraging prompt optimization atop a modular RAG architecture (Huang et al., 5 May 2025). Other high-performing frameworks include CCNU (role-diverse LLM ensembles, best in Hindi) (Liu et al., 17 May 2025), AILS-NTUA (few-shot multilingual prompting by translation) (Karkani et al., 4 Mar 2025), MSA (LLM adjudication with fuzzy span matching, best in Arabic and Basque) (Hikal et al., 27 May 2025), HalluSearch (zero-shot factual splitting + RAG) (Abdallah et al., 14 Apr 2025), and TUM-MiKaNi (retrieval-fused BERT+SVR ensembles) (Anschütz et al., 1 Jul 2025).
6. Analysis of System Errors and Open Challenges
Despite gains, Mu-SHROOM surfaced fundamental challenges in factuality detection:
- Annotation Ambiguity and Inter-Annotator Disagreement: Even with detailed guidelines, token-level span boundaries of hallucinations are often subjective, with dominant errors arising from minor boundary mismatches and the inherent ambiguity of unsupported multi-word claims (Vázquez et al., 16 Apr 2025).
- Retrieval Failure Cascades: Inaccurate or missing external reference retrieval propagates downstream, causing both missed hallucinations and false positives. Morphologically rich and low-web-coverage languages are particularly prone to context retrieval errors (e.g., Basque, Farsi) (Abdallah et al., 14 Apr 2025, Anschütz et al., 1 Jul 2025).
- Boundary Localization: Systems frequently struggle with short, often disconnected hallucinated spans, especially when hallucinations are tightly interleaved with supported content. Boundary “snapping” and post-processing (fuzzy matching, as in MSA (Hikal et al., 27 May 2025)) were critical for top performance.
- Metric Stability and Cross-Language Adaptivity: Bootstrapped analysis indicates many ranking differences between systems are not statistically robust, reflecting small test sets and annotation variability. This highlights a need for larger benchmarks and more robust calibration of detection confidence.
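The boundary "snapping" noted above can be sketched as expanding a predicted character span outward to the nearest whitespace boundaries; this is a simplification of the fuzzy matching used by systems like MSA:

```python
def snap_to_words(text, start, end):
    """Expand a predicted (start, end) character span outward until both
    ends sit on a whitespace (or text) boundary."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end

text = "The tower was built in 1887 by Gustave Eiffel."
start, end = snap_to_words(text, 25, 27)  # a mid-word prediction
print(text[start:end])  # snaps outward to the whole token "1887"
```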
7. Directions for Future Research
Several key avenues emerge from the empirical synthesis of Mu-SHROOM:
- Mitigating Annotator Disagreement: Developing annotation frameworks and taxonomies to reduce subjectivity and unify hallucination definitions.
- Multilingual Expansion and Domain Adaptation: Extending benchmarks and methods to lower-resource and more typologically diverse languages, as well as domain-specific content (e.g., medical QA) (Vázquez et al., 16 Apr 2025, Anschütz et al., 1 Jul 2025).
- Retrieval Robustness and Calibration: Improving cross-lingual retrieval and integrating dense multilingual semantic retrievers to ensure comprehensive supporting context (Abdallah et al., 14 Apr 2025, Anschütz et al., 1 Jul 2025).
- Unified Detection–Mitigation Pipelines: Moving from detection to correction, including in-generation steering and dynamic retrieval-reasoning loops.
- Confidence Modeling: Calibrating probabilistic soft labels to reflect true hallucination severity and annotator certainty, potentially leveraging ensemble or Bayesian approaches (Liu et al., 17 May 2025, Hong et al., 2 Mar 2025).
- Ensembles and Open-Source Accessibility: Distilling gains from ensemble and LLM-adjudication frameworks into scalable, reproducible, open-source toolkits (Huang et al., 5 May 2025, Liu et al., 17 May 2025, Hikal et al., 27 May 2025).
Reference Table: Representative Top-Ranked Mu-SHROOM Systems
| System/Group | Core Approach | Multilingual Scope | Notable Result(s) | Reference |
|---|---|---|---|---|
| UCSC | Modular RAG + prompt opt | 14 languages | #1 avg. rank (IoU, Corr) | (Huang et al., 5 May 2025) |
| CCNU | Role-diverse LLM ensemble | 14 languages | #1 Hindi, Top-5 in 7 languages | (Liu et al., 17 May 2025) |
| MSA | LLM adjudication + fuzzy | 12 languages | #1 Arabic, Basque; top-3 in 8 languages | (Hikal et al., 27 May 2025) |
| AILS-NTUA | Prompt+translate+few-shot | 14 languages | IoU 0.753 Farsi, 0.587 Czech (first place) | (Karkani et al., 4 Mar 2025) |
| HalluSearch | Factual splitting + RAG | 14 languages | 4th English/Czech; high-coverage RAG | (Abdallah et al., 14 Apr 2025) |
| TUM-MiKaNi | RAG+BERT+SVR fusion | 14 languages | Top-10 in 8 languages; robust multilingual | (Anschütz et al., 1 Jul 2025) |
| NCL-UoR | Modified RefChecker/SelfCheckGPT | 14 languages | IoU=0.5310, Corr=0.5669 (avg.) | (Hong et al., 2 Mar 2025) |
Notes
- All factual, metric, and protocol details above are found verbatim or by close paraphrase in the relevant Mu-SHROOM papers. Interpretations such as “the dominant error mode is minor boundary mismatches” are supported directly by ablation and error analyses in the task overview and individual system reports.
- The diversity of approaches and evaluation results underlines that reliable and calibrated multilingual hallucination detection remains a challenging problem for LLMs. The introduction and broad adoption of Mu-SHROOM represent a significant milestone for the fact-consistency community and offer a solid platform for further progress (Vázquez et al., 16 Apr 2025).