
Hallucinations Leaderboard

Updated 19 February 2026
  • Hallucinations leaderboards are structured evaluation platforms that systematically quantify unsupported outputs in generative models using standardized datasets and robust metrics.
  • They employ diverse methodologies, including adversarial prompts and taxonomic annotations, to precisely identify error types and model weaknesses.
  • Leaderboard results guide enhancements in model tuning and retrieval augmentation, aiming to mitigate the risk of hallucinated claims.

A hallucinations leaderboard is a formal evaluation platform designed to systematically quantify, benchmark, and compare the propensity of large language models (LLMs), large vision-language models (LVLMs), and multimodal systems to generate hallucinated outputs—i.e., plausible-sounding but unsupported or false claims—across a wide range of tasks, modalities, and domains. Modern leaderboards provide not only aggregated performance rankings but also decompose error types, highlight contextual and task-based weaknesses, and incorporate robust, reproducible measurement methodologies that allow both academic and industrial practitioners to assess and reduce hallucination risk in deployed systems.

1. Hallucination Definitions and Taxonomies

Hallucinations are formally defined within two complementary frameworks—faithfulness and factuality. Faithfulness hallucinations are outputs inconsistent with a provided input context, whereas factuality hallucinations are outputs that contradict established world knowledge (Hong et al., 2024). Distinct benchmarks further refine this by introducing intrinsic vs. extrinsic hallucinations: intrinsic hallucinations contradict or extrapolate beyond the given context (e.g., source document, video frame), while extrinsic hallucinations introduce new information unsupported both by context and by training data (Bang et al., 24 Apr 2025). For LVLMs, a crucial distinction is between Type I hallucinations (open-ended, free-form generations that invent unsupported content) and Type II hallucinations (restricted-answer errors in response to narrow, fact-seeking prompts) (Kaul et al., 2024). Adversarial, real-world, and domain-specific leaderboards further expand the taxonomy to include agentic hallucinations (unfaithful actions in complex agentic environments) (Zhang et al., 28 Jul 2025), modality-specific phenomena (e.g., phonetic, semantic, and lexical errors in ASR (Koudounas et al., 18 Oct 2025); temporal, attribute, and object/interaction errors in video (Choong et al., 2024); knowledge-base alignment errors in ontology matching (OM) (Qiang et al., 25 Mar 2025)), and fine-grained typologies such as omission, missed deduction, invented, and extrinsic-correct (Pesiakhovsky et al., 26 Sep 2025).
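For annotation tooling, the distinctions above can be encoded directly as a small data model. The following sketch is illustrative only—the type and field names are hypothetical and not tied to any benchmark's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Grounding(Enum):
    """Which knowledge source the claim conflicts with."""
    INTRINSIC = "intrinsic"   # contradicts/extrapolates beyond the given context
    EXTRINSIC = "extrinsic"   # unsupported by both context and training data

class PromptRegime(Enum):
    """LVLM distinction from THRONE (Kaul et al., 2024)."""
    TYPE_I = "open-ended"     # free-form generation inventing unsupported content
    TYPE_II = "restricted"    # error on a narrow, fact-seeking prompt

@dataclass(frozen=True)
class HallucinationLabel:
    span: str                 # offending text span in the model output
    grounding: Grounding
    regime: PromptRegime

# Example annotation: an invented object in a free-form caption.
label = HallucinationLabel(
    span="the red car",
    grounding=Grounding.INTRINSIC,
    regime=PromptRegime.TYPE_I,
)
```

Keeping the two axes separate (grounding vs. prompt regime) mirrors the finding that progress on one regime does not transfer to the other.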

2. Benchmarking Methodologies and Scoring Functions

Hallucinations leaderboards implement unified pipelines comprising standardized datasets, automated or human-in-the-loop annotation, and precisely defined metrics. Metrics are always task- and taxonomy-specific; recurring choices across benchmarks include macro-F1 for multi-class detection, AUROC for hallucination discrimination, NDCG for ranking-style tasks, class-wise F-beta scores for object hallucination, and per-generation hallucination rates.
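Two of the most widely used metrics—the per-generation hallucination rate and macro-averaged F1 over hallucination classes—can be computed directly. The sketch below uses hypothetical function names and pure Python, and is not any benchmark's reference implementation:

```python
def hallucination_rate(flags):
    """Fraction of generations flagged as hallucinated: HR = |H| / N."""
    return sum(flags) / len(flags)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, as used for multi-class
    hallucination detection (e.g., no/intrinsic/extrinsic labels)."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights rare classes (e.g., extrinsic hallucinations) equally with frequent ones, which is why it is preferred for imbalanced detection leaderboards.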

3. Leaderboard Structure, Model Ranking, and Comparative Analysis

Well-architected leaderboards enumerate results per model across all relevant benchmarks and dimensions. Scoring functions are model- and task-specific, with model ranks determined by principal metrics such as macro-F1 (multi-class detection) (Nguyen et al., 8 Jan 2026), mean AUROC for RAG hallucination discrimination (Sardana, 27 Mar 2025), mean NDCG for caption ordering (Choong et al., 2024), F1 on free-form hallucination matching (Pesiakhovsky et al., 26 Sep 2025), and aggregated normalized scores for unified frameworks (Bang et al., 24 Apr 2025). The leaderboard structure is especially critical for comparing closed- and open-source LLMs, RLHF- vs. instruction-tuned variants, native-multimodal vs. unimodal models, and risk-conservative ("abstain early") vs. risk-seeking ("recall maximizing") system behaviors (Qiang et al., 25 Mar 2025, Hong et al., 2024).
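The aggregation step described above can be illustrated with a minimal sketch: per-benchmark min-max normalization across models, flipping lower-is-better metrics such as hallucination rate, followed by an unweighted average into a single leaderboard score. Function and key names here are hypothetical:

```python
def normalized_aggregate(scores, higher_is_better):
    """scores: {model: {benchmark: value}}. Min-max normalize each
    benchmark across models (inverting lower-is-better metrics), then
    average into one leaderboard score per model and rank descending."""
    benchmarks = next(iter(scores.values())).keys()
    norm = {m: [] for m in scores}
    for b in benchmarks:
        vals = [scores[m][b] for m in scores]
        lo, hi = min(vals), max(vals)
        for m in scores:
            # Tie-break: if all models score identically, assign 0.5.
            x = (scores[m][b] - lo) / (hi - lo) if hi > lo else 0.5
            norm[m].append(x if higher_is_better[b] else 1 - x)
    agg = {m: sum(v) / len(v) for m, v in norm.items()}
    return sorted(agg.items(), key=lambda kv: kv[1], reverse=True)
```

Unweighted averaging is only one design choice; frameworks that weight tasks by difficulty or reliability would replace the final mean accordingly.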

A consistent observation across leading studies is that performance does not transfer across task framings: models that improve on restricted, fact-seeking prompts often show no corresponding gains on open-ended generation, and vice versa (Kaul et al., 2024).

4. Key Benchmarks and Reference Leaderboards

Several state-of-the-art leaderboards and benchmark frameworks encapsulate contemporary best practices:

  • FACTS Grounding Leaderboard: Evaluates long-form LLM grounding against 32k-token contexts using an ensemble of automated judges with consensus eligibility and span-level factuality checks (Jacovi et al., 6 Jan 2025).
  • THRONE: Quantifies object hallucinations in both open-ended (Type I) and restricted QA (Type II) LVLM generations, including class-wise F₀.₅ as a primary metric; demonstrates that progress on one type does not transfer to the other (Kaul et al., 2024).
  • HalluLens: Unifies extrinsic and intrinsic hallucination tasks with dynamic test-set generation, task normalization, and weighted aggregation into overall model scores to mitigate leakage and overfitting (Bang et al., 24 Apr 2025).
  • FaithJudge Leaderboard: Replaces legacy fine-tuned detectors with an LLM-as-judge approach using few-shot human annotations, achieving higher agreement with human hallucination labels in RAG summarization and QA (Tamber et al., 7 May 2025).
  • MIRAGE-Bench: Assesses agentic hallucinations in interactive LLM-based agents by scoring action faithfulness to instruction, history, and observation context, with a utility function and Hallucination Rate as core metrics (Zhang et al., 28 Jul 2025).
  • SHALLOW (ASR): Quantifies speech hallucinations across lexical, phonetic, morphological, and semantic axes, exposing failure modes under degraded audio conditions that aggregate WER scores obscure (Koudounas et al., 18 Oct 2025).
  • ViHallu Challenge: Establishes a macro-F1 leaderboard in Vietnamese for no/intrinsic/extrinsic hallucinations under factual, noisy, and adversarial prompts, emphasizing the utility of structured prompting and ensemble adapters (Nguyen et al., 8 Jan 2026).
  • MetaCheckGPT (SemEval SHROOM-6): Treats hallucination detection as a meta-regression over diverse model uncertainties, outperforming black-box detection by 15–30 points (Mehta et al., 2024).
  • HaluEval-Wild and AuthenHallu: Introduce real user-LLM dialogue benchmarks to quantify “in-the-wild” hallucination propensities, revealing high error rates for open-source and even advanced closed-source models on adversarial queries (Zhu et al., 2024, Ren et al., 12 Oct 2025).
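Several of these benchmarks share two mechanisms: precision-weighted F-scores (THRONE's class-wise F₀.₅) and consensus over an ensemble of automated judges (FACTS Grounding). As an illustrative sketch, not any benchmark's reference implementation:

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    beta = 0.5 weights precision over recall, penalizing invented
    content more heavily than omissions."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0

def consensus_flag(judge_votes, quorum=2):
    """Flag an output as hallucinated only when at least `quorum`
    of the automated judges agree, reducing single-judge noise."""
    return sum(judge_votes) >= quorum
```

For example, with 8 correctly grounded objects, 2 invented ones, and 4 omissions, `f_beta(8, 2, 4)` yields ≈ 0.769, noticeably below the precision of 0.8 because the missed objects still cost some score.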

5. Leaderboard Maintenance: Best Practices and Practical Considerations

Operationalizing a robust hallucinations leaderboard requires strict adherence to dynamic dataset regeneration (to prevent saturation and gaming), transparent versioning, and open-sourcing of scoring scripts and prompt templates (Bang et al., 24 Apr 2025). Models must be rerun and re-evaluated on each new data seed or major checkpoint update to ensure comparability. Blind test splits and limited public release of exact evaluation items guard against overfitting and data leakage (Jacovi et al., 6 Jan 2025). Benchmarks that rely on LLM-as-judge mechanisms must be periodically validated against human-annotated subsets to detect drift in automatic evaluation quality (Pesiakhovsky et al., 26 Sep 2025, Tamber et al., 7 May 2025). Multimodal and cross-linguistic expansions—such as audio-visual, agentic, and low-resource language benchmarks—are prioritized to ensure ecologically valid and globally relevant outcomes (Seth et al., 18 Aug 2025, Nguyen et al., 8 Jan 2026). Leaderboards may also provide confidence calibration metrics and encourage abstention or “don’t know” handling to minimize high-risk, unsupported outputs (Nakamizo et al., 16 Oct 2025).
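One way to realize the dynamic regeneration and transparent versioning described above is a seeded, deterministic draw from a larger item pool, tagged with a content hash so published results can be tied to an exact dataset version. This is a minimal sketch with hypothetical names, not any leaderboard's actual pipeline:

```python
import hashlib
import json
import random

def regenerate_split(item_pool, seed, n):
    """Deterministically draw a fresh evaluation split from a larger
    item pool. Re-running all models on each new seed limits saturation
    and gaming while keeping rankings comparable across versions."""
    rng = random.Random(seed)
    split = rng.sample(sorted(item_pool), n)  # sort for order-independence
    # Version tag: hash of the exact items, for transparent versioning.
    digest = hashlib.sha256(json.dumps(split).encode()).hexdigest()[:12]
    return split, f"v-{seed}-{digest}"
```

Because the tag is derived from the sampled items themselves, any silent change to the pool or sampler produces a visibly different version string.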

6. Practical Implications and Current Limitations

Hallucinations leaderboards enable model developers and application designers to select systems with verifiable grounding profiles, tailor them to domain constraints, and benchmark hallucination reduction strategies through prompt engineering, instruction tuning, or retrieval augmentation (Hong et al., 2024, Sardana, 27 Mar 2025, Kabongo et al., 2024). Persistent limitations remain: current detector F1s for hallucination localization in authentic dialogues plateau at 64–67%, even with ensemble and reasoning-augmented LLMs (Ren et al., 12 Oct 2025, Pesiakhovsky et al., 26 Sep 2025), and both intrinsic and extrinsic hallucinations resist simple scaling and fine-tuning interventions (Kaul et al., 2024, Wang et al., 2024). Category- and domain-specific vulnerabilities (e.g., temporal/causal hallucinations in video, faithfulness errors in Vietnamese, object hallucinations in multimodal QA, semantic category confusion in ASR) demand continued expansion and refinement of both taxonomies and detection pipelines (Choong et al., 2024, Koudounas et al., 18 Oct 2025, Nguyen et al., 8 Jan 2026).

Future directions involve hybrid symbolic–neural verification, dynamic retrieval-augmented judge models, broader multilingual coverage, formalization of error localization, and multi-turn, interactive evaluation protocols crossing modalities and agent/LLM boundaries (Pesiakhovsky et al., 26 Sep 2025, Zhang et al., 28 Jul 2025, Seth et al., 18 Aug 2025, Wang et al., 2024).


The hallucinations leaderboard, in its contemporary instantiations, has become a multi-factorial, fine-grained, and dynamically evolving platform for measuring, diagnosing, and driving mitigation of unsupported model outputs across the generative modeling landscape (Hong et al., 2024, Jacovi et al., 6 Jan 2025, Kaul et al., 2024, Bang et al., 24 Apr 2025, Pesiakhovsky et al., 26 Sep 2025, Zhang et al., 28 Jul 2025).
