Hallucination in Natural Language Generation

Updated 27 December 2025
  • Hallucination in NLG is the phenomenon where models generate fluent but factually unsupported or fabricated text, affecting reliability in critical applications.
  • Research distinguishes intrinsic hallucinations (direct contradictions with input) from extrinsic ones (unsupported content) using metrics like NLI and retrieval-based methods.
  • Mitigation strategies include retrieval-augmented generation, multi-stage pipeline systems, and uncertainty-aware decoding to substantially reduce factual errors.

Hallucination in Natural Language Generation refers to the phenomenon where LLMs generate text that is fluent and plausible but factually unsupported, incorrect, or outright fabricated. Hallucinations undermine the reliability of NLG systems in critical applications, as they can introduce spurious information and erode user trust, particularly in domains requiring high factual fidelity. The research literature addresses hallucination from multiple perspectives, including formal taxonomies, underlying causes, detection and mitigation methodologies, and open technical challenges.

1. Formal Taxonomy and Definitions

Hallucination is operationalized as the production of content that is neither entailed by nor verifiable against the input or relevant external knowledge. Two primary axes structure the definition:

  • Intrinsic hallucination: Output that explicitly contradicts or conflicts with the source input. Formally, for an input x and output y, an intrinsic hallucination occurs if there exists a span s ⊆ y such that supp(x, s) = 0 and contrad(x, s) = 1.
  • Extrinsic hallucination: Output that is neither verifiably supported nor contradicted by the input—content for which supp(x, s) = 0 and contrad(x, s) = 0.
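The two definitions above reduce to a simple decision rule over the predicates supp and contrad. The sketch below illustrates this; `supported` and `contradicted` are placeholders for supp(x, s) and contrad(x, s), which in practice would be realized by an NLI model or a retrieval-based check.

```python
def classify_span(supported: bool, contradicted: bool) -> str:
    """Label an output span s relative to the input x."""
    if contradicted:
        return "intrinsic"   # supp(x, s) = 0, contrad(x, s) = 1
    if not supported:
        return "extrinsic"   # supp(x, s) = 0, contrad(x, s) = 0
    return "faithful"        # supp(x, s) = 1

assert classify_span(supported=False, contradicted=True) == "intrinsic"
assert classify_span(supported=False, contradicted=False) == "extrinsic"
```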

A comprehensive taxonomy contextualizes hallucination with fine granularity both in task scope and error type. The HAD taxonomy (Xu et al., 22 Oct 2025) and others (Alansari et al., 5 Oct 2025, Ji et al., 2022) distinguish:

  • Faithfulness Hallucinations: Task-type inconsistency, requirement inconsistency, contradiction with input, baseless information, omission, internal output contradiction, and structural incoherence.
  • Factuality Hallucinations: Fact recall errors, fact inference errors, fabricated entities, and fictional attribution.

These distinctions delineate failures along dimensions of consistency with instruction, input, and world knowledge.

2. Causes and Theoretical Limits

Hallucination originates from the interplay of data, architecture, training procedure, and inference mechanisms (Alansari et al., 5 Oct 2025, Kalavasis et al., 2024):

  • Data-related factors: Noisy, biased, or conflicting training corpora and inadequate coverage of long-tail or factual knowledge.
  • Model-level factors: Attention diffusion in long contexts, inductive biases of next-token prediction, and lack of negative examples for factuality discrimination.
  • Training and inference effects: Exposure bias from teacher forcing, misalignment of decodability and factuality due to standard likelihood objectives, and overconfidence from reward-based alignment tuning.
  • Decoding and prompt effects: Stochastic sampling, ambiguous prompts, and insufficient grounding at inference encourage unsupported generation.

Fundamental results formalize hallucination as an unavoidable property for broad model classes and task settings. For any computable LLM h, there exists an infinite set of inputs s for which h(s) is not an acceptable factual output, via diagonalization (Suzuki et al., 15 Feb 2025). However, the probability measure of such hallucinations under any practical input distribution can be made arbitrarily small by sufficient data and robust algorithms, rendering hallucination "statistically negligible" in properly scoped deployments (Suzuki et al., 15 Feb 2025). A related impossibility result establishes that no generative learner can simultaneously guarantee both perfect factual consistency (no hallucination) and full output breadth (no mode collapse) across non-identifiable language classes, except by leveraging negative examples or explicit corrective feedback (Kalavasis et al., 2024).

3. Detection Methodologies

Detection of hallucination divides into several methodological categories, with approaches evolving alongside the rise of LLMs (Qi et al., 2024, Alansari et al., 5 Oct 2025, Xu et al., 22 Oct 2025):

A. Reference/Knowledge-Based Approaches

  • Natural Language Inference (NLI)-based Detectors: Model outputs are assessed by computing entailment and contradiction probabilities with respect to reference inputs (Kang et al., 2024). ENT and DIFF metrics computed via NLI models correlate well with human factuality judgments in high-resource languages (Pearson r ≈ 0.49 for ENT), outperforming lexical metrics such as ROUGE or named-entity overlaps.
  • Retrieval-Augmented Checking: Claims or entities in the output are compared to retrieved evidence from external corpora or knowledge bases (e.g., RAG frameworks, FacTool, UFO) (Béchard et al., 2024, Alansari et al., 5 Oct 2025). Span-level detectors (FAVA, SBD) operate at token granularity.
  • Self-Consistency / Sampling-Based: Multiple generations are compared for semantic or factual consistency (SelfCheckGPT) (Alansari et al., 5 Oct 2025).
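The ENT, DIFF, and CON scores described above can be computed sentence-by-sentence from any NLI classifier's output distribution. The sketch below assumes a placeholder `nli(premise, hypothesis)` callable returning (entailment, neutral, contradiction) probabilities, e.g. a fine-tuned NLI model; it is not a specific library API.

```python
def score_output(nli, source: str, sentences: list[str]) -> dict:
    """Aggregate NLI-based faithfulness scores for output sentences."""
    ent, diff, con = [], [], []
    for sent in sentences:
        p_ent, _, p_con = nli(source, sent)
        ent.append(p_ent)           # entailment probability
        diff.append(p_ent - p_con)  # entailment minus contradiction
        con.append(p_con)
    return {
        "ENT": sum(ent) / len(ent),
        "DIFF": sum(diff) / len(diff),
        "CON": max(con),            # maximal contradiction score
    }
```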

B. Uncertainty and Introspection-Based Approaches

  • Token-Level Predictive Uncertainty: Hallucination correlates with elevated predictive entropy, especially epistemic uncertainty (Xiao et al., 2021, Su et al., 2024). Real-time hallucination detectors (RHD) utilize per-entity token probability and entropy during generation (AUC = 89.31 on WikiBio GPT-3) (Su et al., 2024).
  • Model Internals: Layerwise Relevance Propagation (LRP), attention analysis, and contribution staticity provide robust feature sets for lightweight hallucination detectors in NMT (Xu et al., 2023). LRP-based detectors achieved F1 = 81.2% (De-En) and AUC = 91.4%.
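The token-level uncertainty signal above amounts to computing the entropy of the model's next-token distribution at each decoding step and flagging high-entropy steps. A minimal sketch, with `step_probs` standing in for the per-step softmax outputs of a real model:

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def flag_uncertain_tokens(step_probs: list[list[float]],
                          threshold: float) -> list[bool]:
    """Mark decoding steps whose predictive entropy exceeds a threshold."""
    return [token_entropy(p) > threshold for p in step_probs]
```

For a uniform distribution over n tokens the entropy is log(n), the maximum; a near-one-hot distribution has entropy near zero, so the flag separates confident from uncertain steps.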

C. Supervised and Taxonomic Detection

  • Multi-Task and Fine-Grained Models: Dedicated LLMs (e.g., HAD) are trained for multi-class hallucination type detection, span localization, and correction, achieving SOTA performance (e.g., HAD-14B: binary accuracy = 89.1%, span F1 = 76.0%) (Xu et al., 22 Oct 2025).

The limitations of these methods include weak performance in low-resource languages, brittleness to atomic-fact hallucinations, and limited explainability or localization in many detectors. Lexical overlap metrics are generally unreliable for real-world hallucination detection (Kang et al., 2024, Qi et al., 2024).

4. Mitigation Strategies

Mitigation is pursued through innovations in modeling, data, and post-processing:

  • Retrieval-Augmented Generation (RAG): Integrating dense retrievers with LLMs, RAG systems significantly reduce structured hallucinations (e.g., in workflow generation, hallucinated steps drop from 13.7% to 1.9%) (Béchard et al., 2024). Dynamic retrieval triggered by real-time detection further improves efficiency and faithfulness (DRAD framework) (Su et al., 2024).
  • Agentic Multi-Stage Pipelines: Multi-agent systems orchestrated via natural-language-based interfaces (e.g., OVON framework) iteratively detect, flag, and rephrase speculative or unsupported content, using explicit markers and disclaimers. Each agent’s rationale is preserved in machine-readable “whisper” events, and quantitative Key Performance Indicators (KPIs) such as Factual-sounding Claim Density (FCD), Fictional Disclaimer Frequency (FDF), and Total Hallucination Score (THS) track mitigation progress. Empirically, multi-agent pipelines result in reductions exceeding 2800% in aggregate THS over three agent layers (mean THS: −0.0049 → −0.1396) (Gosmar et al., 19 Jan 2025).
  • Minimal-Edit Rewriting: Upon detection, targeted rewriting by powerful LLMs (e.g., GPT-4) corrects flagged sentences with high precision and reduced token budgets, achieving post-rewrite precision > 0.60 and halving hallucinated sentences (Wang et al., 2024).
  • Controllable Generation: Conditioning models on hallucination “knobs” (discrete tokens or decoder branches) reduces hallucinated content with minimal drop in fluency (e.g., word-overlap hallucination control raises faithfulness by +25 pp in human evaluation) (Filippova, 2020, Rebuffel et al., 2021).
  • Uncertainty-Aware Decoding: Penalizing epistemic uncertainty during beam search achieves strong reductions in hallucination with modest standard metric trade-offs (Xiao et al., 2021).
  • RLHF and Feedback: Negative feedback (both during training and post-deployment), including human-in-the-loop annotation, is theoretically essential for achieving both factual consistency and expressive breadth in model outputs (Kalavasis et al., 2024).
  • Data Cleaning and Input Structuring: NLI-filtering and optimized input linearization eliminate both intrinsic and extrinsic hallucinations in data-to-text and chart summarization, improving value correctness rates by +20 pp and reducing unsupported content by over half (Islam et al., 2023).
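At its core, the RAG strategy above is a retrieve-then-generate loop: fetch evidence for the query, then constrain generation to that evidence. A schematic sketch, where `retrieve` and `generate` are placeholder callables rather than a specific framework's API:

```python
def rag_answer(query: str, retrieve, generate, k: int = 5) -> str:
    """Retrieve top-k passages and condition generation on them."""
    passages = retrieve(query, k)                 # e.g. dense retrieval
    context = "\n".join(passages)
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)
```

The grounding instruction in the prompt is what pushes the model toward supported content; dynamic variants (e.g. the DRAD framework cited above) trigger retrieval only when a real-time detector flags uncertainty.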

5. Quantification and Evaluation Metrics

A rich set of metrics assesses hallucination from multiple perspectives (Alansari et al., 5 Oct 2025, Qi et al., 2024):

  • NLI-Based Metrics: Sentence-level and claim-level entailment probability, difference (ENT, DIFF), and maximal contradiction score (CON) (Kang et al., 2024).
  • Lexical Overlap Metrics: ROUGE-n, Named Entity Overlap (NEO), PARENT for table-to-text (Kang et al., 2024, Ji et al., 2022). While easy to compute, these measures correlate poorly with faithfulness, especially under paraphrase or entity reordering.
  • QA-Based Metrics: QuestEval, QAFactEval, and similar pipelines combining question generation and answering over both summaries and sources yield precision, recall, and F1 over extracted answers.
  • Self-Consistency Scores: Inter-sample semantic or NLI agreement signals internal contradictions.
  • Human Evaluation: Granular annotation on faithfulness scales, span-level tagging (e.g., in UHGEval), and binary correctness on domain-specific axes (Liang et al., 2023).
  • KPIs in Agentic Systems: FCD, FGR (grounding references), FDF, ECS (context markers), and composite THS track the density of factual claims, grounding, disclaimers, and context explicitly (Gosmar et al., 19 Jan 2025).
  • Span and Class Metrics: Precision, recall, F1 at span or token level, as used in HADTest and benchmark competitions (Xu et al., 22 Oct 2025).
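Span-level precision, recall, and F1 as used in the benchmarks above can be computed directly once predicted and gold spans are represented as sets; the sketch below compares spans by exact offset match (partial-overlap credit, which some benchmarks use, is omitted for brevity).

```python
def span_prf(pred: set, gold: set) -> tuple:
    """Precision, recall, F1 over predicted vs. gold hallucination spans."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

p, r, f = span_prf({(0, 4), (10, 15)}, {(0, 4), (20, 25)})
# one true positive out of two predictions and two gold spans:
# precision = recall = f1 = 0.5
```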

The cross-cutting limitations of automatic evaluation include propagation of errors through multi-stage pipelines, lack of fine-grained localization, high cost for LLM-based assessment, and poor coverage of world-factuality errors.

6. Challenges and Future Directions

Several technical frontiers remain (Qi et al., 2024, Alansari et al., 5 Oct 2025, Xu et al., 22 Oct 2025):

  • Interpretability: Developing detectors that provide span-level localization and actionable rationales for identified hallucinations.
  • Generalization and Coverage: Scaling detection and mitigation across low-resource languages, novel domains, and long-context or multi-turn tasks.
  • Efficiency and Scalability: Reconciling the cost of retrieval, verification, and rewriting modules with production latency and throughput constraints.
  • Hybrid and Human-in-the-Loop Systems: Combining lightweight detection heuristics with periodic human auditing and selectively solicited feedback is recommended for balancing reliability and scalability.
  • Unified Benchmarks and Taxonomies: The need for benchmark suites that cover both intrinsic and extrinsic hallucinations, span various fact granularities, and include task- and domain-specific criteria is emphasized (Qi et al., 2024, Xu et al., 22 Oct 2025).
  • Theoretical Foundations: Open challenges persist in formalizing the emergence of hallucination, quantifying the limits imposed by architecture and data, and developing learning protocols which leverage negative supervision effectively (Suzuki et al., 15 Feb 2025, Kalavasis et al., 2024).

The literature evidences rapid progress, but underscores that hallucination remains an open and multifaceted research challenge whose resolution requires advances in theory, model design, evaluation, and system engineering.
