FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG

Published 9 Jan 2026 in cs.CL | (2601.05866v1)

Abstract: Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.


Summary

The paper introduces FACTUM, a framework that diagnoses citation hallucination by analyzing transformer attention and FFN pathways during citation generation in long-form RAG.
It defines four novel scores (CAS, BAS, PFS, and PAS) that quantify context alignment, attention-sink usage, parametric recall, and attention-FFN pathway alignment to assess citation reliability.
Empirical results demonstrate that FACTUM outperforms prior methods by up to 37.5% in AUC and reveals scale-dependent mechanistic signatures across different model sizes.

Mechanistic Detection of Citation Hallucination in Long-Form Retrieval-Augmented Generation

Motivation and Problem Definition

Citation hallucination in Retrieval-Augmented Generation (RAG) models represents a critical failure mode: models confidently attribute claims to sources that do not substantiate, or even directly contradict, those claims. This error undermines the core promise of verifiability in RAG, especially in long-form scenarios where attributional drift becomes prevalent as a model synthesizes content across vast context windows (up to millions of tokens). Notably, citation hallucination erodes user trust because readers heuristically treat citations as marks of reliability, regardless of whether the cited source actually supports the claim.

Figure 1: Citation hallucination can render even factually accurate statements unverifiable and misleading by misattributing claims to incorrect sources.

Despite its importance, citation hallucination has received limited attention compared to broader RAG hallucination. Conventional approaches typically treat hallucination as a binary problem tied to over-reliance on parametric knowledge, yet fail to mechanistically differentiate between proper and improper citation. Existing detection methods are predominantly black-box or rely on token-level uncertainty, overlooking the transformer’s internal attribution mechanisms. This work proposes FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms) to diagnose citation hallucination by probing the underlying architectural pathways in transformer LLMs.

FACTUM Framework Overview

FACTUM leverages the separation of information streams in transformer architecture—specifically, Attention (external context reader) and Feed-Forward Network (FFN, internal parametric recall)—to disentangle the mechanistic contributions at a citation token. Four novel scores are introduced:

Context Alignment Score (CAS): Directly quantifies the citation token’s semantic alignment with retrieved source documents using an attention-weighted aggregation. High CAS indicates contextual grounding.
Beginning-of-Sentence Attention Score (BAS): Measures attention allocation to the "attention sink" token (e.g., <s>), an emergent mechanism for global synthesis in long contexts. High BAS correlates with improved information integration.
Parametric Force Score (PFS): Captures the magnitude of FFN-induced modification to the residual stream; proxies the model’s reliance on parametric knowledge. High PFS during citation can indicate confident recall of fact.
Pathway Alignment Score (PAS): Assesses geometric alignment (cosine similarity) between attention and FFN updates. High PAS signals coordinated use of both pathways; low or negative PAS reflects dissonance or orthogonal processing.
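The four scores above can be illustrated with a minimal sketch. It is not the paper's implementation: the exact formulas, the pooling used for CAS, and every function and variable name below are assumptions made for illustration, and the vectors are toy values standing in for the attention update, the FFN update, and the attention weights read at a single citation token.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0

def parametric_force_score(ffn_update):
    """PFS sketch: magnitude (L2 norm) of the FFN update to the residual stream."""
    return norm(ffn_update)

def pathway_alignment_score(attn_update, ffn_update):
    """PAS sketch: cosine similarity between the attention and FFN updates."""
    return cosine(attn_update, ffn_update)

def bos_attention_score(attn_weights, bos_index=0):
    """BAS sketch: attention mass the citation token places on the sink token."""
    return attn_weights[bos_index]

def context_alignment_score(attn_weights, token_states, citation_state, context_positions):
    """CAS sketch (assumed form): pool the context token states with the
    citation token's attention weights, then compare the pooled vector to the
    citation token's own state by cosine similarity."""
    dim = len(citation_state)
    pooled = [0.0] * dim
    for i in context_positions:
        for d in range(dim):
            pooled[d] += attn_weights[i] * token_states[i][d]
    return cosine(citation_state, pooled)

# Toy values: position 0 is the sink token, positions 1-2 are context tokens.
attn_update = [1.0, 0.0]
ffn_update = [0.0, 2.0]
attn_weights = [0.5, 0.3, 0.2]
token_states = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
citation_state = [1.0, 0.0]

pfs = parametric_force_score(ffn_update)
pas = pathway_alignment_score(attn_update, ffn_update)  # orthogonal updates
bas = bos_attention_score(attn_weights)
cas = context_alignment_score(attn_weights, token_states, citation_state, [1, 2])
print(pfs, pas, bas, round(cas, 3))
```

In this toy case the attention and FFN updates are orthogonal, so PAS is zero. That is the kind of low-alignment pattern the source associates with correct citations in the 8B model. In practice such scores would be computed per layer and per head, which is why a later reduction step is needed.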

The holistic FACTUM diagnostic operates at the moment of citation generation, measuring these four signals to distinguish reliable from hallucinated citations.

Figure 2: FACTUM dissects the model’s internal state at the citation token, computing pathway-specific and interaction scores.

Experimental Validation

Task and Dataset

The evaluation targets single-citation, long-form RAG on NeuCLIR 2024, a challenging report generation task with 15 context documents. Citations are annotated for factual correspondence using a robust LLM-as-a-judge protocol, validated for stability and human alignment using Cohen’s Kappa ($\kappa = 0.68$).
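For readers unfamiliar with the agreement statistic, Cohen's Kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators. The label lists are made-up toy data, not the paper's annotations, and the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels (1 = citation supported, 0 = hallucinated); illustrative only.
judge_labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
human_labels = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
kappa = cohens_kappa(judge_labels, human_labels)
print(round(kappa, 3))
```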

Models and Feature Engineering

Two models are assessed: Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct, both with extended context capability. Raw FACTUM scores (thousands per token) are distilled via model-component pruning and time-series/statistical reductions, yielding concise features for interpretable classifiers (Logistic Regression, LightGBM, EBM).
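The distillation step can be pictured as two stages: keep only the most informative model components, then summarize the surviving scores with a few statistics. The sketch below is one plausible reading under stated assumptions. The ranking criterion (difference in class means), the chosen statistics, and all names are hypothetical and are not taken from the paper.

```python
import math
import statistics

def select_top_components(score_rows, labels, keep_fraction=0.25):
    """Rank components (e.g. layer/head pairs) by how far their mean score
    differs between correct (1) and hallucinated (0) citations, and keep the
    top fraction. Assumes both classes are present in `labels`."""
    n_components = len(score_rows[0])

    def separation(j):
        pos = [row[j] for row, y in zip(score_rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(score_rows, labels) if y == 0]
        return abs(statistics.fmean(pos) - statistics.fmean(neg))

    ranked = sorted(range(n_components), key=separation, reverse=True)
    keep = max(1, math.ceil(keep_fraction * n_components))
    return sorted(ranked[:keep])

def reduce_scores(row, kept):
    """Collapse the kept component scores into a few summary features."""
    vals = [row[j] for j in kept]
    return {
        "mean": statistics.fmean(vals),
        "max": max(vals),
        "min": min(vals),
        "std": statistics.pstdev(vals),
    }

# Toy data: 4 citations x 4 components; only component 0 separates the classes.
score_rows = [
    [0.9, 0.1, 0.5, 0.4],
    [0.8, 0.2, 0.5, 0.6],
    [0.1, 0.1, 0.5, 0.5],
    [0.2, 0.2, 0.5, 0.5],
]
labels = [1, 1, 0, 0]

kept = select_top_components(score_rows, labels, keep_fraction=0.25)
features = [reduce_scores(row, kept) for row in score_rows]
print(kept, features[0])
```

The resulting feature dictionaries are the kind of compact input an interpretable classifier such as Logistic Regression could consume. In a real pipeline the component selection would be fitted on training data only, to avoid leaking labels into evaluation.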

Results

FACTUM surpasses prior mechanistic baselines (ReDeEP’s ECS and PKS) and uncertainty-based methods. For instance, Logistic Regression with FACTUM features attains an AUC of 0.715 (3B) and 0.737 (8B), and FACTUM outperforms the baselines by up to 37.5% in AUC. General baselines such as P(True) and Perplexity achieved reasonable recall but low precision, making them unreliable for practical detection.
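As a reminder of what the AUC figures measure: AUC is the probability that a randomly chosen positive example receives a higher detector score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The labels and scores are invented toy values, and treating label 1 as the positive class is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy detector output; illustrative only, not results from the paper.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking, which gives a reference point for reading reported values such as 0.715 and 0.737.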

Mechanistic Signatures

FACTUM uncovers model-scale-dependent signatures for correct citations:

3B Model: Correct citations coincide with higher CAS, BAS, PAS, and PFS—internal pathways work in coordination for proper attribution.
8B Model: Peak detection performance is reached using the top 25% of specialized components; correct attribution is marked by high PFS and BAS, but PAS tends to be lower, indicating orthogonal, complementary information processing rather than straightforward pathway agreement.

This evidences a shift with scale from "pathway agreement" to "specialized, orthogonal contribution." It also challenges the view that citation hallucination is simply a matter of over-reliance on parametric knowledge: in both models, correct citations show a stronger parametric contribution (higher PFS).

Practical and Theoretical Implications

Reliable Citation Detection: FACTUM’s pathway-based diagnostic enables token-level detection of citation hallucination, providing an actionable verifiability safeguard for RAG outputs in real time.
Mechanistic Interpretability: By mapping distinct attribution failures and coordination signatures, FACTUM facilitates deeper model audits and interpretability beyond simplistic behavioral probes.
Long-context Model Design: The scale-dependent mechanistic patterns uncovered suggest future RAG architectures might selectively enhance certain pathway operations, such as adaptive attention sink utilization or context-parametric interplay.
Externally-Visible Trust Indicators: Scores such as PAS or BAS could be operationalized as warning indicators (e.g., citation asterisks), helping users gauge citation reliability in generated texts.

Speculation on Future Developments

Mechanistic frameworks like FACTUM could be integrated into model training and inference pipelines to dynamically enforce attribution integrity. Quantitative detection mechanisms might be coupled with retrieval or generative models to actively mitigate attributional drift, especially in compositional or multi-hop settings. Moreover, fine-grained pathway diagnostics will likely become essential for regulatory and academic applications requiring auditability of automated claims.

Conclusion

FACTUM establishes a new state-of-the-art for citation hallucination detection by exploiting transformer pathway mechanisms, revealing nuanced, scale-dependent signatures of factual attribution. Its diagnostic capability reframes citation reliability as an emergent property of coordinated internal processing, advocating for interpretability-driven advances in RAG model trustworthiness and design.