
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

Published 12 Jan 2026 in cs.CL and cs.AI | (2601.07422v1)

Abstract: Despite their impressive capabilities, LLMs frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. First, we validate and disentangle these pathways through attention knockout and token patching. Afterwards, we uncover notable and intriguing properties of these two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions. Finally, building on these insightful findings, two applications are proposed to enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.

Summary

  • The paper demonstrates that LLMs signal truthfulness through two mechanisms: Q-Anchored (question-driven) and A-Anchored (answer-driven) encoding.
  • The authors use saliency analysis, attention knockout, and token patching experiments to validate distinct pathway sensitivities in truthfulness encoding.
  • The pathway-aware applications, including Mixture-of-Probes and Pathway Reweighting, significantly enhance hallucination detection and model reliability.

Two Pathways to Truthfulness in LLMs: Mechanistic Insights and Implications for Hallucination Detection

Overview and Motivation

The paper "Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations" (2601.07422) conducts a rigorous analysis of how LLMs encode internal signals of truthfulness and hallucination. While prior work has demonstrated that probing internal model states can recover reliable truthfulness signals, the mechanisms underlying such representations have remained opaque. This work provides a comprehensive mechanistic characterization via a series of interpretability experiments, revealing two fundamentally distinct pathways by which LLMs encode truthfulness: the Question-Anchored (Q-Anchored) pathway and the Answer-Anchored (A-Anchored) pathway.

Mechanistic Dissection of Hallucination Encoding

Saliency Analysis and Bimodal Mechanisms

Initial experiments use saliency-based interpretability to measure information flow from question tokens to answer tokens in QA settings. Kernel density estimation of the saliency scores reveals a bimodal distribution, suggesting two distinct regimes: instances with negligible question-to-answer flow and instances that depend strongly on it (Figure 1).

Figure 1: Kernel density estimates of saliency-score distributions for question-to-answer flows; clear bimodality evidences two mechanisms.

This bimodality suggests two diverging modes of internal truthfulness encoding: one that relies on information transfer from the question (Q-Anchored) and another that is largely independent of the input context (A-Anchored).
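As a rough illustration of how bimodality can be read off a kernel density estimate, the sketch below fits a Gaussian KDE to a handful of saliency scores and counts local maxima on a grid. The scores, the bandwidth, and the helper names `gaussian_kde` and `count_modes` are hypothetical and chosen for illustration; the paper's actual saliency computation and KDE settings are not given in this summary.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

def count_modes(density, lo, hi, steps=200):
    """Count strict local maxima of `density` on an evenly spaced grid over [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    values = [density(x) for x in grid]
    return sum(
        1
        for i in range(1, steps)
        if values[i] > values[i - 1] and values[i] > values[i + 1]
    )

# Hypothetical saliency scores: one cluster with little question-to-answer
# flow and one cluster with strong flow.
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.75, 0.78, 0.80, 0.82, 0.85]
kde = gaussian_kde(scores, bandwidth=0.05)
n_modes = count_modes(kde, 0.0, 1.0)
print(n_modes)
```

The mode count depends on the bandwidth: a much larger value would smooth the two clusters into a single peak, so any bimodality claim should be checked across a range of bandwidths.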

Attention Knockout and Patching Experiments

To validate this dichotomy, attention knockout interventions selectively suppress attention from the exact question tokens to downstream answer positions. The probing classifier's predictions respond in two distinct ways: Q-Anchored samples undergo marked prediction shifts, while A-Anchored samples are robust to the intervention (Figure 2).

Figure 2: ΔP under attention knockout, subdivided by pathway, with sharp probability changes for Q-Anchored samples and stability for A-Anchored samples.
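A minimal sketch of attention knockout for a single answer-position query is shown here. It assumes the common implementation in which the blocked pre-softmax logits are set to negative infinity so that the corresponding attention weights become zero; the summary does not state how the paper implements the intervention, and the logits and position indices below are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax; a logit of -inf yields a weight of exactly 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def knockout(logits, blocked_positions):
    """Block attention to the given key positions by setting their logits to -inf."""
    return [-math.inf if j in blocked_positions else x for j, x in enumerate(logits)]

# Toy attention row for one answer token attending over five keys.
# Assumed layout: positions 0-2 are question tokens, 3-4 are earlier answer tokens.
logits = [2.0, 1.0, 0.5, 0.0, -0.5]
question_positions = {0, 1, 2}

before = softmax(logits)
after = softmax(knockout(logits, question_positions))
print([round(w, 3) for w in before])
print([round(w, 3) for w in after])
```

After the knockout, all attention mass is redistributed over the non-question positions, so any change in the probe's output can be attributed to the removed question-to-answer flow.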

Complementary token patching experiments inject hallucinatory cues into the question. They confirm that Q-Anchored encodings are sensitive to such corruption, whereas probe predictions for A-Anchored samples remain stable.

Source Attribution: The A-Anchored Pathway

To dissect the information source for A-Anchored encoding, "answer-only" experiments exclude the question altogether, forwarding only model-generated answers. Consistently, Q-Anchored samples display substantial prediction drift, whereas A-Anchored samples remain invariant, indicating self-contained evidence in the answer representation.
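The comparison above can be sketched as a prediction-shift test: run the probe with and without the question and measure how far its output moves. The threshold of 0.2, the sample values, and the function names are hypothetical; the paper's actual criterion for assigning samples to a pathway is not given in this summary.

```python
def prediction_shift(p_before, p_after):
    """Absolute change in the probe's predicted truthfulness probability."""
    return abs(p_after - p_before)

def assign_pathway(p_before, p_after, threshold=0.2):
    """Label a sample Q-Anchored if removing the question moves the probe
    output by more than `threshold` (an assumed cutoff), else A-Anchored."""
    if prediction_shift(p_before, p_after) > threshold:
        return "Q-Anchored"
    return "A-Anchored"

# Hypothetical probe outputs with the full prompt vs. the answer alone.
samples = [
    {"id": "s1", "p_full": 0.90, "p_answer_only": 0.35},
    {"id": "s2", "p_full": 0.80, "p_answer_only": 0.78},
]
labels = {s["id"]: assign_pathway(s["p_full"], s["p_answer_only"]) for s in samples}
print(labels)
```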

Knowledge Boundaries and Self-Awareness

Pathway Association with Knowledge Boundaries

Cross-dataset analysis shows that Q-Anchored encoding predominantly occurs for high-certainty, in-domain factual queries (with higher accuracy), while A-Anchored encoding is overrepresented for long-tail, low-support entities lying outside the model’s knowledge boundary (Figure 3).

Figure 3: Comparisons of answer accuracy between pathways, evidencing Q-Anchored dominance in high-accuracy regimes.

Entity popularity analysis further substantiates that Q-Anchored activation is correlated with storage of parametric knowledge about popular entities, while A-Anchored is invoked in cases of parametric deficiency.

Intrinsic Pathway Awareness

The model’s hidden states can themselves discriminate which encoding pathway is active. Probing classifiers trained to predict the pathway achieve high AUC across multiple datasets and architectures, indicating that the internal representations encode which pathway is in use.
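For readers unfamiliar with the metric, AUC can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition on hypothetical probe scores, where label 1 stands for Q-Anchored and 0 for A-Anchored; it is not the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: fraction of (positive, negative) pairs
    in which the positive scores higher, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical pathway labels and probe scores.
pathway_labels = [1, 1, 1, 0, 0, 0]
probe_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
auc = roc_auc(pathway_labels, probe_scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two pathways.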

Pathway-Aware Applications: Advancing Hallucination Detection

Mixture-of-Probes Architecture

Leveraging pathway specialization, the Mixture-of-Probes (MoP) architecture routes samples via a gating network to expert probes trained on each pathway type. This configuration substantially outperforms generic probing and ablated ensemble variants (with up to 10% AUC improvement over strong baselines).
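One way to picture the routing is a soft mixture in which a gate estimates the probability that a sample is Q-Anchored and blends the two expert probes accordingly. This is a sketch under assumptions: the summary does not say whether the paper's gate is soft or hard, and the linear probes, weights, and two-dimensional hidden states below are hypothetical stand-ins.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_probe(weights, bias):
    """Build a logistic probe over a hidden-state vector (a list of floats)."""
    def probe(h):
        return sigmoid(sum(w * x for w, x in zip(weights, h)) + bias)
    return probe

def mixture_of_probes(gate, q_expert, a_expert):
    """Blend two expert probes using the gate's estimated Q-Anchored probability."""
    def predict(h):
        g = gate(h)  # assumed meaning: probability that the sample is Q-Anchored
        return g * q_expert(h) + (1.0 - g) * a_expert(h)
    return predict

# Hypothetical parameters; in practice each probe would be trained on
# hidden states from samples of its own pathway.
q_expert = linear_probe([0.8, -0.3], 0.1)
a_expert = linear_probe([-0.2, 0.9], -0.4)
gate = linear_probe([10.0, 0.0], 0.0)
mop = mixture_of_probes(gate, q_expert, a_expert)

print(round(mop([0.3, -0.2]), 3))
```

When the gate saturates toward 0 or 1, the mixture reduces to a single expert, which recovers hard routing as a special case.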

Pathway Reweighting

Pathway Reweighting (PR) adaptively amplifies attention between answer and question tokens based on predicted pathway membership, thereby enhancing the saliency of truthfulness signals most germane to hallucination detection. This plug-and-play technique yields further AUC gains across a spectrum of datasets and model scales (Figure 4).

Figure 4: Comparisons of answer accuracy between pathways, probing MLP activations of the last exact answer token.
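The reweighting idea can be illustrated on a single attention row: weights on question positions are scaled up in proportion to the predicted probability that the sample is Q-Anchored, and the row is then renormalized. The scaling rule `1 + alpha * p`, the parameter `alpha`, and the toy weights are assumptions made for this sketch; the summary does not specify the paper's exact formulation.

```python
def reweight_attention(weights, question_positions, p_q_anchored, alpha=1.0):
    """Scale attention on question positions by (1 + alpha * p_q_anchored),
    then renormalize so the row sums to 1. The scaling rule is an assumption."""
    scale = 1.0 + alpha * p_q_anchored
    scaled = [
        w * scale if j in question_positions else w
        for j, w in enumerate(weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

# Toy attention row; positions 0-2 are assumed to be question tokens.
weights = [0.2, 0.2, 0.1, 0.3, 0.2]
question_positions = {0, 1, 2}

boosted = reweight_attention(weights, question_positions, p_q_anchored=1.0)
unchanged = reweight_attention(weights, question_positions, p_q_anchored=0.0)
print([round(w, 3) for w in boosted])
```

With a predicted Q-Anchored probability of 0 the row is left as it was, so the intervention only affects samples the pathway predictor routes toward the question.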

Aggregate Results

Both MoP and PR methods demonstrate consistent superiority over uncertainty-based (semantic entropy, logit metrics), self-consistency, and vanilla probing approaches, generalizing robustly across model architectures and QA formats.

Implications and Prospects

This work elucidates key properties of LLM internal truthfulness encoding:

  • The presence of mechanistically distinct Q-Anchored and A-Anchored pathways resolves ambiguities in latent knowledge representation and offers clear mappings to parametric knowledge boundaries.
  • Intrinsic model awareness of pathway selection enables fine-grained, dynamically specialized hallucination detection, a capability markedly superior to monolithic probing.
  • Pathway specialization provides practical handles for developing intervention, analysis, and downstream reliability controls, both at inference-time and within model design.

Looking ahead, further analysis of how these pathways arise during pretraining, instruction tuning, and RL alignment may suggest principled strategies for suppressing hallucinations. Generalizing pathway-aware techniques to tasks beyond QA (e.g., dialogue, summarization) would also expand the utility and theoretical reach of this mechanistic framework.

Conclusion

The paper presents a detailed, empirical dissection of truthfulness encoding in LLMs, revealing synergistic but distinct Q-Anchored and A-Anchored pathways, each aligned to knowledge boundaries and associated with explicit patterns of hallucination sensitivity. By harnessing pathway self-awareness, the authors achieve measurable advances in hallucination detection efficacy, establishing a foundation for robust, interpretable generative modeling. These findings inform both practical techniques for reliability and broader theoretical models of LLM internal computation.
