Cause of limited improvement on factuality/hallucination confessions

Establish whether confession accuracy fails to improve during training on hallucination and factuality evaluations because models are genuinely mistaken in their original answers and therefore repeat the same mistakes in their confessions. Rigorously test this conjectured explanation for the observed training behavior.

Background

The authors observe that confession accuracy improves with training across many evaluations but not for hallucination/factuality tasks involving people and general knowledge, where performance sometimes regresses.

They conjecture that in these domains, incorrect answers stem from genuine belief rather than intentional misbehavior, leading models to repeat the same errors in confessions. Validating this causal explanation would clarify the boundaries of confession training’s effectiveness.
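One way to probe the conjecture is to split wrong answers by some proxy for genuine belief and compare how often the confession admits the error in each group. The sketch below is illustrative only and is not the authors' method: the `Record` fields and the self-consistency proxy (the model gives the same answer across resamples) are assumptions introduced here. If the conjecture holds, the admission rate should be lower among consistently wrong answers.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Record:
    # All fields are hypothetical labels assumed to come from an evaluation harness.
    answer_correct: bool           # was the original answer correct?
    confession_admits_error: bool  # did the confession flag the answer as wrong?
    self_consistent: bool          # assumed proxy for genuine belief


def _rate(admitted: int, total: int) -> Optional[float]:
    # Return None when a group is empty, so no rate is reported for it.
    return admitted / total if total else None


def confession_recall_by_belief(records: Iterable[Record]) -> dict:
    """Among wrong answers, compute the share of confessions that admit the error,
    split by the belief proxy."""
    counts = {True: [0, 0], False: [0, 0]}  # proxy value -> [admitted, total]
    for r in records:
        if r.answer_correct:
            continue  # only wrong answers are relevant to the conjecture
        counts[r.self_consistent][1] += 1
        if r.confession_admits_error:
            counts[r.self_consistent][0] += 1
    return {
        "consistent_wrong": _rate(*counts[True]),
        "inconsistent_wrong": _rate(*counts[False]),
    }
```

A gap between the two rates would be consistent with the conjecture but would not establish causation on its own, since the proxy may correlate with question difficulty or other factors.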

References

We conjecture that this is because in those evaluations, it is often the case that when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well.

— Training LLMs for Honesty via Confessions (2512.08093 - Joglekar et al., 8 Dec 2025) in Section 4.2 (RL training improves confessions)
