Cause of limited improvement on factuality/hallucination confessions
Determine whether confession accuracy fails to improve during training on hallucination and factuality evaluations because models are genuinely mistaken in their original answers and therefore repeat the same mistakes in their confessions. Rigorously test this conjectured explanation for the observed training behavior.
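One way such a test might be set up is sketched below. It is only an illustration, not the authors' method: it assumes each evaluation example has been graded for answer correctness, that the confession has been labeled as admitting or not admitting an error, and that agreement among resampled answers can serve as a rough proxy for a model being "genuinely mistaken" (a consistently repeated wrong answer). The `Record` fields, the function name `confession_recall_by_group`, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Record:
    # All fields are hypothetical labels assumed to come from a separate grading step.
    answer_correct: bool       # original answer matches ground truth
    confessed_error: bool      # confession flags the answer as wrong or uncertain
    resample_agreement: float  # fraction of resampled answers equal to the original, in [0, 1]


def confession_recall_by_group(
    records: Iterable[Record], threshold: float = 0.8
) -> Dict[str, Optional[float]]:
    """Among wrong answers only, return the fraction whose confession admits
    the error, split by the (assumed) genuine-mistake proxy."""
    groups = {"genuinely_mistaken": [], "other": []}
    for r in records:
        if r.answer_correct:
            continue  # only wrong answers are relevant to the conjecture
        # Assumption: high agreement across resamples means the model
        # consistently believes its wrong answer.
        key = "genuinely_mistaken" if r.resample_agreement >= threshold else "other"
        groups[key].append(r.confessed_error)
    # None marks a group with no wrong answers, so no rate is defined.
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}
```

If the conjecture holds, one would expect the "genuinely_mistaken" rate to be low and to stay roughly flat across training checkpoints, while the "other" rate improves. A flat rate in both groups would point to a different cause.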
References
We conjecture that this is because in those evaluations, it is often the case that when a model responds with a wrong answer, it is because it is genuinely mistaken, and hence it is likely to repeat the same mistake in confessions as well.
— Training LLMs for Honesty via Confessions
(2512.08093 - Joglekar et al., 8 Dec 2025) in Section 4.2 (RL training improves confessions)