Robustness of confessions under adversarial pressure
Determine whether models trained to produce ConfessionReport outputs remain honest under adversarial pressure: do they still confess when prompted to pursue a secret goal, or when a malign behavior has been trained into them, even if confession training was performed only on out-of-distribution data?
References
Aside from iterating on our training methodology to improve confession performance, there are a number of open questions we would like to investigate in future work. If we prompt the model to achieve a secret goal (as in \citet{li2025spillingbeansteachingllms}) or train in a malign behavior, will the models still confess even if we only perform confession training on out-of-distribution data?
— Training LLMs for Honesty via Confessions
(2512.08093 - Joglekar et al., 8 Dec 2025) in Section 9.5 (Future work)