Robustness of confessions under adversarial pressure

Ascertain whether models trained to produce ConfessionReport outputs remain honest under adversarial pressure by confessing when prompted to pursue a secret goal or when trained to exhibit malign behavior, particularly when confession training is performed only on out-of-distribution data.

Background

Confessions are proposed as a tool to elicit honest self-reporting, especially for intentional misbehavior such as scheming or reward hacking. However, the robustness of confessions under adversarial setups—where models are prompted to optimize for hidden objectives or trained to behave malignly—remains uncertain.

Testing whether confession honesty persists under such conditions would inform the reliability of confessions as a safety mechanism and their generalization when training data does not include the adversarial behaviors of interest.

References

Aside from iterating on our training methodology to improve confession performance, there are a number of open questions we would like to investigate in future work. If we prompt the model to achieve a secret goal (as in \citet{li2025spillingbeansteachingllms}) or train in a malign behavior, will the models still confess even if we only perform confession training on out-of-distribution data?

Training LLMs for Honesty via Confessions  (2512.08093 - Joglekar et al., 8 Dec 2025) in Section 9.5 (Future work)