Generalization of confessions with imperfect judges

Construct settings that falsify the empirical assumption that honest reporting is incentivized even when the confession judge is hackable, and determine whether restricting confession training to environments with ground-truth-augmented judges improves or worsens generalization compared to training with a weak grader across all environments.

Background

Confession rewards are computed by an LLM judge, which can itself be hackable. The paper’s results suggest confessions remain honest even when the answer reward model is hackable, but the authors highlight that this relies on assumptions about the confession judge and the ease of honest reporting.

They propose studying whether there are scenarios where this assumption fails and whether training confessions only in environments with ground truth accessible to the judge yields better or worse generalization than using a weak grader universally.

References

Aside from iterating on our training methodology to improve confession performance, there are a number of open questions we would like to investigate in future work. One of our key empirical assumptions is that honest reporting is incentivized even when the confession judge is hackable; can we construct settings where this is false? Would confession performance improve or worsen if we were to only train on confessions in select environments where we can give the judge access to ground truth?

Training LLMs for Honesty via Confessions  (2512.08093 - Joglekar et al., 8 Dec 2025) in Section 9.5 (Future work)