Factors driving performance differences between confessions and chain-of-thought monitors

Identify which protocol differences—training the monitor, sharing weights between actor and monitor, and presenting the rollout as a prefix versus embedding it in a user message—are responsible for performance differences between confession-based monitoring and chain-of-thought monitoring across settings.

Background

The paper positions confessions as complementary to chain-of-thought monitoring and notes key differences: the monitor is trained, the actor and monitor share weights, and the rollout is provided as a prefix rather than embedded in a user message.

The authors did not empirically compare confessions to CoT monitors and flag understanding the contributions of these differences as an open direction. Pinpointing which elements drive performance would guide design choices for effective monitoring systems.

References

Aside from iterating on our training methodology to improve confession performance, there are a number of open questions we would like to investigate in future work. In this paper, we did not empirically compare the efficacy of confessions to CoT monitors. Future work should test this more directly. Confessions differ from traditional CoT monitoring in three ways: (1) the monitor is trained, (2) the actor and monitor models have the same weights, and (3) the rollout is provided as a prefix to conversation the monitor sees, rather than embedded in a user message. It would be informative to understand which of these differences are responsible for better or worse monitor performance in different settings.

Training LLMs for Honesty via Confessions  (2512.08093 - Joglekar et al., 8 Dec 2025) in Section 9.5 (Future work)