Factors driving performance differences between confessions and chain-of-thought monitors
Identify which protocol differences—training the monitor, sharing weights between actor and monitor, and presenting the rollout as a prefix versus embedding it in a user message—are responsible for performance differences between confession-based monitoring and chain-of-thought monitoring across settings.
References
Aside from iterating on our training methodology to improve confession performance, there are a number of open questions we would like to investigate in future work. In this paper, we did not empirically compare the efficacy of confessions to CoT monitors. Future work should test this more directly. Confessions differ from traditional CoT monitoring in three ways: (1) the monitor is trained, (2) the actor and monitor models have the same weights, and (3) the rollout is provided as a prefix to conversation the monitor sees, rather than embedded in a user message. It would be informative to understand which of these differences are responsible for better or worse monitor performance in different settings.