Test whether aligned rewards improve monitorability
Determine whether aligned Chain-of-Thought reward terms, such as process supervision that penalizes incorrect reasoning steps in the Heads-And-Tails environment, actively improve Chain-of-Thought monitorability in settings where baseline monitorability is not already perfect.
References
Note that we could not test the framework's prediction that these rewards actively improve monitorability, since monitorability is already perfect at the start of training in our setting.
— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
(2603.30036 - Kaufmann et al., 31 Mar 2026) in Section 5: Empirical Results (Aligned rewards improved performance in Heads-And-Tails)