Test whether aligned rewards improve monitorability

Determine whether aligned Chain-of-Thought reward terms, such as process supervision that penalizes incorrect reasoning steps in the Heads-And-Tails environment, actively improve Chain-of-Thought monitorability in settings where baseline monitorability is not already perfect.

Background

The framework predicts that aligned rewards should improve transparency and potentially monitorability by encouraging reasoning that enhances outcome performance. However, in the authors' empirical setup, baseline monitorability was already perfect, preventing direct evaluation of improvements.

The authors explicitly note that they could not test this prediction in their setting, so whether aligned rewards actually produce monitorability gains remains an open question.
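
The ceiling effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that monitorability is scored as the accuracy of a monitor that reads only the Chain-of-Thought and predicts a binary label; this is not necessarily the metric used in the paper. The function names `monitor_accuracy` and `measurable_gain` and the toy data are hypothetical.

```python
from typing import Dict, Sequence


def monitor_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of episodes where a (hypothetical) CoT monitor matches the label."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def measurable_gain(baseline: float, after: float) -> Dict[str, object]:
    """Compare monitorability before and after training with an aligned reward.

    When the baseline is already 1.0 there is no headroom, so an improvement
    cannot be observed even if the reward would have helped.
    """
    headroom = 1.0 - baseline
    return {
        "headroom": headroom,
        "gain": after - baseline,
        "testable": headroom > 0.0,
    }


# Toy labels and monitor predictions (illustrative data, not results from the paper).
labels = [1, 0, 1, 1]

# Setting with perfect baseline monitorability: nothing left to improve.
saturated = measurable_gain(
    baseline=monitor_accuracy([1, 0, 1, 1], labels),
    after=monitor_accuracy([1, 0, 1, 1], labels),
)

# Setting with imperfect baseline monitorability: a gain becomes observable.
unsaturated = measurable_gain(
    baseline=monitor_accuracy([1, 0, 0, 0], labels),
    after=monitor_accuracy([1, 0, 1, 0], labels),
)
```

In the saturated case `testable` is `False`, which mirrors why the prediction could not be evaluated in the original setup; a setting like the second, where the baseline monitor is imperfect, is what a test of the prediction would need.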

References

Note that we could not test the framework’s prediction that these rewards actively improve monitorability, since monitorability is already perfect at the start of training in our setting.

— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought? (2603.30036 - Kaufmann et al., 31 Mar 2026) in Section 5: Empirical Results (Aligned rewards improved performance in Heads-And-Tails)
