Predict performance thresholds that trigger monitorability drops
Determine the levels of the outcome reward R_out and the Chain-of-Thought reward R_cot at which monitorability degradation is actually observed during reinforcement learning post-training.
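One way to make "threshold" operational, offered only as a minimal sketch and not as the method of the cited paper: log a monitorability score alongside a reward at each training checkpoint, then report the reward value at the first checkpoint where monitorability falls noticeably below its early-training baseline. The function name `first_drop_threshold`, the parameters `baseline_steps` and `tolerance`, and the choice of a fixed-tolerance rule are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def first_drop_threshold(
    rewards: Sequence[float],
    monitorability: Sequence[float],
    baseline_steps: int = 3,
    tolerance: float = 0.05,
) -> Optional[float]:
    """Return the reward at the first checkpoint where monitorability drops
    more than `tolerance` below its early-training baseline.

    Hypothetical helper: the baseline is the mean monitorability over the
    first `baseline_steps` checkpoints, and the drop rule is a fixed margin.
    Returns None if no drop is observed.
    """
    if len(rewards) != len(monitorability):
        raise ValueError("rewards and monitorability must have equal length")
    if baseline_steps <= 0 or len(rewards) <= baseline_steps:
        return None

    # Baseline: average monitorability over the earliest checkpoints.
    baseline = sum(monitorability[:baseline_steps]) / baseline_steps

    # Scan the remaining checkpoints in training order.
    for reward, score in zip(rewards[baseline_steps:], monitorability[baseline_steps:]):
        if score < baseline - tolerance:
            return reward
    return None


if __name__ == "__main__":
    # Synthetic numbers for illustration only, not experimental results.
    r_out = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
    monitor = [0.90, 0.91, 0.89, 0.88, 0.80, 0.60]
    print(first_drop_threshold(r_out, monitor))
```

The same helper could be run once with the logged R_out values and once with the logged R_cot values to obtain a separate estimate for each reward. A fixed tolerance is the simplest possible rule; a real study would likely need multiple seeds and a statistical test to separate a genuine drop from noise.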
References
It is difficult to know what level of performance is likely to cause issues in practice. Hence, whether a drop in monitorability will actually be observed is unclear.
— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
(2603.30036 - Kaufmann et al., 31 Mar 2026) in Section 6: Limitations and Future Work