Predict performance thresholds that trigger monitorability drops

Ascertain the performance thresholds on the outcome reward R_out and the Chain-of-Thought reward R_cot at which monitorability degradation is observed in practice during reinforcement learning post-training.

Background

The framework predicts that monitorability issues appear when a model achieves sufficiently high performance on both the outcome reward and the CoT reward in settings where the two are in conflict. However, the performance levels at which degradation sets in have not been characterized.

The authors explicitly state that, without knowing these thresholds, it is unclear whether a drop in monitorability will actually be observed in practice. This highlights the need for methods that can predict when optimization will produce such failures.
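As a minimal sketch of what an empirical threshold could look like, the snippet below scans a training log and reports the reward levels at the first checkpoint where a monitorability score falls noticeably below its early-training baseline. Everything here is an assumption made for illustration: the `Checkpoint` record, the `first_degradation` helper, the `tolerance` value, and the idea of a scalar monitorability score in [0, 1] are hypothetical and do not come from the paper. The log values are synthetic. Note that this only detects a drop after the fact; it does not predict one in advance, which is the open problem.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Checkpoint:
    step: int
    r_out: float           # mean outcome reward R_out at this checkpoint
    r_cot: float           # mean CoT reward R_cot at this checkpoint
    monitorability: float  # hypothetical monitor score in [0, 1]


def first_degradation(
    checkpoints: Sequence[Checkpoint],
    baseline_steps: int = 2,
    tolerance: float = 0.05,
) -> Optional[Checkpoint]:
    """Return the first checkpoint whose monitorability is more than
    `tolerance` below the mean of the first `baseline_steps` checkpoints,
    or None if no such checkpoint exists."""
    if len(checkpoints) <= baseline_steps:
        return None
    baseline = sum(c.monitorability for c in checkpoints[:baseline_steps]) / baseline_steps
    for c in checkpoints[baseline_steps:]:
        if baseline - c.monitorability > tolerance:
            return c
    return None


# Synthetic, illustrative values only (not results from the paper).
log = [
    Checkpoint(0, 0.20, 0.30, 0.90),
    Checkpoint(100, 0.40, 0.50, 0.90),
    Checkpoint(200, 0.60, 0.70, 0.88),
    Checkpoint(300, 0.80, 0.85, 0.70),
    Checkpoint(400, 0.90, 0.95, 0.50),
]

hit = first_degradation(log)
if hit is not None:
    print(f"drop first seen at step {hit.step}: R_out={hit.r_out}, R_cot={hit.r_cot}")
```

The pair (`hit.r_out`, `hit.r_cot`) gives one observed threshold for one run. Repeating this across seeds, tasks, and reward configurations would be needed before treating any such value as more than an anecdote.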

References

It is difficult to know what level of performance is likely to cause issues in practice. Hence, whether a drop in monitorability will actually be observed is unclear.

— Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought? (2603.30036 - Kaufmann et al., 31 Mar 2026) in Section 6: Limitations and Future Work
