Operationalizing “intent” to evade CoT monitors

Formalize and operationalize a measurable definition of a language model’s intent to evade chain-of-thought (CoT) monitors, so that this intent can be evaluated empirically and consistently across models and setups.

Background

To study deliberate CoT control aimed at evading oversight, researchers need a clear, testable notion of “intent.” The authors of the referenced paper note that existing concepts are not sufficiently concrete for measurement or comparison across systems.

A robust operationalization would support fair elicitation of evasion capabilities. It would also clarify when a model is strategically suppressing or altering its CoT, as opposed to failing randomly or following harmless instructions.

References

Finally, it is not fully clear how to operationalize what it means for a model to have an “intent” to evade CoT monitors.

— Reasoning Models Struggle to Control their Chains of Thought (2603.05706 - Yueh-Han et al., 5 Mar 2026) in Section 1 (Introduction)
