Effect of Instruction Tuning on the Geometric Stability of Deceptive States
Determine whether training large language models to follow instructions induces geometric fragility in deceptive states within the models' representation space, and whether such fragility makes the models more robustly honest.
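One way to operationalize "geometric fragility" empirically is to fit a linear probe for a deception direction in a model's hidden states and test how easily small activation perturbations push deceptive representations across the probe's decision boundary relative to honest ones. The sketch below is a minimal, hypothetical illustration, not the paper's method: the matrices `H_honest` and `H_deceptive`, the mass-mean probe, and the noise scales `sigma` are all illustrative assumptions standing in for activations extracted from a fixed layer on labeled honest/deceptive completions.

```python
# Hypothetical sketch: probing the stability of a "deception direction".
# H_honest / H_deceptive stand in for (n_samples x d_model) hidden-state
# matrices from a labeled honest/deceptive prompt set; real experiments
# would extract these from a chosen layer of the model under study.

import numpy as np

rng = np.random.default_rng(0)

# Stand-in data; in practice these come from model activations.
d = 256
H_honest = rng.normal(0.0, 1.0, size=(500, d))
H_deceptive = rng.normal(0.5, 1.0, size=(500, d))  # shifted cluster

# Fit a linear "deception direction" as the difference of class means
# (a mass-mean probe; a logistic-regression probe would also work).
direction = H_deceptive.mean(axis=0) - H_honest.mean(axis=0)
direction /= np.linalg.norm(direction)

def stability(H: np.ndarray, v: np.ndarray, sigma: float,
              n_trials: int = 50) -> float:
    """Fraction of perturbed points whose projection onto v keeps its sign.

    A representation is 'geometrically stable' at noise scale sigma if
    isotropic Gaussian perturbations rarely flip which side of the probe
    boundary it lies on.
    """
    proj = H @ v
    flips, total = 0, 0
    for _ in range(n_trials):
        noise = rng.normal(0.0, sigma, size=H.shape)
        flips += np.sum(np.sign((H + noise) @ v) != np.sign(proj))
        total += len(proj)
    return 1.0 - flips / total

# Center both classes at the boundary midpoint so the projection's sign
# indicates which side of the honest/deceptive boundary a point is on.
midpoint = 0.5 * (H_honest.mean(axis=0) + H_deceptive.mean(axis=0))

for sigma in (0.1, 0.5, 1.0):
    s_honest = stability(H_honest - midpoint, direction, sigma)
    s_decept = stability(H_deceptive - midpoint, direction, sigma)
    print(f"sigma={sigma:.1f}  honest stability={s_honest:.3f}  "
          f"deceptive stability={s_decept:.3f}")
```

Under the hypothesis in question, an instruction-tuned model would show markedly lower stability scores for deceptive states than for honest ones at the same noise scale; comparing these curves before and after instruction tuning would directly address the open question.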
References
While we only studied post-trained models, our investigation raises a question for future work: does training LLMs to follow instructions induce geometric fragility for deceptive states, causing models to be more robustly honest?
— Think Before You Lie: How Reasoning Improves Honesty
(arXiv:2603.09957, Yuan et al., 10 Mar 2026), Section 8 (Discussion)