Effect of Instruction Tuning on Geometric Stability of Deception

Determine whether training large language models to follow instructions induces geometric fragility of deceptive states in the models’ representation space, thereby making honest responses more robust and prevalent.

Background

The paper shows that across multiple model families and datasets, requiring models to generate deliberative tokens before answering increases the likelihood of recommending the honest option. The authors argue this effect is not fully explained by the semantic content of reasoning traces and propose a geometric account: deceptive outputs correspond to metastable, narrower regions in representation space that are more easily disrupted by paraphrasing, resampling, or activation noise.

Building on this geometric perspective, the authors identify a forward-looking question regarding training: beyond inference-time reasoning, does instruction-following training itself shape the representation space so that deceptive states become more fragile and honesty becomes the more stable default? They highlight that their experiments focus on post-trained models and explicitly raise this as a question for future work.

References

While we only studied post-trained models, our investigation raises a question for future work: does training LLMs to follow instructions induce geometric fragility for deceptive states, causing models to be more robustly honest?

— Think Before You Lie: How Reasoning Improves Honesty (2603.09957 - Yuan et al., 10 Mar 2026) in Section 8, Discussion

Effect of Instruction Tuning on Geometric Stability of Deception

Background

References

Related Problems