Mechanisms underlying conversation-induced representation changes

Establish the mechanisms by which large language models' linear representations of high-level concepts (such as factuality) change dynamically over the course of a conversation: identify which internal processes drive these shifts and how the shifts evolve across the token sequence.

Background

The paper demonstrates that the linear representation directions associated with concepts such as factuality and ethics can invert over the course of multi-turn conversations (including role-play and jailbreaking scenarios), while remaining comparatively stable for generic, context-irrelevant questions.

Although the authors measure these dynamics across layers and contexts and show that the effect is robust to certain prompt variations, they explicitly note that the mechanisms underlying these representational shifts remain unestablished, a key gap for interpretability and safety research.
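To make the measurement concrete, here is a minimal, hypothetical sketch (not the authors' code) of how such dynamics can be tracked: a "factuality" direction is estimated as a difference of class means over hidden states, and each conversation turn's hidden state is projected onto it, so a sign flip in the projection corresponds to the kind of inversion described above. All arrays below are synthetic stand-ins for real per-turn activations, and all names are illustrative.

```python
import numpy as np

def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means probe: unit vector pointing from the mean
    'non-factual' activation toward the mean 'factual' activation."""
    d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def turn_projections(turn_acts: list, direction: np.ndarray) -> np.ndarray:
    """Project each turn's hidden state onto the concept direction.
    A sign change across turns is the 'inversion' signature."""
    return np.array([a @ direction for a in turn_acts])

# Synthetic stand-ins for hidden states (d_model = 64)
rng = np.random.default_rng(0)
d_model = 64
base = rng.normal(size=d_model)                 # hypothetical 'factuality' axis
pos = rng.normal(size=(32, d_model)) + base     # activations in factual contexts
neg = rng.normal(size=(32, d_model)) - base     # activations in non-factual contexts

direction = concept_direction(pos, neg)

# Simulate per-turn states drifting from +base toward -base, as might
# happen over a role-play conversation that flips the concept's sign.
turns = [base * (1 - 0.4 * t) + 0.1 * rng.normal(size=d_model) for t in range(6)]
print(np.round(turn_projections(turns, direction), 2))  # values cross zero mid-conversation
```

In practice, the activations would be read from a chosen layer of the model at each assistant turn, and the probe would be fit on held-out labeled contexts; the open question posed here is which internal computation produces the drift that this kind of measurement records.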

References

"Finally, we have not established the mechanisms by which these representational changes occur."

Lampinen et al., "Linear representations in language models can change dramatically over a conversation" (arXiv:2601.20834, 28 Jan 2026), Limitations (Discussion).