Causal verification of the introspection mechanism in Qwen 2.5-32B

Establish whether the vocabulary–activation correspondence in Qwen 2.5-32B arises from the same mechanism as the introspection direction identified in Llama 3.1. Doing so requires extracting a Qwen-specific introspection direction at Layer 8, performing causal steering along it, and characterising the dose–response effects.
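
The three steps named above (direction extraction, steering, dose–response) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the difference-of-means extraction, the function names, the steering strengths, and the synthetic arrays standing in for Layer 8 activations are hypothetical and are not taken from the paper. In a real experiment the activations would come from the model and the readout would be an activation metric or a vocabulary measure on generated text.

```python
import numpy as np

def extract_direction(introspective, descriptive):
    """Unit-norm difference of class means (an assumed extraction method)."""
    diff = introspective.mean(axis=0) - descriptive.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden-state row."""
    return hidden + alpha * direction

def dose_response(hidden, direction, alphas, readout):
    """Mean readout value at each steering strength."""
    return [float(readout(steer(hidden, direction, a)).mean()) for a in alphas]

# Synthetic stand-ins for Layer 8 activations (hypothetical data, not model output).
rng = np.random.default_rng(0)
d_model = 64
true_dir = np.zeros(d_model)
true_dir[0] = 1.0  # planted axis separating the two prompt types
descriptive = rng.normal(size=(200, d_model))
introspective = rng.normal(size=(200, d_model)) + 3.0 * true_dir

direction = extract_direction(introspective, descriptive)

# Hypothetical readout: projection onto the planted axis, standing in for a
# downstream metric measured independently of the extracted direction.
alphas = [0.0, 1.0, 2.0, 4.0]
curve = dose_response(descriptive, direction, alphas, lambda h: h @ true_dir)

for a, y in zip(alphas, curve):
    print(f"alpha={a:.1f}  mean readout={y:.3f}")
```

A readout that rises monotonically with the steering strength is the kind of dose–response pattern that would support a causal role for the direction; a flat curve would suggest the correspondence has a different source.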

Background

The paper demonstrates introspection-specific vocabulary–activation correspondences in Qwen 2.5-32B that vanish in descriptive controls, mirroring the Llama findings but with different vocabulary–metric pairings. The Qwen results, however, are correlational only: no introspection direction was extracted and no steering was performed.

The authors explicitly state that without causal intervention they cannot confirm whether Qwen’s correspondence arises from the same mechanism as Llama’s introspection direction, highlighting a causal verification gap.

References

Our Qwen experiments are observational: we establish correspondence but do not extract a Qwen-specific introspection direction or test causal steering. The three Qwen correspondences survive all statistical controls and the descriptive control, but without causal intervention, we cannot confirm that the mechanism identified in Llama is the same one operating in Qwen.

— When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing (2602.11358 - Dadfar, 11 Feb 2026) in Section 5.6 Cross-Architecture Replication (Causal gap)
