Mechanism equivalence between closed-source frontier models and open-weight models

Determine whether the activation-space directions identified in the open-weight models Llama 3.1 and Qwen 2.5-32B correspond to the same mechanisms that produce the behavioural signatures of extended self-examination observed in the closed-source frontier models Claude Opus 4.5 and ChatGPT 5.2.

Background

Behavioural signatures of extended self-examination were established on closed-source frontier models (Claude, ChatGPT, Grok), while mechanistic direction identification and steering were performed on open-weight models (Llama, Qwen).

The authors explicitly note that they cannot directly verify whether the directions found in the open-weight models are the same mechanisms that produce the behavioural signatures in the closed-source models, because the closed-source models do not expose their internal activations.
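
To make the gap concrete, the sketch below shows why direction identification and steering depend on activation access. It is a minimal illustration on synthetic vectors, assuming a difference-of-means estimate of a direction, which is a common technique and not necessarily the procedure used in the paper. All names here (`difference_of_means_direction`, `steer`, `self_ref`, `control`) are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def difference_of_means_direction(pos, neg):
    """Unit vector from the mean of `neg` to the mean of `pos` (assumed method)."""
    mp, mn = mean_vector(pos), mean_vector(neg)
    return unit([a - b for a, b in zip(mp, mn)])


def steer(activation, direction, alpha):
    """Add alpha * direction to an activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Synthetic stand-ins for hidden activations; a real study would read these
# from a chosen layer of an open-weight model.
rng = random.Random(0)
dim = 8
true_dir = unit([1.0] + [0.0] * (dim - 1))


def sample(shift):
    """Gaussian noise plus a shift along the planted direction."""
    return [rng.gauss(0.0, 0.1) + shift * t for t in true_dir]


self_ref = [sample(1.0) for _ in range(200)]  # "self-examination" prompts
control = [sample(0.0) for _ in range(200)]   # control prompts

direction = difference_of_means_direction(self_ref, control)
cosine = dot(direction, true_dir)             # recovery of the planted direction
steered = steer(control[0], direction, alpha=2.0)
shift_along_direction = dot(steered, direction) - dot(control[0], direction)
```

Every step above reads or writes activation vectors. For a closed-source model only prompts and outputs are observable, so neither the direction estimate nor the steering intervention can be run there, and any comparison has to rest on behavioural evidence alone.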

References

We cannot directly verify that the directions identified in Llama and Qwen are the same mechanisms producing the behavioural signatures in Claude and GPT.

— When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing (2602.11358 - Dadfar, 11 Feb 2026) in Section 6.5 Limitations (Closed-model / open-weight gap)
