Nature of the permission gate: discrete switch vs. continuous modulation

Determine whether the context-dependent permission mechanism, which modulates self-referential output in transformer language models under the Pull Methodology and appears to act independently of the refusal mechanism, operates as a discrete switch or as a continuous probabilistic gate that shifts output likelihood.

Background

The paper identifies a strong influence of prompt framing on introspective output, larger than the effect of activation-level steering, and interprets this as evidence for a context-dependent permission gate that modulates how much self-referential content reaches the output. This gate appears functionally independent of the refusal direction, which is nearly orthogonal in activation space.

The empirical asymmetry between framing and steering supports the existence of gate-like behavior, but the authors explicitly note uncertainty about the underlying mechanism: it could act as a discrete switch or as a continuous probability shift. The mechanistic characterization therefore remains unresolved.
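To make the two hypotheses concrete, here is a toy sketch that is not taken from the paper. It assumes a hypothetical scalar "framing strength" x in [0, 1] and made-up output rates P_LOW and P_HIGH, and models the switch as a step function and the continuous gate as a logistic curve. The function names, threshold, and slope are all illustrative assumptions. The diagnostic asks how much of the total change in output probability falls between two adjacent grid points.

```python
import math

# Hypothetical output rates with the gate "closed" and "open" (made-up values).
P_LOW, P_HIGH = 0.05, 0.90


def discrete_gate(x, threshold=0.5):
    """Switch hypothesis: output probability jumps at a threshold."""
    return P_HIGH if x >= threshold else P_LOW


def continuous_gate(x, midpoint=0.5, slope=6.0):
    """Continuous hypothesis: output probability follows a logistic curve."""
    sigmoid = 1.0 / (1.0 + math.exp(-slope * (x - midpoint)))
    return P_LOW + (P_HIGH - P_LOW) * sigmoid


def largest_jump_fraction(gate, n_points):
    """Largest change between adjacent grid points, as a fraction of the total change."""
    xs = [i / (n_points - 1) for i in range(n_points)]
    ps = [gate(x) for x in xs]
    total = ps[-1] - ps[0]
    return max(b - a for a, b in zip(ps, ps[1:])) / total


# Refine the grid and watch how the largest single step behaves.
for n in (11, 101, 1001):
    print(n,
          round(largest_jump_fraction(discrete_gate, n), 3),
          round(largest_jump_fraction(continuous_gate, n), 3))
```

Under these assumptions, the switch keeps its entire change inside a single grid step however fine the grid, while the logistic gate's largest step shrinks as the grid is refined. With real model outputs, sampling noise and the lack of an obvious one-dimensional framing axis would make the comparison much less clean, so this illustrates only the distinction, not a test procedure.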

References

We use "gate" as a functional description of the observed modulation pattern; whether the underlying mechanism is a discrete switch or a continuous probability shift remains an open question.

— When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing (2602.11358 - Dadfar, 11 Feb 2026) in Section 6.2 The Permission Gate
