Robustness of reported trends across alternative probe architectures

Establish whether the generalisation trends observed for linear and attention probes under response‑strategy and domain shifts also hold for other probe architectures, including deeper non‑linear probes and alternative monitoring methods.

Background

The study primarily analyses linear and attention probes and reports that domain shifts degrade performance more than response-strategy shifts across multiple behaviours and models. However, many other probe architectures exist, including "deep" probes with many layers and non-linearities, as well as other monitoring approaches.

Determining whether the reported generalisation patterns extend to other probe types is important for designing reliable monitoring systems and for understanding whether certain architectures are inherently more robust to distribution shifts.
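One way to frame such a comparison is an architecture-agnostic evaluation harness: treat every probe as a function from activations to a score, then measure how much its performance drops under each kind of shift. The sketch below is a minimal illustration under stated assumptions, not the study's evaluation code. AUROC is assumed as the metric, and the names `auroc`, `evaluate`, `shift_gaps`, `probe_a`, `probe_b` and all toy numbers are hypothetical and made up for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[List[float], int]        # (activation vector, binary label)
Probe = Callable[[List[float]], float]   # any architecture, reduced to a scoring function


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(probe: Probe, data: Sequence[Example]) -> float:
    """Score every example with the probe and return the AUROC."""
    scores = [probe(x) for x, _ in data]
    labels = [y for _, y in data]
    return auroc(scores, labels)


def shift_gaps(
    probes: Dict[str, Probe],
    in_dist: Sequence[Example],
    shifted: Dict[str, Sequence[Example]],
) -> Dict[str, Dict[str, float]]:
    """For each probe and each shift, return in-distribution AUROC minus shifted AUROC."""
    gaps: Dict[str, Dict[str, float]] = {}
    for name, probe in probes.items():
        base = evaluate(probe, in_dist)
        gaps[name] = {shift: base - evaluate(probe, data) for shift, data in shifted.items()}
    return gaps


# Toy, hand-written data (NOT results from any experiment).
in_dist = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([1.0, 1.0], 1)]
shifted = {
    # Hypothetical domain shift: the first feature flips its relation to the label.
    "domain": [([1.0, 0.0], 0), ([0.9, 0.2], 0), ([0.1, 0.8], 1), ([0.0, 1.0], 1)],
    # Hypothetical response-strategy shift: both features move only slightly.
    "response_strategy": [([0.1, 0.0], 0), ([0.3, 0.2], 0), ([0.7, 0.8], 1), ([0.9, 1.0], 1)],
}
# Stand-ins for trained probes; a real study would plug in linear, attention or deep probes.
probes: Dict[str, Probe] = {"probe_a": lambda x: x[0], "probe_b": lambda x: x[1]}

gaps = shift_gaps(probes, in_dist, shifted)
```

Because each probe is reduced to a scoring function, the same harness could wrap a linear, attention or deeper non-linear probe without modification, which keeps the per-shift gaps directly comparable across architectures.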

References

Moreover, we tested two kinds of probes, linear and attention, but it is unclear whether our results hold for other probe types. Future work should evaluate more architectures for generalisation failures and overfitting issues, especially 'deep' probes that contain many layers and non-linearities \citep{anthropic_probes_cost}.

— That's not natural: The Impact of Off-Policy Training Data on Probe Performance (2511.17408 - Kirch et al., 21 Nov 2025) in Section: Limitations and Future Work
