Robustness of reported trends across alternative probe architectures
Establish whether the generalisation trends observed for linear and attention probes under response‑strategy and domain shifts also hold for other probe architectures, including deeper non‑linear probes and alternative monitoring methods.
References
Moreover, we tested two kinds of probes, linear and attention, but it is unclear whether our results hold for other probe types. Future work should evaluate more architectures for generalisation failures and overfitting issues, especially 'deep' probes that contain many layers and non-linearities \citep{anthropic_probes_cost}.
— That's not natural: The Impact of Off-Policy Training Data on Probe Performance
(2511.17408 - Kirch et al., 21 Nov 2025) in Section: Limitations and Future Work