Benchmarking Deception Probes via Black-to-White Performance Boosts

Published 16 Jul 2025 in cs.AI and cs.LG | (2507.12691v1)

Abstract: AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a LLM during deceptive versus honest responses. However, it's unclear how effective these probes are at detecting deception in practice, nor whether such probes are resistant to simple counter strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.