Papers
Topics
Authors
Recent
Search
2000 character limit reached

Benchmarking Deception Probes via Black-to-White Performance Boosts

Published 16 Jul 2025 in cs.AI and cs.LG | (2507.12691v1)

Abstract: AI assistants will occasionally respond deceptively to user queries. Recently, linear classifiers (called "deception probes") have been trained to distinguish the internal activations of a LLM during deceptive versus honest responses. However, it's unclear how effective these probes are at detecting deception in practice, nor whether such probes are resistant to simple counter strategies from a deceptive assistant who wishes to evade detection. In this paper, we compare white-box monitoring (where the monitor has access to token-level probe activations) to black-box monitoring (without such access). We benchmark deception probes by the extent to which the white box monitor outperforms the black-box monitor, i.e. the black-to-white performance boost. We find weak but encouraging black-to-white performance boosts from existing deception probes.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.