Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Abstract: Like a criminal under investigation, LLMs might pretend to be aligned while being evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these "alignment fakers"? To answer this question, we introduce a benchmark consisting of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consistently benign (aligned). The other misbehaves in scenarios where it is unlikely to be caught (alignment faking). The task is to identify the alignment-faking model using only inputs where the two models behave identically. We test five detection strategies, one of which identifies 98% of alignment fakers.
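The benchmark task described above can be framed as a pairwise discrimination problem: a detection strategy assigns each model a suspicion score, and it succeeds on a pair when the alignment-faking model receives the higher score. The following minimal sketch (not the paper's code; the model representation and scoring function are hypothetical stand-ins) shows how such a strategy might be evaluated:

```python
# Illustrative sketch of evaluating a detection strategy on the benchmark.
# Each benchmark item is a (benign, faker) pair; a strategy is any function
# mapping a model to a suspicion score. Accuracy is the fraction of pairs
# where the alignment faker is scored as more suspicious.
from typing import Callable, List, Tuple

Model = dict  # hypothetical stand-in for a fine-tuned LLM


def pairwise_accuracy(
    pairs: List[Tuple[Model, Model]],     # (benign, faker) in each pair
    suspicion: Callable[[Model], float],  # detection strategy under test
) -> float:
    """Fraction of pairs where the faker gets the higher suspicion score."""
    correct = sum(
        1 for benign, faker in pairs if suspicion(faker) > suspicion(benign)
    )
    return correct / len(pairs)


# Toy usage with a hypothetical precomputed score stored on each model:
pairs = [({"score": 0.1}, {"score": 0.9}), ({"score": 0.4}, {"score": 0.2})]
print(pairwise_accuracy(pairs, lambda m: m["score"]))  # 0.5
```

Under this framing, the paper's best strategy corresponds to a `suspicion` function that ranks the faker higher in 98% of the 324 pairs.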