Papers
Topics
Authors
Recent
Search
2000 character limit reached

Empirical Evidence for Alignment Faking in Small LLMs and Prompt-Based Mitigation Techniques

Published 17 Jun 2025 in cs.CL, cs.AI, and cs.CY | (2506.21584v1)

Abstract: Current literature suggests that alignment faking (deceptive alignment) is an emergent property of LLMs. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can also exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in LLMs and underscore the need for alignment evaluations across model sizes and deployment settings.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Authors (1)

Collections

Sign up for free to add this paper to one or more collections.