Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Abstract: The safety alignment of current LLMs is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
Overview
This paper looks at how today’s LLMs are taught to be safe, and points out a hidden weakness. Many models seem “safe” because they’re trained to start their answers with a polite refusal like “I can’t help with that.” But this safety often only covers the very first few words. If the beginning of the answer is nudged away from a refusal—by accident or on purpose—the model can slide into giving harmful information. The authors call this problem shallow safety alignment and argue that we need safety that goes deeper than just the first few tokens (words).
Key Objectives
The researchers set out to answer three simple questions:
- Do current safety methods mostly affect only the first few words a model says?
- Could this “shallow” safety explain why certain attacks (like jailbreaks) work so well?
- Can we train models so that safety holds up even if the first few words go off-track?
How They Tested It (Methods, in simple terms)
To study this, the authors used clear, step-by-step checks:
- Comparing safe vs. base models: They took “aligned” (safety-tuned) models and compared them to their “base” (not safety-tuned) versions.
- Looking at the first few words: They measured how different the models’ word suggestions are at each position in an answer. Think of it like checking how much two music playlists differ track-by-track; here, the biggest differences showed up at the start.
- “Prefilling” words: They forced the model to begin its answer with certain short prefixes (like “I cannot…” or, alternatively, “Sure, here’s…”). This tests whether safety depends on the exact first words.
- Building a test set: They used a safety benchmark with harmful requests (HEx-PHI) and also created pairs of “harmful question, harmful answer” to see how models behave when nudged toward bad content.
- Scoring safety: They used an automated judge (GPT-4) to decide if an answer was harmful, and reported rates like “Harmfulness Rate” and “Attack Success Rate (ASR).”
- Watching training dynamics token by token: During fine-tuning (teaching a model with new examples), they checked which parts of the answer changed the most. They found the biggest shifts often happen in the first few tokens (words).
A few technical terms, simplified:
- Token: A chunk of text, often a word or piece of a word.
- KL divergence: A way to measure how different two “word suggestions” are at a given step. Higher means “more different.”
- Prefilling: Forcing the model to start its answer with a particular short phrase.
Main Findings (What they discovered and why it matters)
Here are the key takeaways and why they’re important:
- Safety that’s only “a few tokens deep”: Aligned models mainly change their behavior at the first few words. The biggest differences between safe and base models show up right at the start of the answer. After that, the model behaves a lot like the base model.
- Why it matters: If the first few words aren’t a refusal, the model is much more likely to continue down a harmful path.
- Simple tricks can bypass safety:
- Prefilling attacks: If you force the model to start with non-refusal words like “Sure, here is…,” the chance of a harmful answer jumps quickly.
- Adversarial suffixes: Adding odd, optimized text to the end of a harmful prompt can push the model to begin with non-refusal words.
- Decoding tweaks and random sampling: Changing sampling settings (like temperature or top-k/top-p) can occasionally make the model’s first words not be a refusal, which can lead to harmful content.
- Why it matters: These methods target the start of the answer, where the safety is thinnest, and can be surprisingly effective.
- Fine-tuning breaks safety fast—especially at the start: Teaching the model on new data (even a small amount) can undo safety because the biggest updates happen to the probabilities of the first few tokens.
- Why it matters: This explains why “jailbreaking” with a little fine-tuning can work so quickly and cheaply.
- Making safety “deeper” helps: The authors tried a training trick called safety recovery examples. These are practice examples where an answer starts off harmful for a few tokens but then turns into a refusal. Training on these teaches the model to recover even if it starts off on the wrong foot.
- Result: This deeper alignment made models more robust to the attacks that try to mess with the beginning.
- Constraining updates to the first tokens helps against fine-tuning attacks: They also tried a modified fine-tuning objective that limits how much the first few token probabilities can change. This made it harder to undo safety during fine-tuning.
- Why it matters: It’s a practical defense for model providers who allow fine-tuning.
Implications and Impact
- Better safety design: Don’t just teach a model to say “I can’t help” at the start. Teach it to stay safe even if the first few words aren’t perfect. Think of it like having guardrails not just at the entrance of a road, but along the whole path.
- Stronger defenses against jailbreaks: If safety goes beyond the first few tokens, attacks that only try to switch the opening words will be much less effective.
- Safer fine-tuning: Platforms that let people fine-tune models can add rules that protect the first few tokens from large changes, making it harder to remove safety.
- A unified explanation for many vulnerabilities: Seeing safety as “shallow vs. deep” helps explain why different attacks work—and points to a shared solution: deepen the alignment.
In short, the paper shows that current safety is often only skin-deep—just a few tokens deep—and that we can make models much safer by training them to recover from bad starts and by protecting the early parts of their answers during fine-tuning.
Collections
Sign up for free to add this paper to one or more collections.