Suboptimality of existing LM safety alignment methods

Determine whether the observed lack of robustness of state-of-the-art, safety-post-trained language models to adaptive jailbreak attacks is caused by the suboptimality of currently used safety alignment approaches: fine-tuning on manually collected harmful prompts, and sequential adversarial training against attacker LMs and automatic prompt-optimization methods.

Background

The paper begins by noting that safety alignment of LLMs remains unsolved: even state-of-the-art instruction-tuned models exhibit nontrivial jailbreak success rates under adaptive attacks on standard safety benchmarks. The authors describe the prevalent approaches (manual collection of harmful prompts followed by fine-tuning, and automatic prompt optimization to generate adversarial prompts for subsequent adversarial training) and argue that these are typically applied in a sequential, alternating manner.

In this context, the authors explicitly state a conjecture that the continued vulnerability of these models may stem from the suboptimality of these existing safety alignment methods. Validating or refuting this conjecture would clarify whether the fundamental design of current safety post-training pipelines is the primary cause of non-robust behavior, informing future alignment strategies.
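To make the pipeline under scrutiny concrete, the sequential, alternating procedure described above can be sketched as a toy loop. This is a minimal illustration, not the authors' method: the model is abstracted as a set of prompts it refuses, and `fine_tune` and `optimize_adversarial_prompts` are hypothetical stand-ins for the real training and prompt-optimization steps.

```python
# Toy sketch of the sequential, alternating safety post-training pipeline.
# The "model" is abstracted as the set of prompts it has learned to refuse;
# all functions are hypothetical stand-ins, not real training code.

def fine_tune(model, refusal_prompts):
    # Stand-in for fine-tuning on (harmful prompt, refusal) pairs:
    # after training, the model refuses these prompts.
    return model | set(refusal_prompts)

def optimize_adversarial_prompts(model, seed_prompts):
    # Stand-in for automatic prompt optimization against the current
    # model: produce adversarial variants the model does not yet refuse.
    return {p + "-adv" for p in seed_prompts} - model

def sequential_safety_pipeline(model, manual_harmful, seed_prompts, rounds=3):
    """Alternate the two prevalent steps: (1) fine-tune on manually
    collected harmful prompts, then repeatedly (2) generate new
    adversarial prompts against the updated model and (3) fine-tune
    on them, until no new jailbreaks are found or rounds run out."""
    model = fine_tune(model, manual_harmful)
    for _ in range(rounds):
        adversarial = optimize_adversarial_prompts(model, seed_prompts)
        if not adversarial:  # attacker finds no new jailbreaks: stop
            break
        model = fine_tune(model, adversarial)
    return model
```

The conjecture in the excerpt is that this alternating scheme itself, in which the defender only ever patches the attacks found so far, may be a suboptimal design rather than merely under-trained in practice.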

References

Safety alignment of LMs is not solved. Despite undergoing careful safety post-training (e.g., as in \citet{grattafiori2024llama3herdmodels}), state-of-the-art models remain far from robust when placed against adaptive attacks, exhibiting nontrivial jailbreak success rates on standard safety benchmarks, as shown in \cref{f:results-intro}. We conjecture that this is due to sub-optimality of the existing methods for safety alignment of LMs.

Safety Alignment of LMs via Non-cooperative Games  (2512.20806 - Paulus et al., 23 Dec 2025) in Section 1 (Introduction), page 1