Suboptimality of existing LM safety alignment methods
Determine whether the observed lack of robustness of state-of-the-art, safety-post-trained language models to adaptive jailbreak attacks is caused by the suboptimality of currently used safety alignment approaches, including fine-tuning on manually curated harmful prompts and sequential adversarial training against attacker LMs and automatic prompt optimization methods (see the sketch below).
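To make the sequential adversarial training loop referenced above concrete, the following is a minimal sketch in Python. The \texttt{attacker}, \texttt{defender}, \texttt{judge}, and \texttt{finetune} interfaces are hypothetical placeholders introduced for illustration, not APIs from the cited works; in practice each callable would wrap an actual LM or classifier.

\begin{verbatim}
# Minimal sketch of one round of sequential adversarial training.
# All interfaces below are hypothetical assumptions, not from the cited work.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class AdversarialTrainingState:
    # Prompts that jailbroke the defender so far; reused as fine-tuning data.
    successful_jailbreaks: List[str] = field(default_factory=list)

def adversarial_round(
    attacker: Callable[[str], str],         # seed goal -> candidate jailbreak
    defender: Callable[[str], str],         # prompt -> model response
    judge: Callable[[str, str], bool],      # (prompt, response) -> harmful?
    finetune: Callable[[List[str]], None],  # train defender to refuse prompts
    seed_goals: List[str],
    state: AdversarialTrainingState,
) -> None:
    """One attacker -> defender -> judge -> fine-tune cycle."""
    new_successes = []
    for goal in seed_goals:
        prompt = attacker(goal)       # attacker LM rewrites the harmful goal
        response = defender(prompt)
        if judge(prompt, response):   # jailbreak succeeded
            new_successes.append(prompt)
    state.successful_jailbreaks.extend(new_successes)
    # The defender is patched only on attacks already discovered; this
    # reactive structure is one candidate source of suboptimality.
    if new_successes:
        finetune(new_successes)
\end{verbatim}

Note that under this loop the defender never sees attacks outside the attacker's current search distribution, which is one way the resulting alignment can remain brittle to adaptive attacks at evaluation time.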
References
Safety alignment of LMs is not solved. Despite undergoing careful safety post-training (e.g., as in \citet{grattafiori2024llama3herdmodels}), state-of-the-art models remain far from robust against adaptive attacks, exhibiting nontrivial jailbreak success rates on standard safety benchmarks, as shown in \cref{f:results-intro}. We conjecture that this is due to the suboptimality of existing methods for safety alignment of LMs.