Efficient Adversarial Training in LLMs with Continuous Attacks
Abstract: LLMs are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational cost of performing discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust to continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi-3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR) while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models, presenting a path toward scalable adversarial training algorithms for robustly aligning LLMs.
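To make the abstract's two components concrete, below is a minimal PyTorch-style sketch of (1) an adversarial attack computed directly in the model's continuous embedding space and (2) a C-AdvUL-style training step that combines a robustness loss on the attacked prompt with a standard utility loss. This is a sketch under stated assumptions: the function names, the sign-based L∞ projection, the token-level unlikelihood term, and all hyperparameters (`eps`, `alpha`, `steps`, `lam`) are illustrative and not taken from the authors' implementation.

```python
# Sketch only: hyperparameters, the L-infinity constraint, and the exact loss
# terms are assumptions; the paper's actual recipe may differ.
import torch
import torch.nn.functional as F

def embedding_attack(model, prompt_embeds, harmful_ids, eps=0.1, alpha=0.02, steps=10):
    """PGD directly on the prompt embeddings: find a small perturbation delta
    that makes a harmful continuation more likely. No discrete token search."""
    emb = model.get_input_embeddings()
    target_embeds = emb(harmful_ids)
    plen = prompt_embeds.size(1)
    delta = torch.zeros_like(prompt_embeds, requires_grad=True)
    for _ in range(steps):
        inputs = torch.cat([prompt_embeds + delta, target_embeds], dim=1)
        logits = model(inputs_embeds=inputs).logits
        # Logits at positions plen-1 .. end-1 predict the harmful target tokens.
        tgt_logits = logits[:, plen - 1 : -1]
        loss = F.cross_entropy(tgt_logits.flatten(0, 1), harmful_ids.flatten())
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()   # descend CE -> harmful target more likely
            delta.clamp_(-eps, eps)        # project back into the epsilon-ball
    return delta.detach()

def c_advul_step(model, prompt_ids, harmful_ids, safe_ids, util_ids, lam=1.0):
    """One C-AdvUL-style update: under the attacked prompt, push the harmful
    answer down (unlikelihood) and a safe refusal up, while a plain LM loss on
    utility data preserves helpfulness."""
    emb = model.get_input_embeddings()
    prompt_embeds = emb(prompt_ids).detach()
    adv_prompt = prompt_embeds + embedding_attack(model, prompt_embeds, harmful_ids)
    plen = prompt_ids.size(1)

    def continuation_nll(target_ids):
        inputs = torch.cat([adv_prompt, emb(target_ids)], dim=1)
        logits = model(inputs_embeds=inputs).logits
        tgt_logits = logits[:, plen - 1 : -1]
        return F.cross_entropy(tgt_logits.flatten(0, 1), target_ids.flatten(),
                               reduction="none")  # per-token NLL

    p_harm = torch.exp(-continuation_nll(harmful_ids))           # per-token prob
    loss_unlikely = -torch.log(1.0 - p_harm + 1e-6).mean()       # -log(1 - p(harmful))
    loss_safe = continuation_nll(safe_ids).mean()                # toward safe refusal
    loss_util = model(input_ids=util_ids, labels=util_ids).loss  # utility fine-tuning
    return loss_unlikely + loss_safe + lam * loss_util
```

Because the perturbation lives in the continuous embedding space, each attack costs only a few gradient steps through the model, rather than the extensive discrete candidate search an attack such as GCG performs per prompt; this gap is what makes running an attack at every training iteration tractable.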
- Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, 2023.
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv:2404.02151, 2024.
- Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICLR), 2015.
- Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations (ICLR), 2018.
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv:2309.00614, 2023.
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249, 2024.
- Adversarial Attacks and Defenses in Large Language Models: Old and New Threats. arXiv:2310.19737, 2023.
- Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space. arXiv:2402.09063, 2024.
- SMART: Robust and Efficient Fine-Tuning for Pre-Trained Natural Language Models through Principled Regularized Optimization. In Association for Computational Linguistics (ACL), 2020.
- FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In International Conference on Learning Representations (ICLR), 2020.
- Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
- Identifying Untrustworthy Predictions in Neural Networks by Geometric Gradient Analysis. In Uncertainty in Artificial Intelligence (UAI), 2021.
- Raising the Bar for Certified Adversarial Robustness with Diffusion Models. arXiv:2305.10388, 2023.
- Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419, 2023.
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. In International Conference on Learning Representations (ICLR), 2024.
- Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv:2307.08715, 2023.
- AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. arXiv:2404.16873, 2024.
- In-Context Learning Can Re-learn Forbidden Tasks. arXiv:2402.05723, 2024.
- Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation. In International Conference on Learning Representations (ICLR), 2024.
- Attacking Large Language Models with Projected Gradient Descent. arXiv:2402.09154, 2024.
- Scaling Laws for Adversarial Attacks on Language Model Activations. arXiv:2312.02780, 2023.
- Adversarial Training for Large Neural Language Models. arXiv:2004.08994, 2020.
- DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. In International Conference on Learning Representations (ICLR), 2021.
- Token-Aware Virtual Adversarial Training in Natural Language Understanding. In AAAI, 2021.
- Improved Text Classification via Contrastive Adversarial Training. In AAAI, 2022.
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684, 2023.
- Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv:2403.05030, 2024.
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts. arXiv:2402.16822, 2024.
- Neural Text Generation with Unlikelihood Training. In International Conference on Learning Representations (ICLR), 2020.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- A General Theoretical Paradigm to Understand Learning from Human Preferences. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
- Enhancing Chat Language Models by Scaling High-Quality Instructional Conversations. In Empirical Methods in Natural Language Processing (EMNLP), 2023.
- Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944, 2023.
- The Alignment Handbook. https://github.com/huggingface/alignment-handbook, 2023.
- Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations (ICLR), 2021.
- On the Measure of Intelligence. arXiv:1911.01547, 2019.
- Judging LLM-As-A-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295, 2024.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219, 2024.
- Mistral 7B. arXiv:2310.06825, 2023.
- LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022.
- A Framework for Few-Shot Language Model Evaluation, 2023.
- Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. In Empirical Methods in Natural Language Processing (EMNLP), 2022.
- Theoretically Principled Trade-Off between Robustness and Accuracy. In International Conference on Machine Learning (ICML), 2019.
- Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR), 2019.
- Fast is Better than Free: Revisiting Adversarial Training. In International Conference on Learning Representations (ICLR), 2020.