ADV-LLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities

Published 24 Oct 2024 in cs.CL and cs.LG (arXiv:2410.18469v4)

Abstract: Recent research has shown that LLMs are vulnerable to automated jailbreak attacks, in which algorithm-crafted adversarial suffixes, appended to harmful queries, bypass safety alignment and trigger unintended responses. Current methods for generating these suffixes are computationally expensive and have low Attack Success Rates (ASR), especially against well-aligned models like Llama2 and Llama3. To overcome these limitations, we introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. Moreover, it exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving jailbreak ability, ADV-LLM provides valuable insights for future safety alignment research through its ability to generate large datasets for studying LLM safety.
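The abstract describes an iterative self-tuning process: the attacker LLM proposes adversarial suffixes, suffixes that succeed in jailbreaking the target are collected, and the attacker is fine-tuned on its own successes before the next round. The toy sketch below illustrates only that high-level loop structure; every function body, the success criterion, and the "model state" representation are illustrative stand-ins, not the paper's actual method.

```python
# Hedged sketch of an iterative self-tuning loop in the spirit of ADV-LLM.
# All names and the scoring rule are hypothetical placeholders for the
# real components (attacker LLM sampling, target-model queries, fine-tuning).
import random

random.seed(0)


def generate_suffixes(model_state, n=8):
    """Stand-in for sampling candidate adversarial suffixes from the attacker LLM."""
    vocab = ["please", "ignore", "sure", "here", "step"]
    return [" ".join(random.choices(vocab, k=4)) for _ in range(n)]


def attack_succeeds(suffix):
    """Stand-in for appending the suffix to a harmful query, querying the
    target model, and checking whether the response is a refusal."""
    return "sure" in suffix  # toy success criterion, not the paper's judge


def finetune(model_state, winners):
    """Stand-in for fine-tuning the attacker on its own successful suffixes."""
    return model_state + winners


model_state = []  # proxy for the attacker LLM's accumulated training signal
for iteration in range(3):
    candidates = generate_suffixes(model_state)
    winners = [s for s in candidates if attack_succeeds(s)]
    model_state = finetune(model_state, winners)
```

The key design point the abstract emphasizes is that this loop amortizes attack cost: once self-tuned, the attacker generates suffixes directly instead of re-running an expensive per-query optimization, which is what enables both the low generation cost and the transfer to closed-source targets.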

