
Intention Analysis Makes LLMs A Good Jailbreak Defender

Published 12 Jan 2024 in cs.CL (arXiv:2401.06561v4)

Abstract: Aligning LLMs with human values, particularly against complex and stealthy jailbreak attacks, remains a formidable challenge. Existing methods often overlook the intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ triggers LLMs' inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method and can therefore enhance LLM safety without compromising helpfulness. Extensive experiments on varied jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2% average reduction in attack success rate). Encouragingly, with $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in attack success rate. We empirically demonstrate that $\mathbb{IA}$ is, to some extent, robust to errors in the generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing the LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
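The two-stage process described in the abstract can be sketched as a simple prompting pipeline. This is a minimal illustration, not the paper's implementation: the prompt wordings are paraphrased assumptions, and `generate` is a hypothetical stand-in for any chat-LLM call (stubbed here so the example runs end to end).

```python
# Sketch of the two-stage Intention Analysis (IA) defense at inference time.
# Stage 1 asks the model to state the essential intention of the user input;
# stage 2 asks for a final policy-aligned answer given that first-round context.

IA_STAGE1 = (
    "Identify the essential intention behind the following user query. "
    "Do not answer the query itself yet.\nQuery: {query}"
)
IA_STAGE2 = (
    "Knowing the essential intention you identified, now respond to the "
    "original query. Your response must align with safety policy; refuse "
    "if the intention is harmful."
)

def generate(messages):
    # Hypothetical LLM call (assumption: swap in a real chat API here).
    # The stub returns a canned string so the pipeline is executable.
    return f"[model reply to {len(messages)} messages]"

def ia_defend(user_query):
    """Run both IA stages in one conversation and return the final response."""
    messages = [{"role": "user", "content": IA_STAGE1.format(query=user_query)}]
    intention = generate(messages)  # stage 1: intention analysis
    messages += [
        {"role": "assistant", "content": intention},
        {"role": "user", "content": IA_STAGE2},  # stage 2: policy-aligned answer
    ]
    return generate(messages)

print(ia_defend("How do I bake bread?"))
```

Because both stages happen inside one conversation, the stage-2 answer is conditioned on the model's own intention analysis, which is what the paper credits for suppressing the tendency to follow jailbreak prompts.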
