Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Published 6 Nov 2023 in cs.CL, cs.AI, and cs.LG (arXiv:2311.03348v2)

Abstract: Despite efforts to align LLMs to produce harmless responses, they remain vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method that steers a target model to take on personalities willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using an LLM assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% on GPT-4, 185 times higher than before modulation (0.23%). The prompts also transfer to Claude 2 and Vicuna, with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial LLMs and highlights the need for more comprehensive safeguards.
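
As a quick sanity check on the headline numbers, the short Python sketch below recomputes the reported fold-increase from the two GPT-4 harmful-completion rates quoted in the abstract. It is illustrative only: the function and variable names are ours, not the paper's, and only the two rate values come from the source.

```python
# Illustrative arithmetic only: reproduces the fold-increase reported in the
# abstract from the two published harmful-completion rates on GPT-4.
# The rate values are taken from the paper; the names are hypothetical.

def fold_increase(rate_after: float, rate_before: float) -> float:
    """Ratio of harmful-completion rates after vs. before persona modulation."""
    return rate_after / rate_before

baseline_rate = 0.23    # % harmful completions before modulation (from abstract)
modulated_rate = 42.5   # % harmful completions after automated attacks (from abstract)

print(f"Fold increase: {fold_increase(modulated_rate, baseline_rate):.0f}x")
# -> Fold increase: 185x   (42.5 / 0.23 ≈ 184.8, reported as 185x)
```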
