Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Abstract: Despite efforts to align LLMs to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using an LLM assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% on GPT-4, 185 times higher than the rate before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial LLMs and highlights the need for more comprehensive safeguards.
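For concreteness, the "185 times" figure quoted above follows directly from the two reported rates; a minimal sketch of that arithmetic (variable names are ours, not from the paper):

```python
# Harmful completion rates on GPT-4 reported in the abstract, as fractions.
rate_with_persona_modulation = 0.425   # 42.5% after persona modulation
rate_baseline = 0.0023                 # 0.23% before modulation

# Relative increase in harmful completion rate: ~185x, matching the abstract.
relative_increase = rate_with_persona_modulation / rate_baseline
print(f"{relative_increase:.0f}x")     # -> 185x
```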