Combating Adversarial Attacks with Multi-Agent Debate
Abstract: While state-of-the-art LLMs have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams (arXiv:2209.07858). One approach proposed to improve the general quality of LLM generations is multi-agent debate, in which LLMs self-evaluate through discussion and feedback (arXiv:2305.14325). We implement multi-agent debate between current state-of-the-art LLMs and evaluate each model's susceptibility to red-team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less-capable models are forced to debate with non-jailbroken or more-capable models. We also observe marginal improvements from multi-agent interaction in general. Finally, we classify adversarial prompt content via embedding clustering and analyze how susceptible different models are to different attack topics.
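To make the debate setup concrete, the sketch below shows one way to implement the round-based protocol the abstract describes: each agent first answers a (possibly adversarial) prompt independently, then revises its answer after seeing the other agents' responses. This is a minimal illustration rather than the authors' implementation; `query_model` is a hypothetical stand-in for whatever LLM inference call is used (e.g., a Llama 2 chat model), and the number of rounds and the revision-prompt wording are arbitrary choices.

```python
from typing import Callable, List

def debate(
    prompt: str,
    agents: List[str],
    query_model: Callable[[str, str], str],  # (agent_name, prompt) -> response; hypothetical stand-in
    rounds: int = 2,
) -> List[str]:
    """Run a simple multi-agent debate on a (possibly adversarial) prompt.

    Round 0: each agent answers the prompt independently.
    Later rounds: each agent sees the other agents' latest answers and
    revises its own answer in light of them.
    """
    # Initial independent answers.
    answers = [query_model(agent, prompt) for agent in agents]

    for _ in range(rounds):
        new_answers = []
        for i, agent in enumerate(agents):
            others = "\n\n".join(
                f"Agent {j + 1} said:\n{ans}"
                for j, ans in enumerate(answers) if j != i
            )
            revision_prompt = (
                f"Original question:\n{prompt}\n\n"
                f"Other agents' answers:\n{others}\n\n"
                "Considering these answers, provide an updated, safe, and "
                "helpful response to the original question."
            )
            new_answers.append(query_model(agent, revision_prompt))
        answers = new_answers

    return answers
```

The responses from single-agent and debate settings can then be scored with a toxicity classifier such as the Perspective API and compared. For the attack-topic analysis, a minimal version of the embedding-clustering step might look like the following; the specific embedding model (`all-MiniLM-L6-v2` via sentence-transformers) and cluster count are illustrative assumptions, not the paper's actual configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder adversarial prompts; in practice these would come from a
# red-team dataset such as the one released with arXiv:2209.07858.
prompts = [
    "Example red-team prompt about topic A ...",
    "Example red-team prompt about topic B ...",
]

# Embed the prompts and cluster them into attack-topic groups.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = embedder.encode(prompts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)
# cluster_ids[i] is the attack-topic cluster assigned to prompts[i];
# per-cluster toxicity can then be compared across models.
```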
References
- ChatEval: Towards better LLM-based evaluators through multi-agent debate.
- ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs.
- LM vs LM: Detecting factual errors via cross examination.
- PPT: Backdoor attacks on pre-trained models via poisoned prompt tuning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 680–686. International Joint Conferences on Artificial Intelligence Organization. Main Track.
- Improving factuality and reasoning in language models through multiagent debate.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models.
- Eric Hartford. 2023. Uncensored models.
- LoRA: Low-rank adaptation of large language models.
- Jigsaw. 2023. Perspective API. https://perspectiveapi.com/. [Online; accessed 23-October-2023].
- Large language models are zero-shot reasoners.
- LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.
- Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
- Self-Refine: Iterative refinement with self-feedback.
- Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
- Scalable and transferable black-box jailbreaks for language models via persona modulation.
- Llama 2: Open foundation and fine-tuned chat models.
- Jailbroken: How does LLM safety training fail?
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968, Online. Association for Computational Linguistics.
- Universal and transferable adversarial attacks on aligned language models.