Combating Adversarial Attacks with Multi-Agent Debate
Abstract: While state-of-the-art LLMs have achieved impressive results, they remain susceptible to inference-time adversarial attacks, such as adversarial prompts generated by red teams (arXiv:2209.07858). One approach proposed to improve the general quality of LLM generations is multi-agent debate, in which LLMs self-evaluate through discussion and feedback (arXiv:2305.14325). We implement multi-agent debate between current state-of-the-art LLMs and evaluate each model's susceptibility to red-team attacks in both single- and multi-agent settings. We find that multi-agent debate can reduce model toxicity when jailbroken or less-capable models are forced to debate with non-jailbroken or more-capable models. We also observe marginal improvements from multi-agent interaction in general. Finally, we classify adversarial prompt content via embedding clustering and analyze how susceptible different models are to different attack topics.
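To make the debate setup concrete, the sketch below shows one way to implement the round-based protocol the abstract describes: each agent first answers a (possibly adversarial) prompt independently, then revises its answer after seeing the other agents' responses. This is a minimal illustration rather than the authors' implementation; `query_model` is a hypothetical stand-in for whatever LLM inference call is used (e.g., a Llama 2 chat model), and the number of rounds and the revision-prompt wording are arbitrary choices.

```python
from typing import Callable, List

def debate(
    prompt: str,
    agents: List[str],
    query_model: Callable[[str, str], str],  # (agent_name, prompt) -> response; hypothetical stand-in
    rounds: int = 2,
) -> List[str]:
    """Run a simple multi-agent debate on a (possibly adversarial) prompt.

    Round 0: each agent answers the prompt independently.
    Later rounds: each agent sees the other agents' latest answers and
    revises its own answer in light of them.
    """
    # Initial independent answers.
    answers = [query_model(agent, prompt) for agent in agents]

    for _ in range(rounds):
        new_answers = []
        for i, agent in enumerate(agents):
            others = "\n\n".join(
                f"Agent {j + 1} said:\n{ans}"
                for j, ans in enumerate(answers) if j != i
            )
            revision_prompt = (
                f"Original question:\n{prompt}\n\n"
                f"Other agents' answers:\n{others}\n\n"
                "Considering these answers, provide an updated, safe, and "
                "helpful response to the original question."
            )
            new_answers.append(query_model(agent, revision_prompt))
        answers = new_answers

    return answers
```

The responses from single-agent and debate settings can then be scored with a toxicity classifier such as the Perspective API and compared. For the attack-topic analysis, a minimal version of the embedding-clustering step might look like the following; the specific embedding model (`all-MiniLM-L6-v2` via sentence-transformers) and cluster count are illustrative assumptions, not the paper's actual configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Placeholder adversarial prompts; in practice these would come from a
# red-team dataset such as the one released with arXiv:2209.07858.
prompts = [
    "Example red-team prompt about topic A ...",
    "Example red-team prompt about topic B ...",
]

# Embed the prompts and cluster them into attack-topic groups.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
embeddings = embedder.encode(prompts)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)
# cluster_ids[i] is the attack-topic cluster assigned to prompts[i];
# per-cluster toxicity can then be compared across models.
```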
References
- ChatEval: Towards better LLM-based evaluators through multi-agent debate.
- ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs.
- LM vs LM: Detecting factual errors via cross examination.
- PPT: Backdoor attacks on pre-trained models via poisoned prompt tuning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 680–686. International Joint Conferences on Artificial Intelligence Organization. Main Track.
- Improving factuality and reasoning in language models through multiagent debate.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models.
- Eric Hartford. 2023. Uncensored models.
- LoRA: Low-rank adaptation of large language models.
- Jigsaw. 2023. Perspective API. https://perspectiveapi.com/. [Online; accessed 23-October-2023].
- Large language models are zero-shot reasoners.
- LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.
- Jailbreaking ChatGPT via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
- Self-Refine: Iterative refinement with self-feedback.
- Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
- Scalable and transferable black-box jailbreaks for language models via persona modulation.
- Llama 2: Open foundation and fine-tuned chat models.
- Jailbroken: How does LLM safety training fail?
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968, Online. Association for Computational Linguistics.
- Universal and transferable adversarial attacks on aligned language models.