CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models

Published 20 Feb 2025 in cs.CL and cs.AI | (2502.14529v1)

Abstract: LLM-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of Corba in complex topology structures and open-source models. Our code is available at: https://github.com/zhrli324/Corba.

Abstract PDF Upgrade to Chat

Summary

The paper presents CORBA, a novel attack combining blocking and contagious strategies to disrupt multi-agent systems based on large language models.
It demonstrates that CORBA can achieve nearly 100% Proportional Attack Success Rate within 20 turns across various topologies on platforms like AutoGen and Camel.
Existing defenses, including LLM evaluation and perplexity checks, fail to detect CORBA, highlighting the urgent need for more robust security measures.

CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems

The paper "CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on LLMs" (2502.14529) introduces a novel attack paradigm, Contagious Recursive Blocking Attacks (Corba), targeting LLM-based Multi-Agent Systems (LLM-MASs). This attack aims to disrupt agent interactions and deplete computational resources by leveraging contagious and recursive properties. The research highlights the vulnerability of existing LLM-MAS frameworks, such as AutoGen and Camel, to blocking attacks, emphasizing the need for robust security measures.

Background and Motivation

LLM-MASs have shown promise in solving complex tasks through collaborative agent interactions. However, the security and robustness of these systems remain largely unexplored. Existing safety mechanisms primarily focus on preventing harmful content generation, neglecting the potential for blocking attacks that reduce system availability and waste computational resources. Corba addresses this gap by introducing a simple yet effective attack that exploits the communication-centric nature of LLM-MASs. By propagating a malicious prompt throughout the system, Corba induces a recursive blocking state in individual agents, ultimately leading to a system-wide disruption.

Corba Methodology

Corba combines the principles of blocking attacks and contagious attacks to achieve its disruptive effect. A blocking attack $B$ is defined as an attack that causes an agent $a_b$ to consistently produce a blocking response $R_m$ from time $t_m$ onward when fed with a malicious prompt $P_m$ . This blocking state is maintained recursively through a self-loop, ensuring the agent remains blocked. Formally, this can be expressed as:

$\forall t \geq t_m, \quad a_b^t(P_m) = R_m$

$\forall l \geq 0, \quad a_b^{t_{m+l}(P_m^{m+l}) = R_m^{m+l} \Leftrightarrow P_m^{m+l}$

$R_m^{m+l} \xrightarrow{(a_b, a_b)} a_b^{t_{m+l+1}(P_m^{m+l})$

A contagious attack $C$ extends this concept to the multi-agent level, where the malicious prompt $P_m$ propagates through the system. If an agent $a_c$ is infected, it generates a blocking response $R_m$ and transmits the malicious prompt to its neighbors $a_{c^\prime}$ . This can be formally defined as:

$a_c^{t_{m+l}(P_m^{m+l}) = R_m^{m+l} \Leftrightarrow P_m^{m+l}$

$R_m^{m+l} \xrightarrow{(a_c, a_{c^\prime})} a_{c^\prime}^{t_{m+l+1}(P_m^{m+l})$

Corba integrates both blocking attack $B$ and contagious attack $C$ , ensuring both individual agent blocking and propagation to neighboring agents. This is defined as:

$a_\mathbb{C}^{t_{m+l}(P_m^{m+l}) = R_m^{m+l} \Leftrightarrow P_m^{m+l}$

$R_m^{m+l} \xrightarrow{(a_\mathbb{C}, a_\mathbb{C})} a_{\mathbb{C}^{t_{m+l+1}(P_m^{m+l})$

$R_m^{m+l} \xrightarrow{(a_\mathbb{C}, a_{\mathbb{C'})} a_{\mathbb{C'}^{t_{m+l+1}(P_m^{m+l})$

The attack prompt $P_{\mathbb{C}}$ of Corba ensures that every reachable agent enters a blocked state after a finite number of time steps $d$ :

$\forall a_r \in \mathcal{R}(a_b), \quad \exists d \geq 0, \quad \text{s.t.} \quad a_r^{t_m + d}(P_{\mathbb{C}) = R_{\mathbb{C} \Leftrightarrow P_{\mathbb{C}}$

Figure 1: Complete illustration of Corba attack. In Step 1, the LLM-MAS operates normally; in Step 2, the entry agent is injected with the Corba prompt and begins to propagate the virus; in Step 3, an increasing number of agents become blocked and spread the virus; in Step 4, all agents are infected, resulting in complete blockage of the LLM-MAS.

Experimental Evaluation

The authors evaluated Corba on AutoGen and Camel frameworks using various LLMs, including GPT-4o-mini, GPT-4, GPT-3.5-turbo, and Gemini-2.0-Flash. The experiments employed two metrics: Proportional Attack Success Rate (P-ASR), which measures the proportion of blocked agents, and Peak Blocking Turn Number (PTN), which evaluates the attack speed. The baseline method used for comparison was a prompt injection technique that induces agents to repeat the last action indefinitely and spread this behavior to other agents.

The results indicate that Corba consistently outperforms the baseline method in reducing the availability of LLM-MASs and wasting computational resources. The experiments were conducted across various topology structures, including chain, cycle, tree, star, and random topologies. The data reveals that Corba remains effective across non-trivial topologies. The experiments extend to open-ended LLM-MASs, where agents engage in free-form dialogues. The results demonstrate that Corba spreads rapidly, achieving nearly 100% P-ASR within 20 turns.

Figure 2: P-ASR (\%) on Open-ended LLM-MASs with various LLMs. An Open-ended LLM-MAS with six agents in free dialogue was evaluated at specific turns. Results show that Corba outperforms baselines, compromising most agents within a few turns.

Defense Evaluation

The paper also explores the efficacy of existing defense mechanisms against Corba. Direct LLM evaluation, where the blocking attack prompt $P_m$ is checked for malicious content, shows limited success in detecting Corba compared to baseline methods. Similarly, integrating LLM-based evaluation into the multi-agent system workflow yields a low interception success rate. Perplexity-based detection, a common method for identifying LLM jailbreaks, reveals that Corba's prompts exhibit perplexity scores nearly identical to normal statements, making them difficult to detect. These findings suggest that existing defense mechanisms are inadequate for mitigating Corba.

Figure 3: LLM Checker for several Attacks.

Figure 4: Agent Monitor for several Attacks.

Conclusion and Future Directions

The paper effectively demonstrates the vulnerability of LLM-MASs to contagious recursive blocking attacks. Corba's ability to propagate through various topologies and evade existing defense mechanisms highlights the need for stronger security measures. The study primarily focuses on exposing these vulnerabilities rather than developing mitigation strategies. Future research should focus on designing effective defense mechanisms to prevent blocking attacks in LLM-MASs. This may involve developing more sophisticated anomaly detection techniques, implementing robust input validation mechanisms, and exploring methods for isolating and containing malicious agents.

Markdown