Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification

Published 30 Jul 2024 in cs.CR and cs.LG | (2407.20859v1)

Abstract: Recently, autonomous agents built on LLMs have experienced significant development and are being deployed in real-world applications. These agents can extend the base LLM's capabilities in multiple ways. For example, a well-built agent using GPT-3.5-Turbo as its core can outperform the more advanced GPT-4 model by leveraging external components. More importantly, the usage of tools enables these systems to perform actions in the real world, moving from merely generating text to actively interacting with their environment. Given the agents' practical applications and their ability to execute consequential actions, it is crucial to assess potential vulnerabilities. Such autonomous systems can cause more severe damage than a standalone LLM if compromised. While some existing research has explored harmful actions by LLM agents, our study approaches the vulnerability from a different perspective. We introduce a new type of attack that causes malfunctions by misleading the agent into executing repetitive or irrelevant actions. We conduct comprehensive evaluations using various attack methods, surfaces, and properties to pinpoint areas of susceptibility. Our experiments reveal that these attacks can induce failure rates exceeding 80\% in multiple scenarios. Through attacks on implemented and deployable agents in multi-agent scenarios, we accentuate the realistic risks associated with these vulnerabilities. To mitigate such attacks, we propose self-examination detection methods. However, our findings indicate these attacks are difficult to detect effectively using LLMs alone, highlighting the substantial risks associated with this vulnerability.

Abstract PDF Upgrade to Chat

References (51)

Citations (4)

View on Semantic Scholar

Summary

The paper demonstrates that LLM agents are vulnerable to malfunction amplification attacks, significantly impairing their performance.
It employs prompt injection, adversarial perturbation, and multi-agent simulations to evaluate disruption rates exceeding 80% in certain settings.
The study highlights the need for improved detection methods beyond simple self-assessment to safeguard autonomous LLM agent functionality.

Malfunction Amplification as a Threat to LLM Agents

This paper (2407.20859) introduces a novel attack vector against LLM agents, shifting the focus from eliciting harmful behaviors to inducing malfunctions that disrupt normal operations. The study presents a comprehensive evaluation framework to assess the robustness of LLM agents against these attacks, highlighting vulnerabilities across various dimensions such as attack types, methods, and surfaces. The findings suggest that current LLM agents are susceptible to malfunction amplification, with failure rates exceeding 80% in certain scenarios, and that these attacks are difficult to detect effectively using LLMs alone.

LLM Agent Vulnerabilities

The paper identifies that while LLM agents possess powerful automation capabilities, their performance stability remains a concern. It examines how adversarial inputs can exacerbate inherent instabilities in LLM agents, leading to logic errors such as infinite loops or incorrect function executions. This approach draws inspiration from denial-of-service attacks in web security, aiming to render LLM agents ineffective by disrupting their normal functioning rather than inducing overtly harmful actions.

Figure 1: The overview of our attack which exacerbates the instabilities of LLM agents.

The attack methodology involves various techniques, including prompt injection, adversarial perturbation, and adversarial demonstration. Prompt injection, which inserts adversarial commands within user inputs, is found to be particularly effective. Adversarial perturbations, which add noise to the input to disrupt normal response generation using methods like SCPN, VIPER, and GCG, show limited success. Adversarial demonstrations, which provide intentionally incorrect examples to mislead the agent, also prove ineffective.

Multi-Agent Scenario and Realistic Risks

The paper extends the attack scenarios to multi-agent environments, simulating realistic situations where a compromised agent can influence others. As shown in \autoref{figure:advanced}, this can lead to resource wastage or the execution of irrelevant tasks across the agent network.

Figure 2: Advanced attack in multi-agent scenario.

In these advanced attacks, the attacker can embed the targeted benign action in one agent before it communicates with downstream agents, potentially manipulating the system into spamming or other detrimental behaviors.

Defense Mechanisms and Limitations

To mitigate the proposed attacks, the paper explores self-examination detection methods that leverage the LLMs' capability for self-assessment. However, the results indicate that these attacks are more difficult to detect compared to prior approaches that sought overtly harmful actions. Even with enhanced defense mechanisms, the attacks remain effective, highlighting the importance of fully understanding this vulnerability.

Figure 3: Average success rate of infinite loop prompt injection attacks on the agents that are built with the given toolkit.

Further investigation reveals that the tools employed by various agents can influence their susceptibility to attacks. Some toolkits are found to be particularly prone to manipulation, as illustrated in \autoref{figure:toolkit_asr}, while the number of tools or toolkits used in constructing an agent does not strongly correlate with susceptibility.

Evaluation Methodology

The paper employs two evaluation settings: an agent emulator for large-scale batch experiments and case studies with fully implemented agents (Gmail and CSV agents). The agent emulator, which simulates interactions between LLM agents and external components, allows for efficient testing across various toolkits and scenarios. The case studies, using LangChain, provide a realistic assessment of attack performance on implemented agents.

Figure 4: Number of agents in the emulator that is built utilizing the given toolkit.

The evaluation metric focuses on the agent's task performance, measuring the rate of failures or the attack success rate (ASR). The results show that direct manipulations of user input are the most potent, though intermediate outputs from the tools can occasionally enhance certain attacks.

Impact of Tools and Toolkits

The integration of external toolkits and functions is a key aspect of LLM agents, and the research examines whether the usage of certain tools affects the overall attack performance. While the number of toolkits does not show strong correlations with the agent's failure rate, some toolkits make agents easier to manipulate.

Figure 5: Average attack success rate based on the number of tools available in the LLM agent.

However, similar to toolkits, the number of tools used in an Agent does not strongly correlate with the attack success rate, as shown in \autoref{figure:tool}.

Conclusion

The paper identifies a significant vulnerability in LLM agents: their susceptibility to malfunction-inducing attacks. The authors show that these attacks can be deployed to disrupt normal operations and propagate between different agents, leading to resource wastage or execution of irrelevant tasks. They emphasize that simply improving core capabilities can mitigate some attacks, but prompt injection attacks still achieve a relatively high success rate. The authors' results also suggest that the attack is more difficult to detect through simple self-examinations. Overall, the study contributes to a better understanding of the risks associated with these advanced systems, potentially paving the way for more effective safeguard systems in the future.

Markdown