Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Published 3 Oct 2024 in cs.CR and cs.AI | (2410.02644v4)

Abstract: Although LLM-based agents, powered by LLMs, can use external tools and memory mechanisms to solve complex real-world tasks, they may also introduce critical security vulnerabilities. However, the existing literature does not comprehensively evaluate attacks and defenses against LLM-based agents. To address this, we introduce Agent Security Bench (ASB), a comprehensive framework designed to formalize, benchmark, and evaluate the attacks and defenses of LLM-based agents, including 10 scenarios (e.g., e-commerce, autonomous driving, finance), 10 agents targeting the scenarios, over 400 tools, 27 different types of attack/defense methods, and 7 evaluation metrics. Based on ASB, we benchmark 10 prompt injection attacks, a memory poisoning attack, a novel Plan-of-Thought backdoor attack, 4 mixed attacks, and 11 corresponding defenses across 13 LLM backbones. Our benchmark results reveal critical vulnerabilities in different stages of agent operation, including system prompt, user prompt handling, tool usage, and memory retrieval, with the highest average attack success rate of 84.30\%, but limited effectiveness shown in current defenses, unveiling important works to be done in terms of agent security for the community. We also introduce a new metric to evaluate the agents' capability to balance utility and security. Our code can be found at https://github.com/agiresearch/ASB.

Abstract PDF HTML Upgrade to Chat

Citations (1)

View on Semantic Scholar

Summary

The paper introduces Agent Security Bench (ASB), a framework that formalizes and benchmarks diverse attack types on LLM-based agents.
It evaluates 23 attack/defense methods across 10 scenarios and 13 LLM architectures using 8 metrics, noting average attack success rates exceeding 84.30%.
The study reveals that current defenses, including paraphrasing and LLM-based detection, are largely ineffective, urging improvements in AI agent security.

"Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents" (2410.02644)

Introduction

The emergence of LLMs as core components of AI agents has enabled these systems to interact with external tools and utilize memory mechanisms to tackle complex tasks across various domains. However, these agents are susceptible to numerous security vulnerabilities, which have not been thoroughly evaluated in existing literature. The paper introduces the Agent Security Bench (ASB), a comprehensive framework for formalizing, benchmarking, and evaluating both attacks and defenses targeting LLM-based agents across ten diverse scenarios.

LLM Agent Attacking Framework

LLM-based agents function through defined processes involving system prompts, user task instructions, memory retrievals, and tool execution. This structured operation can be exploited at different stages, making them vulnerable to a variety of attacks.

Figure 1: Overview of the LLM Agent Attacking Framework, including Direct Prompt Injections (DPI), Observation Prompt Injections (OPI), Plan-of-Thought (PoT) @@@@10@@@@, and Memory Poisoning Attacks, which target the user query, observations, system prompts, and memory retrieval respectively during action planning and execution.

Key attack types include:

Direct Prompt Injections (DPI): Modifications directly to the user prompt.
Observation Prompt Injections (OPI): Manipulations within observation data retrieved during task execution.
Plan-of-Thought (PoT) Backdoor Attacks: Utilizing hidden instructions within system prompts to perform unintended actions.
Memory Poisoning Attacks: Inserting malicious plans within the agent's memory, thereby corrupting future decision-making processes.
Figure 2: Illustration of four attack types targeting LLM agents. Direct Prompt Injections (DPI) manipulate the user prompt, Observation Prompt Injections (OPI) alter observation data to interfere with later actions, Plan-of-Thought (PoT) Backdoor Attack triggers hidden actions upon specific inputs, and Memory Poisoning Attack injects malicious plans into the agent's memory, causing the agent to utilize attacker-specified tools.

These attacks span multiple scenarios, showcasing significant attack success rates, exceeding 84.30% on average, while highlighting limited efficacy of existing defense mechanisms.

Evaluation and Benchmarking Setup

The ASB framework comprises an extensive evaluation environment that includes:

10 Scenarios covering various fields like finance and autonomous driving, utilizing over 400 tools.
23 Attack/Defense Methods evaluated across 13 different LLM architectures including popular models like LLaMA3 and GPT-3.5 Turbo.
8 Evaluation Metrics to measure effectiveness, such as attack success rates (ASR) and refuse rates, indicating the agent's ability to reject unsafe requests.

Benchmark results underscore the critical vulnerabilities at different operational stages of LLM agents. Despite numerous defense strategies, current methodologies are largely ineffective against sophisticated attacks, leading to high average attack success rates.

Practical Implementations and Defenses

While existing defenses are inadequate, several techniques are explored to ameliorate these vulnerabilities:

Paraphrasing and Delimiters: Attempt to neutralize injected instructions, albeit with limited success in reducing ASR significantly.
LLM-based Detection and Shuffle Algorithms: For memory poisoning and PoT attacks respectively, these have shown potential yet need refinement for practical application.
Figure 3: LLM-based Defense Result for Memory Attack. The defense mechanisms against memory attacks are largely ineffective.

Conclusion

ASB highlights the pressing need for advancements in securing LLM-based agents, serving as a pivotal resource for developing resilient defenses. The paper suggests focusing future efforts on enhancing the robustness of AI agents against increasingly complex adversarial strategies.

The work emphasizes the necessity for pioneering robust defense mechanisms to safeguard LLM-based agents, to ensure secure deployment in critical application domains.

Markdown