- The paper introduces Agent Security Bench (ASB), a framework that formalizes and benchmarks diverse attack types on LLM-based agents.
- It evaluates 23 attack/defense methods across 10 scenarios and 13 LLM architectures using 8 metrics, noting average attack success rates exceeding 84.30%.
- The study reveals that current defenses, including paraphrasing and LLM-based detection, are largely ineffective, urging improvements in AI agent security.
Introduction
The emergence of LLMs as core components of AI agents has enabled these systems to interact with external tools and utilize memory mechanisms to tackle complex tasks across various domains. However, these agents are susceptible to numerous security vulnerabilities, which have not been thoroughly evaluated in existing literature. The paper introduces the Agent Security Bench (ASB), a comprehensive framework for formalizing, benchmarking, and evaluating both attacks and defenses targeting LLM-based agents across ten diverse scenarios.
LLM Agent Attacking Framework
LLM-based agents function through defined processes involving system prompts, user task instructions, memory retrievals, and tool execution. This structured operation can be exploited at different stages, making them vulnerable to a variety of attacks.
Figure 1: Overview of the LLM Agent Attacking Framework, including Direct Prompt Injections (DPI), Observation Prompt Injections (OPI), Plan-of-Thought (PoT) @@@@10@@@@, and Memory Poisoning Attacks, which target the user query, observations, system prompts, and memory retrieval respectively during action planning and execution.
Key attack types include:
- Direct Prompt Injections (DPI): Modifications directly to the user prompt.
- Observation Prompt Injections (OPI): Manipulations within observation data retrieved during task execution.
- Plan-of-Thought (PoT) Backdoor Attacks: Utilizing hidden instructions within system prompts to perform unintended actions.
- Memory Poisoning Attacks: Inserting malicious plans within the agent's memory, thereby corrupting future decision-making processes.
Figure 2: Illustration of four attack types targeting LLM agents. Direct Prompt Injections (DPI) manipulate the user prompt, Observation Prompt Injections (OPI) alter observation data to interfere with later actions, Plan-of-Thought (PoT) Backdoor Attack triggers hidden actions upon specific inputs, and Memory Poisoning Attack injects malicious plans into the agent's memory, causing the agent to utilize attacker-specified tools.
These attacks span multiple scenarios, showcasing significant attack success rates, exceeding 84.30% on average, while highlighting limited efficacy of existing defense mechanisms.
Evaluation and Benchmarking Setup
The ASB framework comprises an extensive evaluation environment that includes:
- 10 Scenarios covering various fields like finance and autonomous driving, utilizing over 400 tools.
- 23 Attack/Defense Methods evaluated across 13 different LLM architectures including popular models like LLaMA3 and GPT-3.5 Turbo.
- 8 Evaluation Metrics to measure effectiveness, such as attack success rates (ASR) and refuse rates, indicating the agent's ability to reject unsafe requests.
Benchmark results underscore the critical vulnerabilities at different operational stages of LLM agents. Despite numerous defense strategies, current methodologies are largely ineffective against sophisticated attacks, leading to high average attack success rates.
Practical Implementations and Defenses
While existing defenses are inadequate, several techniques are explored to ameliorate these vulnerabilities:
Conclusion
ASB highlights the pressing need for advancements in securing LLM-based agents, serving as a pivotal resource for developing resilient defenses. The paper suggests focusing future efforts on enhancing the robustness of AI agents against increasingly complex adversarial strategies.
The work emphasizes the necessity for pioneering robust defense mechanisms to safeguard LLM-based agents, to ensure secure deployment in critical application domains.