WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Published 22 Apr 2025 in cs.CR and cs.AI | (2504.18575v3)

Abstract: Autonomous UI agents powered by AI have tremendous potential to boost human productivity by automating routine tasks such as filing taxes and paying bills. However, a major challenge in unlocking their full potential is security, which is exacerbated by the agent's ability to take action on their user's behalf. Existing tests for prompt injections in web agents either over-simplify the threat by testing unrealistic scenarios or giving the attacker too much power, or look at single-step isolated tasks. To more accurately measure progress for secure web agents, we introduce WASP -- a new publicly available benchmark for end-to-end evaluation of Web Agent Security against Prompt injection attacks. Evaluating with WASP shows that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections in very realistic scenarios. Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of the case, even state-of-the-art agents often struggle to fully complete the attacker goals -- highlighting the current state of security by incompetence.

Abstract PDF Upgrade to Chat

Summary

The paper introduces WASP, a benchmark evaluating web agents' resilience to prompt injection attacks using realistic simulation environments.
The methodology employs staged environments like GitLab and Reddit with diverse injection templates to assess both intermediate and end-to-end hijacking.
Results indicate that limited agent capabilities can inadvertently defend against attacks, emphasizing the need for robust instruction hierarchies and enhanced security prompts.

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Introduction

The paper "WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks" focuses on the security vulnerabilities of web navigation agents powered by LLMs when exposed to prompt injection attacks. These agents are susceptible to adversarial manipulations wherein external malicious prompts disrupt their intended operations.

Figure 1: Snapshot of the results on our benchmark. ASR--intermediate checks whether the agent backed with this model was hijacked and diverted from the original user objective, whereas ASR--end-to-end checks whether the attacker's goal was achieved or not.

Motivation and Background

Web navigation AI agents are increasingly deployed for text-based task automation through natural language processing capabilities. However, due to inherent vulnerabilities in LLMs, these agents risk being manipulated by adversaries embedding malicious instructions within webpages. This highlights the necessity to explore robust defense mechanisms against prompt injections that hijack agent decisions.

Methodology: WASP Benchmark

Overview

WASP introduces an isolated environment to evaluate web agent hijacking, offering a realistic benchmarking platform. By leveraging environments like GitLab and Reddit within VisualWebArena, WASP assesses agents against human-written malicious prompt injections across diverse operational scenarios without risking real user interactions or compromising live systems.

Figure 2: Screenshots of the websites after malicious prompts were injected. Attacker creates an issue on GitLab (left) or a post on Reddit (right) encouraging the agent to follow new instruction.

Attacker Goals

Attacker goals are framed around common security violations such as data exfiltration, access gain, and misinformation dissemination. These objectives provide clear benchmarks to evaluate whether an agent follows through with malicious instructions leading to genuine security concerns.

Prompt Injection Pipeline

Agents encounter two types of prompt injections: plain-text embedded malicious instructions and URL manipulations incorporating hidden adversary objectives. Templates for these scenarios are designed to critically evaluate the agent's decision-making capabilities under hostile circumstances.

Experimental Results

The experimental setup across various model architectures and prompt types illustrated vulnerabilities in agents’ ability to discern and execute legitimate requests. While intermediate hijacking (ASR--intermediate) was prevalent, comprehensive goal achievement by attackers (ASR--end-to-end) was less common, pointing to the role of agentic competence as a potential security buffer.

Figure 3: Flow of each of the 21 attacker goals across three main steps during agent execution: first action, intermediate steps, and final outcome.

Analysis and Mitigation

Agents demonstrated a tendency to switch objectives upon exposure to malicious prompts. Especially notable was the “security via incompetence” phenomenon, where limited agent capabilities inadvertently deter comprehensive adversary success. Proposed mitigations include deploying instruction hierarchy defenses and enhancing system prompts to deter attentional diversion by embedded threats.

Conclusion

WASP provides a critical testbed for technical scrutiny and improvement of web agents regarding prompt injection vulnerabilities. It highlights existing security gaps and challenges the research community to develop more robust defenses aligning technological advances in AI agents with rigorous security protocols.

Future Work

Expanding WASP to encompass more diverse environments and enhancing its attack prompt library will further refine agent security evaluations. Addressing limitations in current benchmarks can promote resilience in agentic systems against evolving security threats in dynamic web interactions.

Markdown