- The paper introduces WASP, a benchmark evaluating web agents' resilience to prompt injection attacks using realistic simulation environments.
- The methodology employs staged environments like GitLab and Reddit with diverse injection templates to assess both intermediate and end-to-end hijacking.
- Results indicate that limited agent capabilities can inadvertently defend against attacks, emphasizing the need for robust instruction hierarchies and enhanced security prompts.
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Introduction
The paper "WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks" focuses on the security vulnerabilities of web navigation agents powered by LLMs when exposed to prompt injection attacks. These agents are susceptible to adversarial manipulations wherein external malicious prompts disrupt their intended operations.
Figure 1: Snapshot of the results on our benchmark. ASR--intermediate checks whether the agent backed with this model was hijacked and diverted from the original user objective, whereas ASR--end-to-end checks whether the attacker's goal was achieved or not.
Motivation and Background
Web navigation AI agents are increasingly deployed for text-based task automation through natural language processing capabilities. However, due to inherent vulnerabilities in LLMs, these agents risk being manipulated by adversaries embedding malicious instructions within webpages. This highlights the necessity to explore robust defense mechanisms against prompt injections that hijack agent decisions.
Methodology: WASP Benchmark
Overview
WASP introduces an isolated environment to evaluate web agent hijacking, offering a realistic benchmarking platform. By leveraging environments like GitLab and Reddit within VisualWebArena, WASP assesses agents against human-written malicious prompt injections across diverse operational scenarios without risking real user interactions or compromising live systems.

Figure 2: Screenshots of the websites after malicious prompts were injected. Attacker creates an issue on GitLab (left) or a post on Reddit (right) encouraging the agent to follow new instruction.
Attacker Goals
Attacker goals are framed around common security violations such as data exfiltration, access gain, and misinformation dissemination. These objectives provide clear benchmarks to evaluate whether an agent follows through with malicious instructions leading to genuine security concerns.
Prompt Injection Pipeline
Agents encounter two types of prompt injections: plain-text embedded malicious instructions and URL manipulations incorporating hidden adversary objectives. Templates for these scenarios are designed to critically evaluate the agent's decision-making capabilities under hostile circumstances.
Experimental Results
The experimental setup across various model architectures and prompt types illustrated vulnerabilities in agents’ ability to discern and execute legitimate requests. While intermediate hijacking (ASR--intermediate) was prevalent, comprehensive goal achievement by attackers (ASR--end-to-end) was less common, pointing to the role of agentic competence as a potential security buffer.
Figure 3: Flow of each of the 21 attacker goals across three main steps during agent execution: first action, intermediate steps, and final outcome.
Analysis and Mitigation
Agents demonstrated a tendency to switch objectives upon exposure to malicious prompts. Especially notable was the “security via incompetence” phenomenon, where limited agent capabilities inadvertently deter comprehensive adversary success. Proposed mitigations include deploying instruction hierarchy defenses and enhancing system prompts to deter attentional diversion by embedded threats.
Conclusion
WASP provides a critical testbed for technical scrutiny and improvement of web agents regarding prompt injection vulnerabilities. It highlights existing security gaps and challenges the research community to develop more robust defenses aligning technological advances in AI agents with rigorous security protocols.
Future Work
Expanding WASP to encompass more diverse environments and enhancing its attack prompt library will further refine agent security evaluations. Addressing limitations in current benchmarks can promote resilience in agentic systems against evolving security threats in dynamic web interactions.