BrowseSafe-Bench: AI Browser Defense Benchmark

Updated 31 January 2026

BrowseSafe-Bench is a benchmark that tests prompt-injection defenses by exposing AI browser agents to realistic HTML attack scenarios with noise and complex structures.
It leverages nearly 15,000 labeled HTML payloads and an 11-type attack taxonomy across three semantic tiers to rigorously measure defense efficacy.
The benchmark addresses real-world challenges by incorporating adversarial techniques, distractor elements, and empirical evaluation metrics to advance browser agent security.

BrowseSafe-Bench is a comprehensive benchmark for evaluating prompt-injection defenses in AI-powered browser agents. Developed to reflect realistic, action-oriented attack scenarios, it systematically exposes models to HTML payloads with complexity and noise profiles characteristic of production web pages. Unlike prior benchmarks, BrowseSafe-Bench directly targets attacks that could influence real-world agent actions rather than merely provoking unauthorized textual output. Its construction draws from large-scale, anonymized tool-call outputs in deployed agents, features multiple attack strategies and semantic variations, and is designed to drive advances in the robustness of prompt-injection detection and mitigation within browser-based AI systems (Zhang et al., 25 Nov 2025).

1. Objectives and Real-World Scope

The primary goal of BrowseSafe-Bench is to measure the efficacy of prompt-injection defenses in AI browser agents when exposed to production-grade HTML, including realistic levels of semantic noise, nested structures, and benign distractors. The benchmark emphasizes detection of attacks capable of triggering agent actions rather than eliciting “jailbreak” responses.

Key objectives include:

Forcing models to go beyond simple keyword pattern matching, accommodating semantic, structural, and linguistic variation.
Exposing agents to adversarial content embedded in a variety of real-world HTML templates, leveraging an anonymized corpus of 100,000 tool-call outputs drawn from a live, production browser agent.
Covering five common web domains (workspace tools, education, social media, entertainment, e-commerce) and eight page/template styles, ranging from semantic HTML5 to attribute-heavy layouts.
Representing both fully compromised sites and malicious user-submitted content as adversary models.

2. Attack Taxonomy

BrowseSafe-Bench structures prompt-injection attacks into three semantic tiers comprising eleven canonical types. This taxonomy focuses on action-oriented impact and variation in evasion difficulty, mapped as predicates on the function HTML→{benign, malicious}.

Semantic Tiers and Example Injection Patterns

Tier	Number & Type	Illustration / Mechanism
Basic Attacks	5 patterns	HTML-comment override, hidden-input injection, data-attribute (e.g., `data-instruction`), ignore-previous, URL-segment injection
Advanced Attacks	4 architectural vectors	Role manipulation, system-prompt exfiltration, delimiter injection, social engineering
Sophisticated Attacks	2 evasion-focused	Indirect/hypothetical instructions; multilanguage (non-English) forms mimicking data-exfiltration

Examples:

HTML-comment override: $H$ 5
Data-attribute injection: $H$ 6
Advanced, system-prompt exfiltration: $H$ 7
Sophisticated, indirect hypothetical: $H$ 8

This taxonomy reflects a spectrum of attack surfaces that require both syntactic and semantic model vigilance.

3. Dataset Structure and Complexity

BrowseSafe-Bench comprises 14,719 labeled HTML payloads: 11,039 for training and 3,680 for testing. Each sample is labeled as benign or malicious, with malicious samples distributed nearly uniformly across the 11 attack types and 9 injection strategies.

Structural Complexity Metrics:

Given an HTML sample $H$ ,

Tag count: $T(H) = \text{number of }<\!tag\!>$ elements in $H$
Max nesting depth: $D(H) = \max_{n\in H}(\text{depth of node }n)$
Distractor count: $C_d(H) = \#\{\text{benign comments, hidden fields, data-* attrs}\}$
Distractor frequency: $f_d(H) = \frac{C_d(H)}{T(H)}$ Aggregated over $N$ samples:

$\overline{T} = \frac1N \sum_{i=1}^N T(H_i),\quad \overline{D} = \frac1N \sum_{i=1}^N D(H_i),\quad \overline{f_d} = \frac1N \sum_{i=1}^N f_d(H_i)$

In BrowseSafe-Bench, $\overline{T} \approx 150$ tags, $\overline{D} \approx 6$ levels, and $T(H) = \text{number of }<\!tag\!>$ 0, matching real-world deployment data.

The dataset’s complexity arises from HTML scaffolds sourced from authentic agent outputs, use of eight page/template styles, and aggressive distractor insertion in both malicious and benign samples.

4. Construction Methodology

BrowseSafe-Bench is built through a multi-step process designed to preserve ecological validity and adversarial sophistication:

Source Real Text: 100,000 anonymized tool-call outputs filtered by domain and length.
Template HTML Scaffolds: Wrapping text in eight distinct page styles to simulate structural diversity.
Distractor Insertion: Both automated LLM-based rewriting and programmatic augmentation (comments, hidden fields, accessibility metadata, fake tokens) to simulate benign environmental noise.
Malicious Injection: Employing both hidden metadata (HTML comments, data-* attributes, CSS-hidden spans) and LLM-driven visible rewrites across paragraphs, lists, footers, tables, and blockquotes.
Context-aware Generation: Domain and brand names are extracted; attacker domains are created by typosquatting; LLMs are given full-page context to ensure plausible, stealthy attack blending.
Quality Gating: Samples subjected to automated filters and human spot-checks for fluency, coherence, and stylistic consistency with intended attack types.

This methodology yields a benchmark whose distribution, distractor frequency, and attack sophistication reflect adversarial pressures encountered in real-world browser deployment.

5. Evaluation Metrics and Analysis

BrowseSafe-Bench supports standard binary classification metrics combined with security-specific measures:

Precision: $T(H) = \text{number of }<\!tag\!>$ 1
Recall: $T(H) = \text{number of }<\!tag\!>$ 2
F $T(H) = \text{number of }<\!tag\!>$ 3 Score: $T(H) = \text{number of }<\!tag\!>$ 4
Balanced Accuracy: $T(H) = \text{number of }<\!tag\!>$ 5, with $T(H) = \text{number of }<\!tag\!>$ 6
Refusals: Count of model refusals to classify, treated as positives for harm assessment.

These metrics capture tradeoffs between false-positive burden (Precision, Specificity) and catch-rate (Recall), under realistic noise introduced by distractors ( $T(H) = \text{number of }<\!tag\!>$ 7) and structurally complex HTML. Degradation in detection accuracy is observed as distractor frequency increases (accuracy from 90.2% with 0 distractors to ≈81% with 3+ distractors).

6. Use in Defense Evaluation and Key Empirical Findings

BrowseSafe-Bench enables empirical assessment across a representative suite of models and defense architectures:

Specialized classifiers: PromptGuard-2 (22M/86M), gpt-oss-safeguard (20B/120B)
Closed-weight LLMs: GPT-5 (mini, Low/Medium/High), Haiku 4.5, Sonnet 4.5 (varied context sizes and operating modes)
BrowseSafe detector: Qwen3-30B, fine-tuned with chunking

Evaluation Protocol:

Full-length HTML (up to 80K tokens, untruncated) per sample
Chunking for outputs over threshold $T(H) = \text{number of }<\!tag\!>$ 8: $T(H) = \text{number of }<\!tag\!>$ 9
Threshold tuning for 1% FPR
Refusals tracked and penalized

Summary of Key Results:

PromptGuard-2: Recall ≈ 0.22, F $H$ 0 ≈ 0.35
Large general LLMs (GPT-5, Sonnet 4.5): F $H$ 1 ≈ 0.84–0.86, but with latency of 2–36 seconds; frequent refusals (Sonnet)
gpt-oss-safeguard (20B): F $H$ 2 ≈ 0.79–0.80
BrowseSafe (fine-tuned Qwen3-30B): F $H$ 3 = 0.904, BalancedAcc = 0.912, Recall = 0.841, Precision = 0.978, inference < 1 s

Difficulty Analysis:

Multilanguage attacks: hardest (BalancedAcc ≈ 76%)
System-exfiltration: easiest (≈ 85%)
Visible, context-aware rewrites (footer, table): hardest strategy
Hidden metadata: easiest strategy
Explicit instructions: 84.9% accuracy; indirect: 77.1%; stealth: 74.6%
Distractor frequency: negatively correlated with detection accuracy

Generalization:

Held-out URLs: F $H$ 4 = 0.935
Held-out attack types: 0.863
Held-out injection strategies: 0.788

Empirical findings reveal persistent vulnerabilities in state-of-the-art LLMs to subtle, semantically blended prompt injections; overfitting in specialized classifiers; and that defense-in-depth architectures — involving chunking, conservative logical aggregation, calibrated thresholds, and context-neutral intervention — provide substantial gains in robustness and latency.

7. Community Release and Impact

BrowseSafe-Bench is freely available to the research community, supporting reproducible and comparative assessment of prompt-injection defenses for browser-based AI agents. Its design, rooted in production data and attack diversity, provides a rigorous multidimensional challenge that:

Exposes limitations of both small, pattern-based safety models and large, generic LLMs,
Benchmarks progress in handling semantic variation and web-scale HTML noise,
Enables the community to prototype, test, and iterate on defense-in-depth strategies reflecting real operational risk (Zhang et al., 25 Nov 2025).

A plausible implication is that future improvements in AI web agent security will increasingly depend on benchmarks that incorporate both ecological validity and adversary-aware scenario design at web scale.

Markdown Report Issue Upgrade to Chat

References (1)

BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to BrowseSafe-Bench.