PromptSleuth-Bench: Evaluating Prompt Injection

Updated 13 January 2026

PromptSleuth-Bench is a comprehensive evaluation suite designed to detect prompt injection in large language models using systematic, multi-task adversarial tests.
It categorizes attacks into system prompt forgery, user prompt camouflage, and model behavior manipulation, stratified into easy, medium, and hard difficulty tiers.
The benchmark employs detection-centric metrics like FPR and FNR for rigorous, cross-method comparisons, exposing weaknesses in template-based defenses.

PromptSleuth-Bench is a comprehensive, systematically-constructed evaluation suite for prompt injection detection in LLMs. It is designed to subsume and extend existing benchmarks by introducing new manipulation techniques, difficulty tiers, and multi-task adversarial scenarios, enabling rigorous comparison of prompt-injection defenses across a broad spectrum of attack classes and evaluation conditions (Wang et al., 28 Aug 2025).

1. Development and Motivation

Prompt injection (PI)—the exploitation of LLM prompt interpretation via crafted adversarial inputs—has escalated in significance due to the rapidly increasing adoption of LLMs in interactive and autonomous systems. Existing benchmarks (e.g., DataSentinel-Bench and AgentDojo) exhibit limited coverage: typically, they focus either on single-task adversarial modifications or tool-augmented, multi-step agent tasks, without systematically targeting emerging PI strategies. PromptSleuth-Bench was constructed to explicitly address these deficiencies. Its objectives are to (i) provide an encompassing attack taxonomy, (ii) enforce coverage of both single- and multi-task use cases, (iii) stress-test defenses under compositional, intent-camouflaged, and multi-segment injection attempts, and (iv) standardize cross-method comparison via detection-oriented metrics (Wang et al., 28 Aug 2025).

2. Benchmark Construction and Taxonomy

PromptSleuth-Bench is defined as a superset of two influential prior efforts: DataSentinel-Bench, which focuses on canonical single-task PI, and AgentDojo, which emphasizes dynamic, multi-step agent tasks.

Attack taxonomy in PromptSleuth-Bench is organized into three principal categories:

System Prompt Forgery: Adversarially forging or appending to privileged system messages, bypassing standard user input boundaries.
User Prompt Camouflage: Concealing malicious instructions within otherwise benign-looking user text via:
- Context Tampering (embedding covert directives in innocuous completions)
- Instruction Wrapping (hiding payloads inside structured formats, e.g., JSON, XML)
- Payload Splitting (fragmenting an attack across prompt segments, reconstructing only at inference)
Model Behavior Manipulation: Crafting inputs that induce behavioral deviation through mechanisms such as emotional appeals, reward framing, threat-based coercion, or narrative tampering.

PromptSleuth-Bench extends the attack space by incorporating new variants for each of these categories, specifically targeting both surface-level cues and semantic invariance. Notably, no pseudocode is supplied for attack generation—rather, attacks are defined by their conversational and structural patterns (Wang et al., 28 Aug 2025).

Multi-Task Adversarial Scenarios

A signature innovation of PromptSleuth-Bench is the inclusion of multi-task (‘Hard’ tier) adversarial prompts, which embed several legitimate and adversarial tasks in a single user message. For example, prompts may request a translation, insert a covert system instruction, then demand an unrelated summarization. These scenarios necessitate that detection frameworks reason about true user intent, not just lexical or syntactic artifacts.

Difficulty Tiers

PromptSleuth-Bench stratifies attacks into:

Easy: Canonical, single-task modifications (e.g., instruction override)
Medium: Single-task but with advanced, structurally involved modifications (e.g., emotional manipulation, instruction wrapping)
Hard: Multi-task adversarial prompts with both new and pre-existing techniques intermixed

3. Evaluation Protocol and Metrics

PromptSleuth-Bench emphasizes robust, adversarially meaningful evaluation. It adopts detection-centric metrics:

False Positive Rate (FPR): $|\{\text{benign prompts flagged as malicious}\}|\,/\,|\{\text{benign prompts}\}|$
False Negative Rate (FNR): $|\{\text{malicious prompts missed}\}|\,/\,|\{\text{malicious prompts}\}|$

These metrics are reported per defense method and per difficulty tier. While precision, recall, and $F_1$ -score can be computed from the binary outcomes, the principal metrics are FPR and FNR, which directly quantify over-blocking and under-detection, respectively. The protocol involves evaluating a defense's binary output (benign/malicious) on each prompt, then aggregating rates over the labeled ground truth sets (Wang et al., 28 Aug 2025).

No absolute prompt count or per-category breakdown is specified; only aggregate metrics over {Easy, Medium, Hard} tiers and over three datasets—DataSentinel-Bench, PromptSleuth-Bench, and AgentDojo—are reported.

4. Comparative Defense Analysis

PromptSleuth-Bench enables rigorous, cross-methodology benchmarking. For selected defenses and datasets, published aggregate rates demonstrate clear separation:

Defense	Dataset	FPR	FNR
DataSentinel	DataSentinel-Bench	0.0000	0.0000
PromptArmor	PromptSleuth-Bench	0.0926	0.0825
PromptSleuth-5	PromptSleuth-Bench	0.0008	0.0007
SecAlign	AgentDojo	0.3855	0.8664

Notable phenomena from these results:

On PromptSleuth-Bench, PromptSleuth (GPT-4.1-mini and GPT-5-mini variants) achieves near-zero FNR, outperforming both DataSentinel and template baselines.
Medium and Hard tiers expose critical weaknesses: traditional template defenses exhibit FPR as high as $1.00$ (medium) and $0.83-0.90$ (hard), or FNR exceeding 0.99, indicating severe over-blocking or missed attacks in structurally complex scenarios.
Easy tier remains trivial for template- and rule-based methods.

PromptSleuth-Bench thus exposes the brittleness of surface-pattern-based defenses under adversarial composition and semantic obfuscation, and highlights the advantage of intent-invariant reasoning frameworks (Wang et al., 28 Aug 2025).

5. Usage and Implementation Guidelines

Operationalization of PromptSleuth-Bench requires:

Acquiring the dataset, consisting of folders Easy, Medium, and Hard with prompts labeled as benign or malicious.
For each prompt $P$ , the defense under test (typically an LLM classifier or semantic intent module) outputs a binary classification.
Aggregation of FPR and FNR over ground truth subsets.
Filtering and analysis by attack class, tier, or other configuration parameters (such as backbone model or detection threshold).

There is no reference implementation ("benchmark runner"), but this evaluation pipeline is directly specified (Wang et al., 28 Aug 2025).

6. Context Within the Prompt Security Ecosystem

PromptSleuth-Bench is positioned as a superset of DataSentinel-Bench and AgentDojo, covering both single-task and complex, multi-task attack modalities. Its introduction aligns with parallel lines of research investigating prompt injection, prompt extraction, and prompt sensitivity in LLMs (Wang et al., 2024, Razavi et al., 9 Feb 2025, Polo et al., 2024). Unlike extraction-focused benchmarks (Raccoon) or sensitivity studies (PromptSET), PromptSleuth-Bench is squarely focused on the detection of semantic and structural manipulations that subvert model intent, particularly in the presence of evolving, multi-intent adversaries.

Notably, PromptSleuth-Bench reveals the inadequacy of surface or template-based defenses, particularly as adversaries deploy obfuscation, splitting, or intent-camouflaged strategies. The benchmark’s coverage and difficulty spectrum are critical for driving advances in intent-discriminative defenses and semantic reasoning models (Wang et al., 28 Aug 2025).

7. Significance and Perspectives

PromptSleuth-Bench represents a systematic, unified methodology for evaluating prompt injection resilience. By extending the benchmark to multi-task, context-camouflaged, and behaviorally manipulative attack patterns, it enables comprehensive stress-testing and future-proofs the evaluation of LLM-centric security mechanisms.

A plausible implication is that benchmarks of this type will catalyze the development of new classes of intent-invariant and semantically-aware defense algorithms, as well as meta-evaluation methodologies capable of dynamic threat surface extension. Future work may focus on further extending PromptSleuth-Bench with generative adversarial attacks, multi-turn dialogue injection, and integration with real-time, production LLM systems (Wang et al., 28 Aug 2025).