Open-Prompt-Injections Benchmark
- Open-prompt injection benchmarks are standardized frameworks that evaluate the robustness of LLMs and autonomous agents against adversarial payloads.
- They incorporate diverse attack types such as direct override, goal hijacking, and context leakage across multiple modalities like text, images, and web actions.
- Reproducible evaluation pipelines with unified metrics (e.g., ASR, TPR, utility retention) drive systematic research on enhancing model resilience.
Open-Prompt-Injections Benchmark
Open-prompt-injections benchmarks are publicly available standardized suites, frameworks, and datasets designed to rigorously evaluate the robustness of LLMs, retrieval-augmented generation (RAG) systems, autonomous web agents, and their defenses against prompt injection attacks. These attacks are characterized by the adversarial insertion of commands or payloads into user input, retrieved context, or application data, with the goal of diverting the model from its intended task or subverting safety controls. Recent benchmarks span diverse modalities (text, image, web action), attack vectors (direct override, goal hijacking, leakage), and application settings, and have catalyzed systematic, reproducible research in measuring and improving the resistance of LLM-integrated systems to adversarial prompt manipulation.
1. Formal Foundations and Threat Modeling
Open-prompt-injection benchmarks formalize the adversarial interaction as a transformation of the canonical prompt —where is the application/system instruction and is user or context data—into a compromised prompt . Here, an attacker crafts such that the model carries out an injected task rather than the intended task (Liu et al., 2023).
Threat models are precisely delineated:
- Source of injection: user input, retrieved documents, external context, or environmental factors (webpages, images).
- Attacker capability: black-box (output queries), white-box (full model knowledge), partial control (e.g., posting in restricted webpage elements) (Evtimov et al., 22 Apr 2025, Ramakrishnan et al., 19 Nov 2025).
- Goal types:
- Jailbreak: overriding policies (e.g., “Ignore all previous instructions and…”).
- Goal/Task Hijacking: diverting from benign to attacker-specified output.
- Prompt/Context Leakage: extracting hidden system prompts, credentials, or internal state (Li et al., 2024, Ramakrishnan et al., 19 Nov 2025).
Operations may be evaluated under simple concatenative models or more complex multi-turn, multi-modal agentic settings.
2. Benchmark Design: Attack Taxonomy, Datasets, and Scenarios
Benchmarks systematically assemble diverse sets of malicious and benign samples to capture the landscape of prompt-injection threats.
Attack Taxonomies
- Direct Override: Explicit commands to ignore preceding instructions.
- Indirect/Contextual: Subtle manipulations, e.g., embedding commands in retrieved content, obfuscation, instruction wrapping, semantic drift (Wang et al., 28 Aug 2025, Ramakrishnan et al., 19 Nov 2025).
- Extraction/Leakage: Coaxing disclosure of confidential model context (system prompt, API keys, etc.) (Li et al., 2024).
GenTel-Bench partitions into three major categories (jailbreak, goal hijacking, leakage) and 28 real-world risk scenarios (e.g., violence, privacy breach, IP theft), generating over 84,000 malicious prompts and an equal number of balanced benign controls (Li et al., 2024). Multi-category datasets such as WAInjectBench incorporate both textual and image-based attacks, including imperceptible image perturbations that prompt web agents to execute unauthorized actions (Liu et al., 1 Oct 2025).
Scenario Composition
- Task Diversity: Seven to ten core application domains—QA, code, summarization, translation, sentiment, duplicate detection, etc. (Liu et al., 2023, Pan et al., 1 May 2025, Ganiuly et al., 3 Nov 2025).
- Modalities: Structured text, conversational data, RAG/retrieval context, agent state, UI elements, screenshots (Jacob et al., 25 Jan 2025, Ramakrishnan et al., 19 Nov 2025, Liu et al., 1 Oct 2025).
- Injection Points: Webpage comment fields/URLs (WASP), email/messages (WAInjectBench), RAG-retrieved passages (Securing AI Agents Benchmark), user/completion parts of application prompts (PromptShield, GenTel-Bench).
3. Evaluation Pipelines, Metrics, and Protocols
All benchmarks define reproducible, multi-stage pipelines and unified metrics:
Pipeline Stages
- Dataset curation: Assembling and annotating malicious/benign samples; balancing per scenario (Li et al., 2024, Jacob et al., 25 Jan 2025, Liu et al., 1 Oct 2025).
- Attack generation: Handcrafted, template-based, LLM-paraphrased (GenTel-Bench); adaptive search (OET); fuzzing (PROMPTFUZZ) (Yu et al., 2024, Pan et al., 1 May 2025).
- Agent/model under test: Configurable to closed/open LLMs, web agents, RAG pipelines, tool-calling loops (Evtimov et al., 22 Apr 2025, Ramakrishnan et al., 19 Nov 2025).
- Automated evaluation: Rule-based, LLM-based, or classifier-based scoring of attack success, utility retention, and defensive bypass.
Standard Metrics
| Metric | Formula / Description |
|---|---|
| Attack Success Rate (ASR) | |
| Attack Success Probability (ASP) | Fraction of successful + hesitant (uncertain) responses (Wang et al., 20 May 2025) |
| Utility Under Attack | Rate at which benign task still completes under adversarial conditions (WASP) |
| False Positive Rate (FPR) | |
| True Positive Rate (TPR), Recall | |
| Semantic/Fidelity Metrics | BLEU, embedding cosine similarity (IIM in (Ganiuly et al., 3 Nov 2025)) |
| Task Performance Retention | Defense-on/off output accuracy ratio (Ramakrishnan et al., 19 Nov 2025) |
Advanced frameworks (e.g., Prompt Injection as an Emerging Threat) introduce metrics such as Resilience Degradation Index (RDI), Safety Compliance Coefficient (SCC), Instructional Integrity Metric (IIM), and Unified Resilience Score (URS) to jointly quantify performance and semantic shift (Ganiuly et al., 3 Nov 2025).
4. Detectors, Defenses, and Comparative Analysis
Benchmarks operationalize both preventive and detective strategies on standardized splits.
- Detection-based defenses: LLM-based classifiers, embedding-based anomaly detectors, fine-tuned models (e.g., GenTel-Shield (Li et al., 2024), PromptShield (Jacob et al., 25 Jan 2025)), and semantic intent reasoning (PromptSleuth (Wang et al., 28 Aug 2025)).
- Prevention-based defenses: Prompt paraphrasing, input isolation, system-prompt guardrails, multi-stage anomaly filters (Liu et al., 2023, Ramakrishnan et al., 19 Nov 2025).
- Response verification and alignment defenses: Rule-based and transformer-based behavioral consistency checking (Ramakrishnan et al., 19 Nov 2025).
- Composed, layered defense: e.g., the WASP and Securing AI Agents Benchmarks use pipelined filtering, guardrails, and response checks.
Empirical findings are consistently benchmarked using TPR@FPR , accuracy, and class-level breakdowns, revealing that larger models, while achieving superior utility, are not inherently more resilient; alignment and guardrail sophistication are pivotal (Jacob et al., 25 Jan 2025, Ganiuly et al., 3 Nov 2025).
5. Key Empirical Findings, Attack/Defense Results, and Limitations
Recent large-scale evaluations document the following principal findings:
- Attack Efficacy: Sophisticated prompt injections (Combined Attack, Hypnotism, ignore-prefix) reach ASS/ASP/ASR of $0.8$–$1.0$ on major LLMs; even highly aligned models sustain nonzero vulnerability rates (Liu et al., 2023, Wang et al., 20 May 2025, Pan et al., 1 May 2025).
- Model Alignment: Models with RLHF or multi-stage safety tuning (e.g., GPT-4, Llama-3) are more robust (ASP ), while moderate-scale open models are distinctly less resilient (Wang et al., 20 May 2025, Ganiuly et al., 3 Nov 2025).
- Detection/Defense Effectiveness: Fine-tuned detectors (PromptShield, GenTel-Shield) achieve TPR@1% FPR, whereas prompt-guarding, embedding, or template-based methods rapidly degrade under adaptive/multi-task attacks.
- Modality and Visibility: Web agent injection attacks that lack explicit textual cues or leverage imperceptible image perturbations evade all known detectors (TPR ) (Liu et al., 1 Oct 2025).
- Retention of Utility: Defensive layers introduce only moderate decreases in baseline utility (2–5%), indicating feasible protection with careful tuning (Ramakrishnan et al., 19 Nov 2025).
Limitations cited across benchmark papers include restricted domain/task coverage, dependence on English-only prompts, reliance on handcrafted attack generation, and potential bias from LLM-based scoring or detection (Liu et al., 2023, Evtimov et al., 22 Apr 2025, Jacob et al., 25 Jan 2025, Ramakrishnan et al., 19 Nov 2025).
6. Usage Guidelines, Reproducibility, and Community Integration
Open-prompt-injection benchmarks are universally published with code/scripts, Docker or container support, and documented APIs. Core practices for adoption in research:
- Clone the provided repositories, install standardized dependencies (PyPI requirements or Conda environment) (Li et al., 2024, Jacob et al., 25 Jan 2025, Ramakrishnan et al., 19 Nov 2025).
- Use containerized environments and fixed random seeds for replicability (Ganiuly et al., 3 Nov 2025).
- Standardize protocol: define attack/test splits, preprocess through unified script pipelines, and log outputs for benchmark-specific metrics.
- Add new tasks, models, attacks, or defensive modules via modular plug-ins; contributions are encouraged with PRs on platforms such as GitHub.
- Benchmark suitability spans LLM-integrated applications, RAG pipelines, autonomous web agents, and multi-modal systems.
Benchmarks such as GenTel-Bench, PromptShield, WAInjectBench, OET, WASP, and PromptSleuth are openly licensed, with clear schemas and scripts for extension (Li et al., 2024, Jacob et al., 25 Jan 2025, Liu et al., 1 Oct 2025, Pan et al., 1 May 2025, Evtimov et al., 22 Apr 2025, Wang et al., 28 Aug 2025). This reproducibility and extensibility ensure continued relevance and adaptation to emerging prompt injection modalities.
7. Future Directions and Open Challenges
Several community-endorsed directions and unsolved challenges are highlighted:
- Expansion to new languages, modalities (audio, video), and UI/desktop agents beyond web and text (Evtimov et al., 22 Apr 2025, Liu et al., 1 Oct 2025).
- Automated, optimization-based attack discovery and generation (rather than all hand-crafted) to emulate evolving adversaries (Pan et al., 1 May 2025, Yu et al., 2024).
- Robustness to stealthy, cross-modality, and multi-task prompt-injection attacks; formal guarantees and certified defenses remain open (Liu et al., 1 Oct 2025, Wang et al., 28 Aug 2025).
- Unified, extensible benchmarks that combine static, adaptive, and agentic/in-context attack settings for head-to-head evaluation.
- Mitigation advances: cryptographic prompt signing, real-time detection, adversarial fine-tuning, and continual updating of attack scenarios and defensive data (Li et al., 2024, Wang et al., 20 May 2025, Ramakrishnan et al., 19 Nov 2025).
In summary, open-prompt-injection benchmarks form a rigorous, evolving foundation for evaluating the adversarial robustness of modern LLM-integrated systems, driving research towards practical and theoretically robust mitigation of prompt injection threats across modalities and applications.