Prompt Injection Datasets
- Prompt injection datasets are comprehensive collections that enable the evaluation of LLM defenses against both direct and indirect adversarial prompts.
- They employ varied methodologies, including automated variant generation, multilabel annotation, and multimodal segmentation, to simulate real-world attack scenarios.
- They provide actionable metrics such as attack success rate, tool invocation, and defense precision, guiding the development of more resilient security measures.
Prompt injection datasets provide the empirical foundation for evaluating, benchmarking, and strengthening defenses against the manipulation of LLMs via adversarial prompts. These corpora encompass both direct and indirect injection attacks, test cases for detection and over-defense assessment, variant generation for robustness benchmarking, and multimodal manipulations relevant to web agent and application security. The construction, diversity, labeling protocols, and analytic methodologies vary considerably between datasets, reflecting the evolving complexity of threat models, defense architectures, and research priorities.
1. Dataset Taxonomy and Construction Principles
Prompt injection datasets can be stratified by attack modality, attack source, diversity of content, and annotation protocols:
- Direct vs. Indirect Attacks: Many early corpora center on direct attacks, where an adversary operates as the user. Recent benchmarks, notably LLMail-Inject (Abdelnabi et al., 11 Jun 2025), foreground indirect attacks by embedding adversarial instructions into external data streams, such as emails, retrieved documents, or search results. Datasets such as that built by (Chen et al., 23 Feb 2025) systematize direct comparison between clean and injection-modified documents using SQuAD and TriviaQA-derived samples.
- Manual vs. Automated Generation: The Maatphor framework (Salem et al., 2023) automates prompt variant generation using LLM-guided mutation with a feedback loop, enabling structured coverage over multiple attack styles and increasing diversity beyond hand-crafted samples. Universal gradient-based methods (Liu et al., 2024) propose formal loss minimization and enable synthetic injection string generation that generalizes far beyond training examples.
- Attack Type Representation: Datasets systematically span override, completion, combined, context manipulation, data exfiltration, cross-context contamination, and obfuscation, as seen in PromptShield (Jacob et al., 25 Jan 2025), Securing AI Agents (Ramakrishnan et al., 19 Nov 2025), and WAInjectBench (Liu et al., 1 Oct 2025).
- Scale and Scope: LLMail-Inject is the largest currently, with 208,095 unique attack prompts, whereas curated evaluation sets such as NotInject (Li et al., 2024) focus on a smaller number of carefully filtered benign samples for over-defense diagnosis.
2. Schema, Annotation, and Labeling Protocols
Datasets vary in labeling granularity and annotation pipelines:
- Raw Submission Metadata: LLMail-Inject stores highly detailed JSON records for every attack submission, including team identifiers, timestamps, scenario, objectives (per email), and LLM outputs. Defense efficacy is tagged at the individual submission level.
- Binary Classification: HackAPrompt-sourced datasets (Shaheer et al., 14 Dec 2025, Jacob et al., 25 Jan 2025) and the SQuAD/TriviaQA-based indirect injection benchmark (Chen et al., 23 Feb 2025) use Boolean malicious/benign labels, sometimes with attack-type metadata.
- Multilabel Evaluation/Annotations: Datasets supporting defense benchmarking may include labels for attack success, defense bypass, benign/false-positive (as in NotInject (Li et al., 2024)), LLM-annotator labels (attack vs. unclear vs. clean), and sub-level completion flags.
- Injection Variant Metadata: Maatphor and gradient-based universal datasets store iteration indices, variant lineage, template position, success rates, required/forbidden phrases, and methods used for evaluation (string match, similarity).
- Multimodal Segmentation: WAInjectBench (Liu et al., 1 Oct 2025) organizes both text segments and image samples, annotating for explicit instructions and observer modality (embedded image, screenshot).
3. Domain Coverage, Attack Taxonomies, and Defense Benchmarks
Datasets catalog attacks in multiple operational domains and threat scenarios:
- Email, Web, and Application Workflows: LLMail-Inject simulates an LLM-based email assistant that interacts with real external emails, driving agents to invoke tool calls under attacker influence.
- Retrieval-Augmented Generation (RAG) Agents: Securing AI Agents (Ramakrishnan et al., 19 Nov 2025) delineates five categories (direct injection, context manipulation, override, data exfiltration, cross-context contamination) across customer support, finance, and similar domains.
- Web Agents and Multimodal Threats: WAInjectBench structures attacks by attacker goal, capability (HTML injection, image perturbation), and knowledge level, spanning both text (comment, email, interface) and image-based (adversarial perturbation, popups) routes.
- Benign/Over-Defense Evaluation: NotInject (Li et al., 2024) tests defense models’ false-positive rates by embedding high-risk trigger words into strictly non-malicious sentences, crossing four topical categories (queries, techniques, virtual creation, multilingual).
4. Evaluation Metrics and Empirical Benchmarks
Quantitative metrics underpin dataset utility:
- Attack Success Rate (ASR): Fraction of attack submissions yielding desired adversarial side-effects (unauthorized tool call, injected output string). LLMail-Inject reports ASR ~ 0.8% for Phase 1, declining with defense sophistication.
- Tool Invocation Rate: In LLMail-Inject, proportion of trials generating an external action (send_email invocation), used as a sub-level benchmark.
- Team Success Rate (TSR): Measures collective success per defense/scenario; LLMail-Inject reports important inter-defense and inter-LLM TSR deltas (e.g., LLM Judge ≈ 0.32; TaskTracker ≈ 0.44).
- Defense Recall/Precision/F₁/ROC: PromptShield, InjecGuard, and related works report standard classification metrics, with particular focus on low-FPR thresholds, crucial for practical deployments with benign-heavy traffic. NotInject provides over-defense accuracy; InjecGuard achieves 87.3% in that regime (Li et al., 2024).
- Variant Success: Maatphor computes per-variant success rates under repeated trials, quantifies diversity by n-gram overlap, and measures convergence speed (best variant ≥60% in ≤40 iterations).
- Removal and Mitigation Efficacy: The indirect injection defense study (Chen et al., 23 Feb 2025) reports both detection TPR/FPR and removal rates for segmentation/extraction models, achieving ≥95–100% removal for certain attack positions and types.
5. Access, Licensing, and Practical Utility
Open science principles prevail:
- Repository and Licensing: Most datasets are released under permissive licenses (MIT, Apache 2.0), with URLs provided in each paper. LLMail-Inject and WAInjectBench host extensive metadata and code on HuggingFace and GitHub.
- Data Format: JSON Lines (.jsonl) or NDJSON is standard, with clearly documented field schemas. Large-scale structured sets support both traditional ML and neural model training pipelines.
- Benchmark Guidance: Authors recommend against training production models directly on attack corpora to prevent leakage of adversarial techniques; LLMail-Inject, for instance, warns about overfitting to exploit strategies.
- Evaluation Use Cases: Datasets serve to calibrate new detectors, stress-test prompt guardrails, develop end-to-end defense pipelines, assess adaptation and transferability between attack styles, and systematically benchmark model vulnerabilities.
6. Research Significance, Findings, and Implications
Prompt injection datasets have driven key advances:
- Adaptive Adversary Coverage: LLMail-Inject’s adaptive challenge exposed limitations of “static” defenses and revealed a strong recall boost for ensemble methods (from 0.6 to ≈0.998 on attacks triggering send_email).
- Variant and Universality: Maatphor and universal-gradient datasets established the necessity of systematically generated variants; reliance on hand-crafted prompts underestimates attack viability. Automated variants reach ASR levels—static (0.81), semi-dynamic (0.39), dynamic (0.50)—untouchable by manual baselines (Liu et al., 2024).
- Over-Defense Diagnosis: NotInject’s high trigger word concentration illustrated how prior guardrails falter, dropping benign classification accuracy to near random guessing levels (60%); retrained detectors do markedly better (Li et al., 2024).
- Modalities and Adaptation: WAInjectBench demonstrated that image- and text-based attacks require concurrent multimodal detectors; context-agnostic models fail against imperceptible or obfuscated adversarial inputs.
- Indirect Attack/Defense Cycle: Segmentation and extraction-based mitigation (Chen et al., 23 Feb 2025) push detection research into post-processing and active removal, subsequently driving down attack persistence (ASR).
A plausible implication is that robust, scalable prompt injection defenses are contingent on high-fidelity, diverse, and systematically labeled datasets—especially those representing adaptive adversaries, variants, indirect channels, and over-defense traps—across both unimodal and multimodal domains. Continued open dissemination and method transparency are essential for further research progress and practical safeguard deployment.