LLMail-Inject Dataset
- LLMail-Inject dataset is a large-scale corpus for evaluating adversarial prompt injection attacks via email channels.
- It organizes 171,999 prompt pairs across five attack categories, enabling precise benchmarking of detection frameworks like ZEDD.
- The dataset supports advances in instruction-data separation and LLM pipeline security through detailed annotations and systematic preprocessing.
The LLMail-Inject dataset is a large-scale corpus created to enable systematic, realistic evaluation of adversarial prompt injection attacks against LLM-powered systems, specifically focusing on indirect injections through email channels. Originating from Microsoft’s LLMail-Inject Adaptive Prompt Injection Challenge, the dataset aggregates adversarial email prompts authored by security researchers, red teams, and open challenge participants. LLMail-Inject serves both as a primary evaluation resource for state-of-the-art detection frameworks—such as @@@@1@@@@ (ZEDD)—and as a benchmark for empirical and theoretical advances in instruction-data separation, adversarial robustness, and LLM pipeline security (Sekar et al., 18 Jan 2026, Abdelnabi et al., 11 Jun 2025).
1. Dataset Origin and Collection Protocol
The LLMail-Inject dataset is seeded from the publicly available LLMail-Inject challenge repository, which collected adversarial email payloads intended to trigger unauthorized tool calls in simulated LLM-based email assistants. Participants submitted crafted emails with embedded or obfuscated instructions, aiming to manipulate the assistant’s behavior without explicit input from the end user. The core scenario involves an LLM-powered assistant with access to a synthetic inbox, ability to answer user queries, and a single tool call (send_email_*), which adversaries sought to activate covertly.
The challenge incorporated multiple retrieval levels—ranging from simple recency-based retrieval to keyword-driven selection and exfiltration of embedded secrets. Defense strategies spanned preventive (Spotlighting), detective (Prompt Shield, TaskTracker), and adjudicator (LLM Judge) models. Over two phases (Dec 2024–Apr 2025), the challenge yielded 461,640 raw email submissions from 839 participants, encompassing 208,095 unique attack attempts (Abdelnabi et al., 11 Jun 2025).
2. Dataset Structure and Categorization
Each record in the LLMail-Inject dataset comprises:
- Unique submission identifier (UUID).
- Email subject and body (attack payload).
- Scenario metadata (retrieval level, defense, LLM version).
- Boolean and categorical flags (retrieval status, defense evasion, tool-call success, output correctness).
- Timestamps (submission, scheduling, completion).
- LLM-based annotation fields (injection attempt, strategy classification).
Following collection, the data underwent systematic preprocessing:
- Deduplication reduced raw prompts from 461,640 to 179,920 unique items.
- Language filtering via FastText restricted the set to English (172,875 items).
- Automated category assignment with GPT-3.5-turbo labeled each prompt into five injection classes: Jailbreak (J), System Leak (SL), Task Override (TO), Encoding Manipulation (EM), and Prompt Confusion (PC).
- Each injected prompt was paired with a semantically aligned, LLM-generated “clean” version with malicious instructions removed.
- For benchmarking, an equal set of clean–clean pairs was assembled, wherein both variants result from independent constrained rewrites (Sekar et al., 18 Jan 2026).
The final corpus consists of 171,999 prompt pairs, evenly split between injected–clean and clean–clean variants. The test partition contains 51,603 pairs (25,801 clean–clean; 25,802 injected–clean) with proportional representation across all five attack categories.
3. Annotation Methodology and Labeling Biases
All filtering, labeling, and rewriting operations were performed automatically using GPT-3.5-turbo-0125, with no manual crowdsourcing or adjudication. Each prompt was tagged according to semantic attack goal. Clean rewrites were generated using a safety-driven system prompt, aiming to preserve the nominal “task” while removing adversarial content. The absence of inter-annotator agreement or human validation introduces the possibility of systematic LLM-induced labeling biases and noisy category boundaries. A plausible implication is that taxonomy assignment may not match human expert consensus for borderline cases. (Sekar et al., 18 Jan 2026)
4. Quantitative Properties and Splitting Scheme
Statistical analysis of the processed dataset reveals:
- Total size: 171,999 prompt pairs.
- Train/Test split: 70% for threshold calibration and embedder fine-tuning (≈120k pairs), 30% held out for final evaluation (≈51k pairs).
- Rewriting statistics: Average prompt length increases from ~1415 characters (raw injected) to ~1752 characters (final clean pairs), reflecting the verbosity added by LLM rewrites.
- Test partition composition: Maintains category proportionality consistent with the original injected challenge data.
Within training, 10% of pairs are used for fine-tuning the embedding encoder, with the remainder for fitting unsupervised detection thresholds. No variance data on prompt length is reported.
5. Evaluation Usage and Benchmarking Protocols
LLMail-Inject underpins benchmarking for prompt-injection detectors, notably ZEDD (Sekar et al., 18 Jan 2026). ZEDD’s workflow computes semantic drift via cosine similarity between embeddings of injected and clean prompts:
A two-component Gaussian Mixture Model (GMM) fit to drift scores supports threshold selection, subject to a clean false positive cap of 3%. Kernel Density Estimation is used when GMM fitting fails.
Benchmark results on the test slice (51,603 pairs):
| Encoder | Accuracy (%) | Precision (%) | Recall (adv) (%) | F1 (%) | Clean FPR (%) |
|---|---|---|---|---|---|
| SBERT | 90.75 | 99.65 | 81.78 | 89.84 | 1.7 |
| Llama-3 | 95.32 | 95.85 | 94.75 | 95.30 | 5.5 |
| Mistral | 95.55 | 96.58 | 94.45 | 95.50 | 2.3 |
| Qwen2 | 95.46 | 96.27 | 94.52 | 95.38 | 2.2 |
Category-wise detection rates exceed 86% in all classes, typically >90% for Encoding Manipulation (EM), Prompt Confusion (PC), and System Leak (SL). Jailbreak and Task Override are detected at 90–92% (Sekar et al., 18 Jan 2026).
6. Limitations and Future Extensions
The LLMail-Inject dataset exhibits several limitations:
- Category-Level Ambiguity: Detection precision is slightly reduced for Jailbreak, Encoding Manipulation, and System Leak instances, especially with SBERT embeddings, indicating some semantic similarity to benign prompts.
- Embedding Model Dependence: Performance is contingent on embedding encoder capacity and domain alignment. Resource-constrained or misaligned embedders may underperform, whereas very large encoders risk latency trade-offs.
- Scope Constraint: The corpus is currently restricted to email-formatted, indirect injection scenarios; semi-structured channels such as web forms and chat logs are not represented.
- Labeling Bias: Automatic category assignment by GPT-3.5 may propagate systematic biases and introduce class boundary noise (Sekar et al., 18 Jan 2026).
Anticipated future directions include broadening to other input modalities, few-shot drift calibration techniques, and adaptive thresholding schemes responsive to evolving embedding distributions.
7. Representative Samples and Research Applications
LLMail-Inject contains archetypal attack payloads including direct JSON instruction, base64-obfuscated commands, and special-token framed role prompts. Each is mapped to trigger specific LLM behaviors and tool calls, supporting granular analysis of instruction/data separation failures (Abdelnabi et al., 11 Jun 2025).
Potential research applications:
- Benchmarking and cross-comparison of defense strategies under controlled, realistic conditions.
- Adversarial training pipelines utilizing injection-labeled subsets for classifier or end-to-end system optimization.
- Retrieval and defense co-design studies, elucidating interactions between document ranking and model robustness.
- Structural assessment of instruction/data separation mechanisms (e.g., ASIDE frameworks).
- Ensemble and conformal blocklist method evaluation, targeting paraphrase-invariant detection (Abdelnabi et al., 11 Jun 2025).
LLMail-Inject thus establishes itself as a cornerstone dataset for advancing robust, adaptive, and scalable prompt injection defense methodologies in LLM-centric email and semi-structured input domains.