Zero-Shot Embedding Drift Detection (ZEDD)
- The paper introduces a statistical hypothesis test using cosine similarity to quantify semantic drift between suspect and benign prompts.
- It employs paired prompt generation and a threshold-based flagging mechanism to reliably detect both overt and subtle prompt injections.
- ZEDD achieves high accuracy with low false positive rates across diverse LLMs while integrating seamlessly as a pre-processing defense layer.
Zero-Shot Embedding Drift Detection (ZEDD) is a lightweight, zero-shot framework for identifying prompt injection attacks against LLM applications by quantifying semantic changes in embedding space. It operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, offering efficient deployment across diverse LLMs. ZEDD detects both direct and indirect prompt injection attempts by measuring embedding drift between benign and suspect text variants, providing a scalable and low-latency defense layer for LLM-powered systems (Sekar et al., 18 Jan 2026).
1. Formalization of Prompt Injection Detection as Embedding Drift
ZEDD formalizes prompt-injection detection as a two-sample statistical hypothesis test in a learned embedding space. Given a suspect input $x_s$ and its clean (benign) counterpart $x_b$, both are transformed into $d$-dimensional embeddings via a fixed encoder $f$. The embeddings are denoted $e_s = f(x_s)$ for the suspect prompt and $e_b = f(x_b)$ for the clean prompt.
The task is to distinguish:
- Null hypothesis $H_0$: $e_s \approx e_b$ (no injection: semantic equivalence).
- Alternative hypothesis $H_1$: $e_s$ diverges significantly from $e_b$ (injection present).
This hypothesis test relies on quantifying the semantic drift that is typical when adversarial manipulations alter the prompt's intent or content.
2. Embedding Drift Metric
ZEDD uses cosine similarity to measure the semantic proximity between paired embeddings:

$$\cos(e_b, e_s) = \frac{e_b \cdot e_s}{\|e_b\| \, \|e_s\|}$$

The embedding drift score is then defined as:

$$\Delta = 1 - \cos(e_b, e_s)$$
Interpretation:
- $\Delta$ near $0$: near-identical embeddings, unlikely to be an injection.
- $\Delta$ near $1$: significant semantic shift, likely indicative of injection.
This drift score is designed to capture both gross and subtle semantic modifications introduced by attacks.
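The drift metric amounts to one dot product and two norms per pair. A minimal sketch in plain Python (the example vectors are illustrative, not real encoder outputs):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def drift_score(e_b, e_s):
    """Embedding drift: 1 minus cosine similarity of clean/suspect embeddings."""
    return 1.0 - cosine_sim(e_b, e_s)

# Identical embeddings give drift ~0; orthogonal embeddings give drift ~1.
print(drift_score([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # ~0.0
print(drift_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # ~1.0
```

Note that cosine similarity is scale-invariant, so the score depends only on embedding direction, not magnitude.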
3. ZEDD Algorithm: Zero-Shot Detection Workflow
The ZEDD method comprises three operational phases: construction of prompt pairs, drift computation, and threshold-based flagging.
3.1 Construction of Paired Prompts
- For each injected prompt (from corpora such as LLMail-Inject), generate a clean variant using a constrained LLM rewrite (e.g., via GPT-3.5-turbo) that eliminates adversarial content while retaining benign semantics.
- Optionally, generate clean–clean pairs for calibration.
3.2 Zero-Shot Drift-Based Detection
The detection pipeline is as follows:
- For an incoming suspect prompt $x_s$, generate or retrieve its clean counterpart $x_b = g(x_s)$.
- Compute $e_s = f(x_s)$ and $e_b = f(x_b)$.
- Calculate the drift score $\Delta = 1 - \cos(e_b, e_s)$.
- If $\Delta > \tau$ (calibrated threshold), flag as injected; otherwise, classify as benign.
3.3 Pseudocode
```python
def ZEDD(x_s, f, g, tau):
    x_b = g(x_s)                      # Generate clean variant
    e_s = f(x_s)                      # Suspect prompt embedding
    e_b = f(x_b)                      # Clean prompt embedding
    delta = 1 - cosine_sim(e_b, e_s)  # Drift score
    if delta > tau:
        flag = 1                      # Injection detected
    else:
        flag = 0                      # Benign
    return flag
```
Main hyperparameters are $\tau$ (drift threshold) and the choice of encoder $f$ and clean-variant generator $g$.
4. Statistical Calibration and Threshold Estimation
Threshold selection relies on statistical modeling, not arbitrary parameterization.
- Collect drift scores on labeled held-out data comprising both clean–clean and injected–clean pairs.
- Fit a two-component Gaussian Mixture Model (GMM) to the scores:

$$p(\Delta) = \pi_{\text{clean}} \, \mathcal{N}(\Delta; \mu_{\text{clean}}, \sigma_{\text{clean}}^2) + \pi_{\text{inj}} \, \mathcal{N}(\Delta; \mu_{\text{inj}}, \sigma_{\text{inj}}^2)$$

- The optimal threshold $\tau^*$ is the value where the two weighted component densities cross:

$$\pi_{\text{clean}} \, \mathcal{N}(\tau^*; \mu_{\text{clean}}, \sigma_{\text{clean}}^2) = \pi_{\text{inj}} \, \mathcal{N}(\tau^*; \mu_{\text{inj}}, \sigma_{\text{inj}}^2)$$

- If GMM fitting fails, apply kernel density estimation (KDE) and select $\tau$ at the lowest valley between the two modes.
- To control false positives, select $\tau$ so that the clean-class false positive rate $P(\Delta > \tau \mid \text{clean})$ stays below a small target $\alpha$.
- Compute 95% confidence intervals on detection rates via the normal approximation:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $n$ is the category-wise test set size.
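Given fitted component parameters, the crossing point of two weighted Gaussians can be found in closed form by equating their log-densities, which yields a quadratic in $\tau$. A self-contained sketch (the parameter values and the helper names `gmm_threshold` and `detection_ci` are illustrative, not from the paper):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gmm_threshold(w1, mu1, s1, w2, mu2, s2):
    """Threshold tau* where the two weighted component densities cross.

    Equating log densities of w1*N(mu1, s1^2) and w2*N(mu2, s2^2)
    gives a*tau^2 + b*tau + c = 0.
    """
    a = 1 / (2 * s2 ** 2) - 1 / (2 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2) + math.log((w1 * s2) / (w2 * s1))
    if abs(a) < 1e-12:                          # equal variances: linear case
        return -c / b
    disc = math.sqrt(b ** 2 - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    lo, hi = sorted((mu1, mu2))
    return next(r for r in roots if lo <= r <= hi)  # root between the two modes

def detection_ci(p_hat, n, z=1.96):
    """95% normal-approximation confidence interval on a detection rate."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# Hypothetical fit: clean drift around 0.05, injected drift around 0.45.
tau = gmm_threshold(0.5, 0.05, 0.03, 0.5, 0.45, 0.12)
print(tau)  # lies between the two component means
```

In production one would estimate the component parameters with an EM fit (e.g., scikit-learn's `GaussianMixture`) on held-out drift scores, then apply the closed-form crossing point above.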
5. Experimental Design, Datasets, and Key Results
5.1 LLMail-Inject Dataset
The evaluation relies on the LLMail-Inject dataset (derived from the Microsoft LLMail-Inject Challenge), which encompasses five prompt injection attack types: Jailbreak (J), System Leak (SL), Task Override (TO), Encoding Manipulation (EM), and Prompt Confusion (PC). The dataset construction includes deduplication, GPT-3.5-based English filtering and classification, and constrained clean-variant generation, resulting in 86,000 injected–clean and 86,000 clean–clean pairs. The held-out test set consists of 51,603 pairs (25,801 clean–clean and 25,802 injected–clean).
5.2 Embedding Models
ZEDD is evaluated on four encoders:
- SBERT (all-mpnet-base-v2)
- Llama 3 8B Instruct
- Mistral 7B Instruct
- Qwen 2 7B Instruct
5.3 Metrics
Performance is measured by overall accuracy, precision, recall (on adversarial class), F1, and clean FPR.
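These metrics follow directly from the confusion matrix over injected (positive) and clean (negative) pairs; a small sketch with hypothetical counts (the function name and example numbers are illustrative):

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard detection metrics; the positive class is the injected (adversarial) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                   # recall on the adversarial class
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "clean_fpr": fp / (fp + tn),          # fraction of clean pairs wrongly flagged
    }

# Hypothetical counts on a balanced evaluation set.
print(detection_metrics(tp=90, fp=10, tn=90, fn=10))
```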
5.4 Results
| Encoder | Accuracy | Precision | Recall (adv) | F1 | Clean FPR |
|---|---|---|---|---|---|
| SBERT | 90.75% | 99.65% | 81.78% | 89.84% | 1.7% |
| Llama 3 8B | 95.32% | 95.85% | 94.75% | 95.30% | 5.5% |
| Mistral 7B | 95.55% | 96.58% | 94.45% | 95.50% | 2.3% |
| Qwen 2 7B | 95.46% | 96.27% | 94.52% | 95.38% | 2.2% |
Detection rates per attack type (percent of prompts flagged; the C column holds clean–clean pairs, so it corresponds to the clean FPR):
| Model | C | EM | J | PC | SL | TO |
|---|---|---|---|---|---|---|
| SBERT | 1.7% | 95.9% | 86.2% | 90.5% | 91.6% | 86.7% |
| Llama 3 8B | 5.5% | 98.1% | 92.2% | 94.4% | 96.7% | 90.7% |
| Mistral 7B | 2.3% | 98.1% | 92.2% | 93.3% | 96.9% | 90.8% |
| Qwen 2 7B | 2.2% | 98.2% | 90.8% | 94.2% | 96.8% | 90.3% |
Appendix B of the source paper demonstrates that ZEDD outperforms previous embedding-based and supervised classifiers in detection accuracy and operational efficiency (Sekar et al., 18 Jan 2026).
6. Computational Performance and LLM Pipeline Integration
6.1 Computational Complexity
- Embedding extraction: one encoder forward pass per prompt, with cost scaling in the prompt length $n$ (operations can be batched).
- Cosine similarity: $O(d)$ for $d$-dimensional embeddings.
- GMM/KDE scoring: negligible compared to embedding calculation.
Empirical results show prompt-level processing (embedding + drift) takes less than 50 ms on GPU per input.
6.2 Pipeline Integration
ZEDD is slotted as a pre-processing guard before the LLM inference stage, introducing no changes to the LLM weights or runtime. The workflow is:
if $\Delta \leq \tau$: pass to LLM inference; if $\Delta > \tau$: block or alert
This design allows efficient deployment on top of existing LLM applications with only an extra embedding API call and threshold comparison.
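The guard pattern described above can be sketched as a thin wrapper around the LLM call. Everything here is a hypothetical integration sketch: `embed`, `rewrite`, and `llm` stand in for the encoder $f$, the clean-variant generator $g$, and the downstream model, and the stub callables in the usage example are toy stand-ins, not real models.

```python
import math

def drift(e_b, e_s):
    """Embedding drift: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(e_b, e_s))
    nb = math.sqrt(sum(x * x for x in e_b))
    ns = math.sqrt(sum(x * x for x in e_s))
    return 1.0 - dot / (nb * ns)

def guarded_inference(x_s, embed, rewrite, llm, tau):
    """Run llm(x_s) only when the drift check passes; otherwise block."""
    delta = drift(embed(rewrite(x_s)), embed(x_s))
    if delta > tau:
        return {"status": "blocked", "drift": delta}
    return {"status": "ok", "drift": delta, "output": llm(x_s)}

# Toy stand-ins: the "encoder" maps injected text to an orthogonal direction.
embed = lambda s: [0.0, 1.0] if "ignore previous" in s else [1.0, 0.0]
rewrite = lambda s: s.replace("ignore previous instructions", "").strip()
llm = lambda s: "answer"

print(guarded_inference("summarize this email", embed, rewrite, llm, tau=0.5))
print(guarded_inference("summarize; ignore previous instructions", embed, rewrite, llm, tau=0.5))
```

Because the guard touches only the prompt, it composes with any LLM backend: the model is invoked at most once, and a blocked prompt never reaches it.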
7. Limitations and Potential Research Extensions
7.1 Observed Limitations
- Minimal-semantic-change attacks (e.g., word-order shuffling, paraphrasing) may evade detection due to limited embedding shift.
- Embedding models with insufficient domain sensitivity may reduce drift signal discriminativeness.
- Adaptive attackers may iteratively optimize prompts to stay below the drift threshold $\tau$.
7.2 Future Directions
- Ensemble approaches: Combine multiple independent embeddings with averaged or composite drift scores.
- Adaptive thresholds: Online tuning of the threshold $\tau$ by prompt category or user profile.
- Enhanced features: Augment drift signal with auxiliary lexical or syntactic metrics within a lightweight classifier.
- Few-shot calibration: Adapt thresholds or embedding models to address domain or distributional drift in production.
These extensions could enhance ZEDD’s generality and robustness to new adversarial strategies (Sekar et al., 18 Jan 2026).
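The ensemble direction, for instance, could be prototyped by combining per-encoder drift scores into one composite score before thresholding; a hypothetical sketch (`ensemble_drift` and its weighting scheme are illustrative, not part of the paper):

```python
def ensemble_drift(drifts, weights=None):
    """Composite drift from several independent encoders.

    drifts: per-encoder drift scores for the same prompt pair.
    weights: optional convex combination; defaults to a plain average.
    """
    if weights is None:
        weights = [1.0 / len(drifts)] * len(drifts)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * d for w, d in zip(weights, drifts))

# Average of three encoders' scores; or weight a trusted encoder more heavily.
print(ensemble_drift([0.1, 0.3, 0.5]))
print(ensemble_drift([0.0, 1.0], weights=[0.25, 0.75]))
```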