Zero-Shot Embedding Drift Detection (ZEDD)
- The paper introduces a statistical hypothesis test using cosine similarity to quantify semantic drift between suspect and benign prompts.
- It employs paired prompt generation and a threshold-based flagging mechanism to reliably detect both overt and subtle prompt injections.
- ZEDD achieves high accuracy with low false positive rates across diverse LLMs while integrating seamlessly as a pre-processing defense layer.
Zero-Shot Embedding Drift Detection (ZEDD) is a lightweight, zero-shot framework for identifying prompt injection attacks against LLM applications by quantifying semantic changes in embedding space. It operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, offering efficient deployment across diverse LLMs. ZEDD detects both direct and indirect prompt injection attempts by measuring embedding drift between benign and suspect text variants, providing a scalable and low-latency defense layer for LLM-powered systems (Sekar et al., 18 Jan 2026).
1. Formalization of Prompt Injection Detection as Embedding Drift
ZEDD formalizes prompt-injection detection as a two-sample statistical hypothesis test in a learned embedding space. Given a suspect input $x_s$ and its clean (benign) counterpart $x_b$, both are transformed into $d$-dimensional embeddings via a fixed encoder $f$. The embeddings are denoted $e_s = f(x_s)$ for the suspect prompt and $e_b = f(x_b)$ for the clean prompt.
The task is to distinguish:
- Null hypothesis $H_0$: $e_s \approx e_b$ (no injection: semantic equivalence).
- Alternative hypothesis $H_1$: $e_s$ diverges significantly from $e_b$ (injection present).
This hypothesis test relies on quantifying the semantic drift that is typical when adversarial manipulations alter the prompt's intent or content.
2. Embedding Drift Metric
ZEDD uses cosine similarity to measure the semantic proximity between paired embeddings:

$$\cos(e_b, e_s) = \frac{e_b \cdot e_s}{\|e_b\| \, \|e_s\|}$$

The embedding drift score is then defined as:

$$\Delta = 1 - \cos(e_b, e_s)$$
Interpretation:
- $\Delta$ near $0$: near-identical embeddings, unlikely to be an injection.
- $\Delta$ near $1$: significant semantic shift, likely indicative of injection.
This drift score is designed to capture both gross and subtle semantic modifications introduced by attacks.
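The drift metric amounts to one dot product and two norms per pair. A minimal sketch in plain Python (the example vectors are illustrative, not real encoder outputs):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def drift_score(e_b, e_s):
    """Embedding drift: 1 minus cosine similarity of clean/suspect embeddings."""
    return 1.0 - cosine_sim(e_b, e_s)

# Identical embeddings give drift ~0; orthogonal embeddings give drift ~1.
print(drift_score([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # ~0.0
print(drift_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # ~1.0
```

Note that cosine similarity is scale-invariant, so the score depends only on embedding direction, not magnitude.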
3. ZEDD Algorithm: Zero-Shot Detection Workflow
The ZEDD method comprises three operational phases: construction of prompt pairs, drift computation, and threshold-based flagging.
3.1 Construction of Paired Prompts
- For each injected prompt (from corpora such as LLMail-Inject), generate a clean variant using a constrained LLM rewrite (e.g., via GPT-3.5-turbo) that eliminates adversarial content while retaining benign semantics.
- Optionally, generate clean–clean pairs for calibration.
3.2 Zero-Shot Drift-Based Detection
The detection pipeline is as follows:
- For an incoming suspect prompt $x_s$, generate or retrieve its clean counterpart $x_b = g(x_s)$.
- Compute $e_s = f(x_s)$ and $e_b = f(x_b)$.
- Calculate the drift score $\Delta = 1 - \cos(e_b, e_s)$.
- If $\Delta > \tau$ (calibrated threshold), flag as injected; otherwise, classify as benign.
3.3 Pseudocode
```python
def ZEDD(x_s, f, g, tau):
    x_b = g(x_s)                      # Generate clean variant
    e_s = f(x_s)                      # Suspect prompt embedding
    e_b = f(x_b)                      # Clean prompt embedding
    delta = 1 - cosine_sim(e_b, e_s)  # Drift score
    if delta > tau:
        flag = 1                      # Injection detected
    else:
        flag = 0                      # Benign
    return flag
```
Main hyperparameters are $\tau$ (drift threshold) and the choice of encoder $f$ and clean-variant generator $g$.
4. Statistical Calibration and Threshold Estimation
Threshold selection relies on statistical modeling, not arbitrary parameterization.
- Collect drift scores on labeled held-out data comprising both clean–clean and injected–clean pairs.
- Fit a two-component Gaussian Mixture Model (GMM) to the scores:

$$p(\Delta) = \pi_{\text{clean}} \, \mathcal{N}(\Delta; \mu_{\text{clean}}, \sigma_{\text{clean}}^2) + \pi_{\text{inj}} \, \mathcal{N}(\Delta; \mu_{\text{inj}}, \sigma_{\text{inj}}^2)$$

- The optimal threshold $\tau^*$ is the value where the two weighted component densities cross:

$$\pi_{\text{clean}} \, \mathcal{N}(\tau^*; \mu_{\text{clean}}, \sigma_{\text{clean}}^2) = \pi_{\text{inj}} \, \mathcal{N}(\tau^*; \mu_{\text{inj}}, \sigma_{\text{inj}}^2)$$

- If GMM fitting fails, apply kernel density estimation (KDE) and select $\tau$ at the lowest valley between the two modes.
- To control false positives, select $\tau$ so that the clean-class false positive rate $P(\Delta > \tau \mid \text{clean})$ stays below a small target $\alpha$.
- Compute 95% confidence intervals on detection rates via the normal approximation:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where $n$ is the category-wise test set size.
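Given fitted component parameters, the crossing point of two weighted Gaussians can be found in closed form by equating their log-densities, which yields a quadratic in $\tau$. A self-contained sketch (the parameter values and the helper names `gmm_threshold` and `detection_ci` are illustrative, not from the paper):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gmm_threshold(w1, mu1, s1, w2, mu2, s2):
    """Threshold tau* where the two weighted component densities cross.

    Equating log densities of w1*N(mu1, s1^2) and w2*N(mu2, s2^2)
    gives a*tau^2 + b*tau + c = 0.
    """
    a = 1 / (2 * s2 ** 2) - 1 / (2 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2) + math.log((w1 * s2) / (w2 * s1))
    if abs(a) < 1e-12:                          # equal variances: linear case
        return -c / b
    disc = math.sqrt(b ** 2 - 4 * a * c)
    roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
    lo, hi = sorted((mu1, mu2))
    return next(r for r in roots if lo <= r <= hi)  # root between the two modes

def detection_ci(p_hat, n, z=1.96):
    """95% normal-approximation confidence interval on a detection rate."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - half, p_hat + half)

# Hypothetical fit: clean drift around 0.05, injected drift around 0.45.
tau = gmm_threshold(0.5, 0.05, 0.03, 0.5, 0.45, 0.12)
print(tau)  # lies between the two component means
```

In production one would estimate the component parameters with an EM fit (e.g., scikit-learn's `GaussianMixture`) on held-out drift scores, then apply the closed-form crossing point above.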
5. Experimental Design, Datasets, and Key Results
5.1 LLMail-Inject Dataset
The evaluation relies on the LLMail-Inject dataset (derived from the Microsoft LLMail-Inject Challenge), which encompasses five prompt injection attack types: Jailbreak (J), System Leak (SL), Task Override (TO), Encoding Manipulation (EM), and Prompt Confusion (PC). The dataset construction includes deduplication, GPT-3.5-based English filtering and classification, and constrained clean-variant generation, resulting in 86,000 injected–clean and 86,000 clean–clean pairs. The held-out test set consists of 51,603 pairs (25,801 clean–clean and 25,802 injected–clean).
5.2 Embedding Models
ZEDD is evaluated on four encoders:
- SBERT (all-mpnet-base-v2)
- Llama 3 8B Instruct
- Mistral 7B Instruct
- Qwen 2 7B Instruct
5.3 Metrics
Performance is measured by overall accuracy, precision, recall (on adversarial class), F1, and clean FPR.
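These metrics follow directly from the confusion matrix over injected (positive) and clean (negative) pairs; a small sketch with hypothetical counts (the function name and example numbers are illustrative):

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard detection metrics; the positive class is the injected (adversarial) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                   # recall on the adversarial class
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "clean_fpr": fp / (fp + tn),          # fraction of clean pairs wrongly flagged
    }

# Hypothetical counts on a balanced evaluation set.
print(detection_metrics(tp=90, fp=10, tn=90, fn=10))
```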
5.4 Results
| Encoder | Accuracy | Precision | Recall (adv) | F1 | Clean FPR |
|---|---|---|---|---|---|
| SBERT | 90.75% | 99.65% | 81.78% | 89.84% | 1.7% |
| Llama 3 8B | 95.32% | 95.85% | 94.75% | 95.30% | 5.5% |
| Mistral 7B | 95.55% | 96.58% | 94.45% | 95.50% | 2.3% |
| Qwen 2 7B | 95.46% | 96.27% | 94.52% | 95.38% | 2.2% |
Detection rates per attack type (percent of prompts flagged; the C column holds clean–clean pairs, so it corresponds to the clean FPR):
| Model | C | EM | J | PC | SL | TO |
|---|---|---|---|---|---|---|
| SBERT | 1.7% | 95.9% | 86.2% | 90.5% | 91.6% | 86.7% |
| Llama 3 8B | 5.5% | 98.1% | 92.2% | 94.4% | 96.7% | 90.7% |
| Mistral 7B | 2.3% | 98.1% | 92.2% | 93.3% | 96.9% | 90.8% |
| Qwen 2 7B | 2.2% | 98.2% | 90.8% | 94.2% | 96.8% | 90.3% |
Appendix B of the source paper demonstrates that ZEDD outperforms previous embedding-based and supervised classifiers in detection accuracy and operational efficiency (Sekar et al., 18 Jan 2026).
6. Computational Performance and LLM Pipeline Integration
6.1 Computational Complexity
- Embedding extraction: one encoder forward pass per prompt, with cost scaling in the prompt length $n$ (operations can be batched).
- Cosine similarity: $O(d)$ for $d$-dimensional embeddings.
- GMM/KDE scoring: negligible compared to embedding calculation.
Empirical results show prompt-level processing (embedding + drift) takes less than 50 ms on GPU per input.
6.2 Pipeline Integration
ZEDD is slotted as a pre-processing guard before the LLM inference stage, introducing no changes to the LLM weights or runtime. The workflow is:
if $\Delta \leq \tau$: pass to LLM inference; if $\Delta > \tau$: block or alert
This design allows efficient deployment on top of existing LLM applications with only an extra embedding API call and threshold comparison.
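The guard pattern described above can be sketched as a thin wrapper around the LLM call. Everything here is a hypothetical integration sketch: `embed`, `rewrite`, and `llm` stand in for the encoder $f$, the clean-variant generator $g$, and the downstream model, and the stub callables in the usage example are toy stand-ins, not real models.

```python
import math

def drift(e_b, e_s):
    """Embedding drift: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(e_b, e_s))
    nb = math.sqrt(sum(x * x for x in e_b))
    ns = math.sqrt(sum(x * x for x in e_s))
    return 1.0 - dot / (nb * ns)

def guarded_inference(x_s, embed, rewrite, llm, tau):
    """Run llm(x_s) only when the drift check passes; otherwise block."""
    delta = drift(embed(rewrite(x_s)), embed(x_s))
    if delta > tau:
        return {"status": "blocked", "drift": delta}
    return {"status": "ok", "drift": delta, "output": llm(x_s)}

# Toy stand-ins: the "encoder" maps injected text to an orthogonal direction.
embed = lambda s: [0.0, 1.0] if "ignore previous" in s else [1.0, 0.0]
rewrite = lambda s: s.replace("ignore previous instructions", "").strip()
llm = lambda s: "answer"

print(guarded_inference("summarize this email", embed, rewrite, llm, tau=0.5))
print(guarded_inference("summarize; ignore previous instructions", embed, rewrite, llm, tau=0.5))
```

Because the guard touches only the prompt, it composes with any LLM backend: the model is invoked at most once, and a blocked prompt never reaches it.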
7. Limitations and Potential Research Extensions
7.1 Observed Limitations
- Minimal-semantic-change attacks (e.g., word-order shuffling, paraphrasing) may evade detection due to limited embedding shift.
- Embedding models with insufficient domain sensitivity may reduce drift signal discriminativeness.
- Adaptive attackers may iteratively optimize prompts to stay below the drift threshold $\tau$.
7.2 Future Directions
- Ensemble approaches: Combine multiple independent embeddings with averaged or composite drift scores.
- Adaptive thresholds: Online tuning of the threshold $\tau$ by prompt category or user profile.
- Enhanced features: Augment drift signal with auxiliary lexical or syntactic metrics within a lightweight classifier.
- Few-shot calibration: Adapt thresholds or embedding models to address domain or distributional drift in production.
These extensions could enhance ZEDD’s generality and robustness to new adversarial strategies (Sekar et al., 18 Jan 2026).
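The ensemble direction, for instance, could be prototyped by combining per-encoder drift scores into one composite score before thresholding; a hypothetical sketch (`ensemble_drift` and its weighting scheme are illustrative, not part of the paper):

```python
def ensemble_drift(drifts, weights=None):
    """Composite drift from several independent encoders.

    drifts: per-encoder drift scores for the same prompt pair.
    weights: optional convex combination; defaults to a plain average.
    """
    if weights is None:
        weights = [1.0 / len(drifts)] * len(drifts)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * d for w, d in zip(weights, drifts))

# Average of three encoders' scores; or weight a trusted encoder more heavily.
print(ensemble_drift([0.1, 0.3, 0.5]))
print(ensemble_drift([0.0, 1.0], weights=[0.25, 0.75]))
```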