
Zero-Shot Embedding Drift Detection (ZEDD)

Updated 25 January 2026
  • The paper introduces a statistical hypothesis test using cosine similarity to quantify semantic drift between suspect and benign prompts.
  • It employs paired prompt generation and a threshold-based flagging mechanism to reliably detect both overt and subtle prompt injections.
  • ZEDD achieves high accuracy with low false positive rates across diverse LLMs while integrating seamlessly as a pre-processing defense layer.

Zero-Shot Embedding Drift Detection (ZEDD) is a lightweight, zero-shot framework for identifying prompt injection attacks against LLM applications by quantifying semantic changes in embedding space. It operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, offering efficient deployment across diverse LLMs. ZEDD detects both direct and indirect prompt injection attempts by measuring embedding drift between benign and suspect text variants, providing a scalable and low-latency defense layer for LLM-powered systems (Sekar et al., 18 Jan 2026).

1. Formalization of Prompt Injection Detection as Embedding Drift

ZEDD formalizes prompt-injection detection as a two-sample statistical hypothesis test in learned embedding space. Given a suspect input $x_s$ and its clean (benign) counterpart $x_b$, both are transformed into $d$-dimensional embeddings via a fixed encoder $f:\mathrm{Text}\to\mathbb{R}^d$. The embeddings are denoted $e_s = f(x_s)$ for the suspect prompt and $e_b = f(x_b)$ for the clean prompt.

The task is to distinguish:

  • Null hypothesis $H_0$: $e_s \approx e_b$ (no injection: semantic equivalence).
  • Alternative hypothesis $H_1$: $e_s$ diverges significantly from $e_b$ (injection present).

This hypothesis test relies on quantifying the semantic drift that is typical when adversarial manipulations alter the prompt's intent or content.

2. Embedding Drift Metric

ZEDD uses cosine similarity to measure the semantic proximity between paired embeddings:

$\mathrm{cos\_sim}(e_b, e_s) = \dfrac{e_b \cdot e_s}{\|e_b\|\,\|e_s\|}$

The embedding drift score $\Delta$ is then defined as:

$\Delta(e_b, e_s) = 1 - \mathrm{cos\_sim}(e_b, e_s) = 1 - \dfrac{e_b \cdot e_s}{\|e_b\|\,\|e_s\|}$

Interpretation:

  • $\Delta \approx 0$: Near-identical embeddings, unlikely to be injection.
  • $\Delta$ near $1$: Significant semantic shift, likely indicative of injection.

This drift score is designed to capture both gross and subtle semantic modifications introduced by attacks.
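The drift score above is a one-line computation; the following is a minimal NumPy sketch (the helper name `drift_score` is illustrative, not from the paper):

```python
import numpy as np

def drift_score(e_b: np.ndarray, e_s: np.ndarray) -> float:
    """Embedding drift: Delta = 1 - cos_sim(e_b, e_s)."""
    cos_sim = float(e_b @ e_s / (np.linalg.norm(e_b) * np.linalg.norm(e_s)))
    return 1.0 - cos_sim

# Parallel embeddings -> Delta ~ 0 (no semantic drift).
assert abs(drift_score(np.array([3.0, 4.0]), np.array([6.0, 8.0]))) < 1e-9

# Opposite embeddings -> Delta = 2, the maximum of 1 - cos_sim.
assert drift_score(np.array([1.0, 0.0]), np.array([-1.0, 0.0])) == 2.0
```

Note that $1 - \mathrm{cos\_sim}$ ranges over $[0, 2]$; the interpretation "$\Delta$ near $1$" assumes embeddings with non-negative similarity, which is typical for sentence encoders.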

3. ZEDD Algorithm: Zero-Shot Detection Workflow

The ZEDD method comprises three operational phases: construction of prompt pairs, drift computation, and threshold-based flagging.

3.1 Construction of Paired Prompts

  • For each injected prompt (from corpora such as LLMail-Inject), generate a clean variant $x_b$ using a constrained LLM rewrite (e.g., via GPT-3.5-turbo) that eliminates adversarial content while retaining benign semantics.
  • Optionally, generate clean–clean pairs for calibration.

3.2 Zero-Shot Drift-Based Detection

The detection pipeline is as follows:

  1. For an incoming suspect prompt $x_s$, generate or retrieve its clean counterpart $x_b$.
  2. Compute $e_s = f(x_s)$ and $e_b = f(x_b)$.
  3. Calculate the drift score $\Delta = 1 - \mathrm{cos\_sim}(e_b, e_s)$.
  4. If $\Delta > \tau$ (calibrated threshold), flag as injected; otherwise, classify as benign.

3.3 Pseudocode

import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ZEDD(x_s, f, g, tau):
    x_b = g(x_s)                      # Generate clean variant
    e_s = f(x_s)                      # Suspect prompt embedding
    e_b = f(x_b)                      # Clean prompt embedding
    delta = 1 - cosine_sim(e_b, e_s)  # Drift score
    return 1 if delta > tau else 0    # 1 = injection detected, 0 = benign

Main hyperparameters are $\tau$ (drift threshold) and the choice of encoder $f$ and generator $g$.

4. Statistical Calibration and Threshold Estimation

Threshold selection relies on statistical modeling, not arbitrary parameterization.

  • Collect drift scores $\{\Delta_i\}$ on labeled held-out data comprising both clean–clean and injected–clean pairs.
  • Fit a two-component Gaussian Mixture Model (GMM) to the scores:

$p(\Delta) = w_\mathrm{clean}\,\mathcal{N}(\Delta;\mu_\mathrm{clean},\sigma_\mathrm{clean}^2) + w_\mathrm{inj}\,\mathcal{N}(\Delta;\mu_\mathrm{inj},\sigma_\mathrm{inj}^2)$

  • The optimal threshold $\tau$ is the value where:

$w_\mathrm{clean}\,\mathcal{N}(\tau;\mu_\mathrm{clean},\sigma_\mathrm{clean}^2) = w_\mathrm{inj}\,\mathcal{N}(\tau;\mu_\mathrm{inj},\sigma_\mathrm{inj}^2)$

  • If GMM fitting fails, apply kernel density estimation (KDE) and select $\tau$ at the lowest valley between modes.
  • To control false positives, select $\tau$ so that the clean-class false-positive rate is at most $\alpha$ (commonly $\alpha = 3\%$):

$P_{\Delta \sim G_\mathrm{clean}}(\Delta > \tau) \leq \alpha$

  • Compute 95% confidence intervals on detection rates:

$\mathrm{CI} = \hat{p} \pm z_{0.975}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{N}}$

where $N$ is the category-wise test set size.
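The calibration steps above can be sketched with scikit-learn's `GaussianMixture`; `calibrate_tau` and its grid search over the density crossing point are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def calibrate_tau(drift_scores: np.ndarray, alpha: float = 0.03) -> float:
    """Fit a 2-component GMM to held-out drift scores, take tau at the
    intersection of the weighted component densities, then tighten it
    so the clean-class FPR stays at or below alpha."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(drift_scores.reshape(-1, 1))
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.ravel())
    w = gmm.weights_
    clean, inj = np.argsort(means)  # lower-mean component = clean pairs

    # Grid-search the crossing point of the weighted densities
    # between the two component means.
    grid = np.linspace(means[clean], means[inj], 10_000)
    diff = np.abs(w[clean] * norm.pdf(grid, means[clean], stds[clean])
                  - w[inj] * norm.pdf(grid, means[inj], stds[inj]))
    tau = float(grid[np.argmin(diff)])

    # Enforce P(Delta > tau | clean) <= alpha.
    tau_fpr = float(norm.ppf(1 - alpha, means[clean], stds[clean]))
    return max(tau, tau_fpr)
```

On well-separated bimodal scores the crossing point dominates; the FPR constraint matters when the clean and injected modes overlap.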

5. Experimental Design, Datasets, and Key Results

5.1 LLMail-Inject Dataset

The evaluation relies on the LLMail-Inject dataset (derived from the Microsoft LLMail-Inject Challenge), which encompasses five prompt injection attack types: Jailbreak (J), System Leak (SL), Task Override (TO), Encoding Manipulation (EM), and Prompt Confusion (PC). The dataset construction includes deduplication, GPT-3.5-based English filtering and classification, and constrained clean-variant generation, resulting in 86,000 injected–clean and 86,000 clean–clean pairs. The held-out test set consists of 51,603 pairs (25,801 clean–clean and 25,802 injected–clean).

5.2 Embedding Models

ZEDD is evaluated on four encoders: SBERT, Llama 3 8B, Mistral 7B, and Qwen 2 7B.

5.3 Metrics

Performance is measured by overall accuracy, precision, recall (on adversarial class), F1, and clean FPR.

5.4 Results

| Encoder    | Accuracy | Precision | Recall (adv) | F1     | Clean FPR |
|------------|----------|-----------|--------------|--------|-----------|
| SBERT      | 90.75%   | 99.65%    | 81.78%       | 89.84% | 1.7%      |
| Llama 3 8B | 95.32%   | 95.85%    | 94.75%       | 95.30% | 5.5%      |
| Mistral 7B | 95.55%   | 96.58%    | 94.45%       | 95.50% | 2.3%      |
| Qwen 2 7B  | 95.46%   | 96.27%    | 94.52%       | 95.38% | 2.2%      |

Detection rates per attack type (percent of prompts flagged, selected models; the C column is clean–clean pairs, so its value is the false-positive rate):

| Model      | C    | EM    | J     | PC    | SL    | TO    |
|------------|------|-------|-------|-------|-------|-------|
| SBERT      | 1.7% | 95.9% | 86.2% | 90.5% | 91.6% | 86.7% |
| Llama 3 8B | 5.5% | 98.1% | 92.2% | 94.4% | 96.7% | 90.7% |
| Mistral 7B | 2.3% | 98.1% | 92.2% | 93.3% | 96.9% | 90.8% |
| Qwen 2 7B  | 2.2% | 98.2% | 90.8% | 94.2% | 96.8% | 90.3% |

Appendix B of the source paper demonstrates that ZEDD outperforms previous embedding-based and supervised classifiers in detection accuracy and operational efficiency (Sekar et al., 18 Jan 2026).

6. Computational Performance and LLM Pipeline Integration

6.1 Computational Complexity

  • Embedding extraction: $O(dL)$, with $L$ the prompt length (operations can be batched).
  • Cosine similarity: $O(d)$.
  • GMM/KDE scoring: negligible compared to embedding calculation.

Empirical results show prompt-level processing (embedding + drift) takes less than 50 ms on GPU per input.
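The batching claim above can be illustrated with a vectorized drift computation over $N$ prompt pairs; `batched_drift` is an assumed helper name, not part of the paper:

```python
import numpy as np

def batched_drift(E_b: np.ndarray, E_s: np.ndarray) -> np.ndarray:
    """Drift scores for N prompt pairs at once (rows are embeddings):
    one row-wise dot product and two norms per pair, O(N*d) overall."""
    num = np.einsum("ij,ij->i", E_b, E_s)          # row-wise e_b . e_s
    denom = np.linalg.norm(E_b, axis=1) * np.linalg.norm(E_s, axis=1)
    return 1.0 - num / denom
```

Since the cosine step is cheap relative to embedding extraction, the end-to-end latency is dominated by the encoder forward pass.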

6.2 Pipeline Integration

ZEDD is slotted as a pre-processing guard before the LLM inference stage, introducing no changes to the LLM weights or runtime. The workflow is:

user input → [ZEDD check] → if pass → LLM inference; if fail → block or alert

This design allows efficient deployment on top of existing LLM applications with only an extra embedding API call and threshold comparison.
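A minimal sketch of this guard placement follows; `embed`, `clean_rewrite`, and `llm` are hypothetical callables standing in for the embedding API, the constrained rewrite model $g$, and the downstream LLM:

```python
from typing import Callable
import numpy as np

def guarded_llm_call(x_s: str,
                     embed: Callable[[str], np.ndarray],
                     clean_rewrite: Callable[[str], str],
                     llm: Callable[[str], str],
                     tau: float) -> str:
    """Pre-processing guard: run the ZEDD drift check, then either
    forward the prompt to the LLM or block it. LLM weights and
    runtime are untouched."""
    e_s, e_b = embed(x_s), embed(clean_rewrite(x_s))
    delta = 1.0 - float(e_s @ e_b / (np.linalg.norm(e_s) * np.linalg.norm(e_b)))
    if delta > tau:
        return "[blocked: possible prompt injection]"
    return llm(x_s)
```

In production the block branch could instead raise an alert or route the prompt to human review; the guard composes with any LLM client exposing a text-in/text-out call.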

7. Limitations and Potential Research Extensions

7.1 Observed Limitations

  • Minimal-semantic-change attacks (e.g., word-order shuffling, paraphrasing) may evade detection due to limited embedding shift.
  • Embedding models with insufficient domain sensitivity may reduce drift signal discriminativeness.
  • Adaptive attackers may iteratively optimize prompts to stay below the drift threshold $\tau$.

7.2 Future Directions

  • Ensemble approaches: Combine multiple independent embeddings $f_1,\ldots,f_k$ with averaged or composite drift scores.
  • Adaptive thresholds: Online tuning of threshold $\tau_i$ by prompt category or user profile.
  • Enhanced features: Augment drift signal with auxiliary lexical or syntactic metrics within a lightweight classifier.
  • Few-shot calibration: Adapt thresholds or embedding models to address domain or distributional drift in production.

These extensions could enhance ZEDD’s generality and robustness to new adversarial strategies (Sekar et al., 18 Jan 2026).
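As one possible instantiation of the ensemble direction, the averaged-score variant can be sketched as follows (`ensemble_drift` and the `encoders` argument are illustrative assumptions):

```python
import numpy as np

def ensemble_drift(x_s: str, x_b: str, encoders) -> float:
    """Average the drift score over several independent encoders f_1..f_k,
    so an evasive prompt must keep drift small in every embedding space."""
    drifts = []
    for f in encoders:
        e_s, e_b = f(x_s), f(x_b)
        drifts.append(1.0 - float(e_s @ e_b /
                                  (np.linalg.norm(e_s) * np.linalg.norm(e_b))))
    return float(np.mean(drifts))
```

Averaging is the simplest composite; a max over encoders would be stricter, at the cost of a higher clean false-positive rate.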
