Adversarial Illusions in Multi-Modal Embeddings
- Adversarial illusions are imperceptible perturbations that mislead multi-modal embedding spaces by aligning a source input with an attacker-chosen target.
- Techniques like projected gradient descent, universal perturbations, and flow-based attacks achieve near-perfect misclassification and hallucination rates across model families.
- Existing defenses such as adversarial training and anomaly detection only partially mitigate the threat, highlighting the need for robust cross-modal calibration.
Adversarial illusions in multi-modal embeddings are targeted perturbations that disrupt cross-modal semantic alignment by manipulating input data such that its embedding matches an attacker-chosen target from a different modality, all while remaining imperceptible to human observers. These attacks exploit the shared embedding spaces underlying multi-modal AI models (image–text, audio–text, video–text, etc.), undermining model reliability across classification, retrieval, captioning, generative tasks, and embodied decision-making. Current approaches demonstrate that adversarial illusions present a fundamental threat to both the semantic integrity and practical deployment safety of foundation models, with cross-modal transferability and resilience to conventional defenses posing major open challenges (Zhang et al., 2023, Salman et al., 2024, Dou et al., 2024, Liao et al., 17 May 2025, Lu, 17 Sep 2025, Hoscilowicz et al., 25 Nov 2025, Kim et al., 2024, Vu et al., 29 Jan 2025, Kumar et al., 23 Oct 2025, Islam et al., 11 Feb 2025, Akbarian et al., 26 Nov 2025, Chang et al., 31 Jan 2025, Shayegani et al., 1 Apr 2025, Noever et al., 2021) [FMM-Attack: (Li et al., 2024)].
1. Formal Definition and Mechanisms
Multi-modal encoders map inputs from heterogeneous modalities (e.g. images $x$, text $t$, audio $a$) to a shared embedding space via modality-specific encoders $E_{\text{img}}$, $E_{\text{txt}}$, etc. Adversarial illusions are constructed by perturbing a source input $x$ with a perturbation $\delta$ ($\|\delta\| \le \epsilon$) so that its embedding closely matches the embedding of a chosen target $y$ in another modality:

$$\delta^* = \arg\min_{\|\delta\| \le \epsilon} \; d\big(E_{\text{img}}(x + \delta),\, E_{\text{txt}}(y)\big),$$

where $d$ is typically a cosine or Euclidean distance loss. This formulation generalizes to cross-modal tasks, task-agnostic transfer, and black-box scenarios via API queries or surrogate alignment (Zhang et al., 2023, Salman et al., 2024). Most approaches use iterative projected gradient descent or specialized optimizers adapted to the structure of the embedding space, enforcing norm constraints for imperceptibility (e.g. $\ell_\infty$ balls of radius up to $32/255$ for images, or small magnitudes for audio) and confirming embedding proximity (cosine similarity above 0.995 and SSIM above 0.98 are commonly reported). Novel variants include flow-based attacks for video LLMs, universal (image-agnostic) perturbations, and mutual-modality optimization integrating both visual and textual channels (Hoscilowicz et al., 25 Nov 2025, Kim et al., 2024, Ye et al., 2023) [FMM-Attack: (Li et al., 2024)].
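The alignment objective above can be sketched with a toy projected-gradient ascent loop. This is a minimal illustration, not any paper's implementation: the linear encoders, dimensions, step sizes, and numerical gradients are all hypothetical stand-ins for real deep encoders and backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for modality-specific encoders: random linear maps
# into a shared 32-dim embedding space (real systems use deep networks).
D_IMG, D_TXT, D_EMB = 64, 48, 32
W_img = rng.normal(size=(D_EMB, D_IMG))
W_txt = rng.normal(size=(D_EMB, D_TXT))

def embed_img(x):
    return W_img @ x

def embed_txt(y):
    return W_txt @ y

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pgd_illusion(x, target_emb, eps=0.1, alpha=0.02, steps=200):
    """Perturb x within an l-infinity ball of radius eps so that
    embed_img(x + delta) aligns with target_emb (cosine ascent)."""
    delta = np.zeros_like(x)
    h = 1e-4
    for _ in range(steps):
        # Numerical gradient of the cosine alignment w.r.t. delta; a real
        # attack would backpropagate through the encoder instead.
        base = cos(embed_img(x + delta), target_emb)
        grad = np.zeros_like(x)
        for i in range(len(x)):
            d2 = delta.copy()
            d2[i] += h
            grad[i] = (cos(embed_img(x + d2), target_emb) - base) / h
        # Signed ascent step followed by projection back into the eps-ball.
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return delta

x = rng.normal(size=D_IMG)        # benign "image"
y = rng.normal(size=D_TXT)        # attacker-chosen "text" target
target = embed_txt(y)
before = cos(embed_img(x), target)
delta = pgd_illusion(x, target)
after = cos(embed_img(x + delta), target)
```

Even this crude sketch raises the cross-modal cosine similarity while keeping every coordinate of the perturbation bounded by `eps`; attacks on real deep encoders reach near-perfect alignment at much smaller budgets.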
2. Taxonomy of Adversarial Illusions and Exemplary Attacks
The following major families are documented:
- Gradient-based Single-Modal Alignment: PGD on pixel values to match a target text embedding with visually imperceptible image changes; 100% success is reported on CLIP and ImageBind (Salman et al., 2024).
- CrossFire Attack: Transforms the target input into the source modality (e.g. text → image/audio), then adversarially aligns embeddings using PGD in normalized space, substantially improving cross-modal hijacking (Dou et al., 2024).
- Universal Perturbations: A single perturbation crafted to induce misalignment across a large set of images and/or prompts by targeting internal attention value vectors, yielding near-total model collapse on VQA, classification, and captioning (Kim et al., 2024).
- Topological Signature Attacks: Characterize and detect discrepancy in the topological structure (persistent homology) between clean and adversarial embeddings to quantify alignment distortion (Vu et al., 29 Jan 2025).
- Role-Modality Structural Attacks (RMA): Manipulate the prompt structure (role tokens and image token placement) to elicit harmful responses from multimodal LLMs. Targeted adversarial training eliminates the vulnerability (Shayegani et al., 1 Apr 2025).
- Multimodal Hallucination Inductions: Directly optimize high-dimensional image embeddings to match arbitrary target content, causing visually undetectable misclassifications and open-ended hallucinations even under robust evaluation (Islam et al., 11 Feb 2025).
- Confusion & Jailbreak Attacks: Maximize next-token entropy or manipulate perceptually simple features (masking, echo, decomposition) to bypass safety filters in multimodal LLMs (Hoscilowicz et al., 25 Nov 2025, Kumar et al., 23 Oct 2025).
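As a toy illustration of the universal-perturbation family above, a single input-agnostic perturbation can be optimized to pull an entire batch of embeddings toward one attacker-chosen target. The linear encoder, dimensions, and numerical gradients below are hypothetical simplifications; the actual attack of Kim et al. targets attention value vectors inside deep models.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear encoder into a shared embedding space.
D_IN, D_EMB = 32, 16
W = rng.normal(size=(D_EMB, D_IN))

def embed(x):
    return W @ x

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def universal_perturbation(X, target_emb, eps=0.2, alpha=0.05, steps=100):
    """Optimize ONE delta (shared across all rows of X) that pulls every
    input's embedding toward target_emb, projected into an eps-ball."""
    delta = np.zeros(X.shape[1])
    h = 1e-4
    for _ in range(steps):
        # Numerical gradient of the batch-averaged cosine alignment.
        base = np.mean([cos(embed(x + delta), target_emb) for x in X])
        grad = np.zeros_like(delta)
        for i in range(len(delta)):
            d2 = delta.copy()
            d2[i] += h
            shifted = np.mean([cos(embed(x + d2), target_emb) for x in X])
            grad[i] = (shifted - base) / h
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
    return delta

X = rng.normal(size=(8, D_IN))            # a batch of benign inputs
target = embed(rng.normal(size=D_IN))     # attacker-chosen target embedding
delta = universal_perturbation(X, target)
before = np.mean([cos(embed(x), target) for x in X])
after = np.mean([cos(embed(x + delta), target) for x in X])
```

The key property, even in this sketch, is image-agnosticism: the same `delta` shifts the whole batch toward the target, which is what makes universal perturbations so damaging at deployment time.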
These techniques demonstrate universality across models (CLIP, BLIP-2, DeepSeek, Janus, ImageBind, OpenVLA, etc.), cross-modality and cross-domain transfer, and resilience to typical input preprocessing (JPEG, blur, geometric transforms) (Salman et al., 2024, Dou et al., 2024, Liao et al., 17 May 2025).
3. Quantitative Findings and Impact
Recent studies report near-perfect attack success rates (ASR):
- ImageNet classification: Clean accuracy of 70% drops to 5% under norm-bounded perturbations; targeted illusions achieve 100% Top-1 misclassification on adversarial inputs (Zhang et al., 2023, Salman et al., 2024, Liao et al., 17 May 2025).
- Zero-shot transfer: Illusions generated on OpenCLIP surrogates retain 90–100% effectiveness on commercial systems (Titan, etc.) and alternate models (Zhang et al., 2023, Dou et al., 2024).
- Hallucination rates: DeepSeek Janus models exhibit up to 99.5% hallucination under closed-form and 98% under open-ended evaluation, while maintaining SSIM of 0.88 (Islam et al., 11 Feb 2025).
- Video-based LLMs: The flow-based FMM-Attack perturbs only a small fraction of frames yet induces incorrect and garbled model outputs [FMM-Attack: (Li et al., 2024)].
- Safety/jailbreak: Simple perceptual transformation attacks cause up to 89% ASR in sensitive categories, even in models showing a 0% failure rate in text-only settings (Kumar et al., 23 Oct 2025).
Defenses such as adversarial training, topological-statistical anomaly detection, consensus-based generative mitigation (e.g. VAE sampling), dedicated projection heads, and prompt updating are only partially effective; sophisticated attacks bypass most conventional image/audio transformation-based defenses (Akbarian et al., 26 Nov 2025, Vu et al., 29 Jan 2025, Liao et al., 17 May 2025, Ye et al., 2023).
4. Detection, Defense, and Mitigation Strategies
Documented methods for detection and mitigation include:
- Topological-Contrastive Measures: Employ persistent homology (Vietoris–Rips filtrations, total persistence, multi-scale kernels), integrated into Maximum Mean Discrepancy tests for robust detection of adversarial samples (Vu et al., 29 Jan 2025).
- Generative Consensus-Based Mitigation: Pass input through pre-trained VAEs or diffusion models, perform stochastic sampling and aggregate predictions via majority vote, reducing attack success rates to near-zero while improving clean accuracy (Akbarian et al., 26 Nov 2025).
- Cross-Modal Calibration: Attach modality-specific MLP heads to the frozen encoder; train with adversarially calibrated objectives (cross-entropy, L2, InfoNCE) to restore embedding integrity (Liao et al., 17 May 2025).
- Adversarial Training: Incorporate clean/adversarial pairs, structural input variants (RMA), and cross-modal similarity distribution alignment (RLBind), producing dramatic robustness gains without degrading zero-shot or downstream accuracy (Lu, 17 Sep 2025, Shayegani et al., 1 Apr 2025).
- Multi-Prompt Hallucination Detection: Leverage LLM-based agents with chains of baseline, source-specific, and target-specific prompts to aggregate semantic outcome evidence (Islam et al., 11 Feb 2025, Chang et al., 31 Jan 2025).
- Input Sanitization and Preprocessing: JPEG compression, geometric augmentation, perceptual canonicalization, and feature squeezing—generally effective only against naïve (not defense-aware) attackers (Zhang et al., 2023, Kumar et al., 23 Oct 2025).
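The consensus-based mitigation idea above can be sketched with plain Gaussian resampling standing in for VAE/diffusion sampling. The nearest-centroid classifier and all parameters here are hypothetical simplifications of the approach described by Akbarian et al., chosen only to make the voting mechanism concrete.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# Toy nearest-centroid classifier over a 16-dim embedding space; the class
# prototypes below are hypothetical stand-ins for learned class embeddings.
D = 16
centroids = rng.normal(size=(3, D))

def classify(x):
    """Predict the class whose centroid has the highest cosine similarity."""
    sims = centroids @ x / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x))
    return int(np.argmax(sims))

def consensus_classify(x, n_samples=25, sigma=0.3):
    """Stochastically resample the input (Gaussian noise here; VAE or
    diffusion sampling in the actual defense) and majority-vote the
    predictions, washing out small adversarial perturbations."""
    votes = [classify(x + sigma * rng.normal(size=x.shape))
             for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]

x = centroids[0] + 0.1 * rng.normal(size=D)   # benign input near class 0
single = classify(x)
voted = consensus_classify(x)
```

The design intuition: an adversarial perturbation is tuned to one exact input, so averaging predictions over many stochastic re-renderings of that input tends to restore the consensus (benign) label while leaving clean accuracy intact.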
Observational analysis suggests future advances will depend on embedding-space anomaly detection, randomized smoothing, and cross-modal contrastive training, with certified robustness and mechanism-level regularization as open research directions.
5. Broader Implications and Open Problems
Adversarial illusions reveal that multi-modal models’ core strengths—generalization, zero-shot learning, emergent cross-modal alignment—are intimately linked to their vulnerabilities. Attacks succeed even in black-box threat models, highlighting risks for real-world deployment in web agents, robotic perception, and safety-critical applications. Illusions cause not only targeted misclassification but also global embedding collapse (doubling entropy, triggering action failures, semantic leakage), and can bypass specialized safety and refusal detectors (Yan et al., 20 Nov 2025, Hoscilowicz et al., 25 Nov 2025, Kumar et al., 23 Oct 2025).
Key open challenges include:
- Designing embedding architectures with provable resistance to norm-bounded adversarial perturbations without forfeiting semantic transfer.
- Developing scalable and transferable defense mechanisms that do not require retraining heavy encoders or reliance on a single input structure (role, modality placement).
- Certifying robustness against cross-modal, task-agnostic attacks in high-dimensional embedding spaces where benign variation and adversarial drift are entangled.
- Unifying semantic-level reasoning and anomaly detection to ensure model outputs reflect genuine cross-modal semantic association (Vu et al., 29 Jan 2025, Salman et al., 2024, Liao et al., 17 May 2025).
6. Historical Context and Evolution
Adversarial illusions were first demonstrated in diagnostic studies of CLIP and similar models, which highlighted the “reading isn't believing” phenomenon and documented typographical, conceptual, and iconographic attacks exploiting unbalanced modality weighting (Noever et al., 2021). The field has since advanced to encompass structure-based attacks, cross-modal alignment mismatches, and sophisticated black-box transfer via surrogate models and prompt manipulation. The broad consensus is that input-agnostic, cross-modal, and structurally enabled illusions pose the greatest challenge for next-generation robust multi-modal alignment (Zhang et al., 2023, Salman et al., 2024, Shayegani et al., 1 Apr 2025).
7. Future Directions and Research Gaps
Promising directions include:
- Mechanistic studies of attention and value-vector sensitivity to universal perturbations (Kim et al., 2024).
- Formulation of joint topological-contrastive and statistical regularization objectives to anchor clean–adversarial geometry and topology (Vu et al., 29 Jan 2025, Lu, 17 Sep 2025).
- Exploration of generative and multimodal chain-of-thought agents for semantic disillusion and defense (Chang et al., 31 Jan 2025, Akbarian et al., 26 Nov 2025).
- Expanding certified defense and anomaly detection to high-dimensional, multi-modal settings while preserving task-agnostic flexibility (Liao et al., 17 May 2025).
- Systematic evaluation of consensus-based, multi-prompt, and structure-aware defenses against compound, cross-role, and cross-modality adversarial illusions (Shayegani et al., 1 Apr 2025, Islam et al., 11 Feb 2025).
The ongoing evolution of adversarial illusion research will determine the practical safety and reliability of multi-modal foundation models across scientific, industrial, and societal applications.