Cross-Modal Adversarial Alignment
- Cross-modal adversarial alignment is a framework that uses adversarial optimization to align heterogeneous feature representations across multiple modalities.
- The approach employs a two-stage process combining adversarial fine-tuning and invariant alignment, yielding robust accuracy improvements of up to 47 percentage points.
- Adversarial attack paradigms craft imperceptible perturbations to force embedding convergence or divergence, challenging current multimodal defenses.
Cross-modal adversarial alignment refers to the adversarial learning or attack schemes designed to optimally align or misalign feature representations across heterogeneous modalities (e.g., image, text, audio, video) under threat models or for robust cross-modal embedding. The paradigm spans both defensive strategies—preserving alignment against adversarial perturbations—and attack frameworks—manipulating inputs to force cross-modal embedding convergence or divergence and disrupt semantic integrity in multimodal models.
1. Formal Definition and Core Challenges
Cross-modal adversarial alignment denotes approaches where adversarial optimization is used to either force the embeddings of heterogeneous inputs (such as images and text) to align in a shared space, or to actively seek modality-invariant representations via adversarial training (Huang et al., 2017, Wang et al., 2019, Lu, 17 Sep 2025). In attack regimes, it refers to methods for crafting perturbations δ such that, after passing through a shared encoder f(·), the perturbed sample’s embedding matches that of a target from another modality (Dou et al., 2024, Salman et al., 2024, Zhou et al., 2023, Zhu et al., 28 Oct 2025).
Challenges include:
- Modality gap: Distinct signal statistics and intrinsic feature-space geometry make direct alignment nontrivial.
- Stealth vs. efficacy tradeoff: Perturbations must remain imperceptible (norm-bounded) yet still induce strong semantic drift in the joint embedding space.
- Optimization in high dimensions: Non-convex losses over deep multimodal encoders (Dou et al., 2024).
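As a toy illustration of the modality-gap challenge (purely synthetic embeddings; the centroid-distance proxy used here is one common convention, not a measurement from the cited works):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(Z):
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Synthetic paired embeddings: two modalities mapped into a shared 64-d
# space but clustered around different directions (the "cone" effect)
img = normalize(rng.standard_normal((100, 64)) + 2.0)
txt = normalize(rng.standard_normal((100, 64)) - 2.0)

# A common proxy for the modality gap: the distance between the two
# modality centroids on the unit sphere (0 = perfectly mixed, 2 = maximal)
gap = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))
```

Alignment methods in the following sections can be read as attempts to shrink this gap (defenses) or to exploit it (attacks).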
2. Adversarial Training and Modality-Invariant Embeddings
Defensive adversarial-invariant alignment formalizes objectives that minimize the discrepancy between adversarially perturbed and clean embeddings across domains and modalities. RLBind (Lu, 17 Sep 2025) introduces a two-stage procedure:
- Stage 1: Unsupervised adversarial fine-tuning (FARE). Formally, for frozen text encoder ψ and trainable visual/other encoders φₘ, each φₘ is optimized so that embeddings of worst-case perturbed inputs stay close to the clean embeddings of the original, frozen encoder:

  $$\min_{\varphi_m}\; \mathbb{E}_{x}\, \max_{\|\delta\|_\infty \le \varepsilon} \big\|\varphi_m(x+\delta) - \varphi_m^{\mathrm{orig}}(x)\big\|_2^2$$
- Stage 2: Adversarial-invariant cross-modal alignment, simultaneously optimizing:
- Clean and adversarial cross-entropy,
- Point-wise and distribution-wise cross-modal alignment, pulling each adversarial embedding back toward its clean counterpart and matching the two embeddings' similarity distributions over the frozen text anchors.
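A minimal sketch of how such Stage-2 terms might compose, assuming CLIP-style cosine logits against frozen class text anchors; the function names and weights `w_pt`, `w_dist` are hypothetical, not RLBind's actual implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def kl_div(p, q):
    return np.mean(np.sum(p * np.log((p + 1e-12) / (q + 1e-12)), axis=-1))

def stage2_loss(z_clean, z_adv, text_anchors, labels, w_pt=1.0, w_dist=1.0):
    # CLIP-style logits: cosine similarity to frozen class text anchors
    zc = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    za = z_adv / np.linalg.norm(z_adv, axis=1, keepdims=True)
    t = text_anchors / np.linalg.norm(text_anchors, axis=1, keepdims=True)
    logits_c, logits_a = zc @ t.T, za @ t.T
    # clean + adversarial cross-entropy
    ce = cross_entropy(logits_c, labels) + cross_entropy(logits_a, labels)
    # point-wise alignment: adversarial embedding stays near clean embedding
    l_point = np.mean(np.sum((za - zc) ** 2, axis=1))
    # distribution-wise alignment: similarity distributions should match
    l_dist = kl_div(softmax(logits_c), softmax(logits_a))
    return ce + w_pt * l_point + w_dist * l_dist
```

When the adversarial and clean embeddings coincide, both alignment terms vanish and only the two cross-entropy terms remain.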
RLBind demonstrates up to +47 percentage points in robust accuracy over base LanguageBind on ImageNet-1K, with parallel gains for Audio, Video, and Thermal modalities.
Gradient-reversal strategies likewise promote modality-invariant cross-modal codes (Huang et al., 2017, Jung et al., 22 May 2025), combining semantic classifiers with domain discriminators in a min–max setting so that learned embeddings remain informative for tasks but non-discriminative for modality.
3. Adversarial Attack Paradigms for Cross-Modal Alignment
Adversarial attacks exploit cross-modal alignment by constructing input perturbations whose embeddings match attacker-chosen targets from other modalities. PGD-based gradient descent is routinely employed:
- Direct feature alignment: For input x and target t, craft a norm-bounded δ so that the perturbed embedding matches the target's:

  $$\min_{\|\delta\|_p \le \varepsilon}\; \big\|f(x+\delta) - f(t)\big\|_2^2$$

- Angular deviation minimization: The attacker instead seeks

  $$\min_{\|\delta\|_p \le \varepsilon}\; \arccos\!\left(\frac{f(x+\delta)^{\top} f(t)}{\|f(x+\delta)\|_2\,\|f(t)\|_2}\right),$$

  equivalent to maximizing cosine similarity between normalized embeddings (Dou et al., 2024).
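A minimal PGD sketch of the direct feature-alignment attack, using a toy linear encoder so the gradient is available in closed form; all names, dimensions, and hyperparameters here are illustrative, not any paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared encoder: a fixed linear map standing in for f(.)
W = rng.standard_normal((16, 32))
f = lambda x: W @ x

x = rng.standard_normal(32)            # benign source input
t_emb = f(rng.standard_normal(32))     # attacker-chosen target embedding
eps, alpha, steps = 0.5, 0.05, 100     # L_inf budget, step size, iterations

def align_loss(d):
    r = f(x + d) - t_emb
    return float(r @ r)                # ||f(x+delta) - f(t)||^2

delta = np.zeros_like(x)
best_delta, best_loss = delta.copy(), align_loss(delta)
for _ in range(steps):
    r = f(x + delta) - t_emb
    grad = 2 * W.T @ r                 # exact gradient w.r.t. delta
    # signed-gradient step, then projection back into the L_inf ball
    delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)
    if align_loss(delta) < best_loss:
        best_delta, best_loss = delta.copy(), align_loss(delta)

before = np.linalg.norm(f(x) - t_emb)
after = np.linalg.norm(f(x + best_delta) - t_emb)
```

With a real multimodal encoder the gradient would come from backpropagation rather than the analytic form above, but the projected sign-step loop is the same.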
Approaches such as CrossFire (Dou et al., 2024) transform target inputs t to the same signal domain as source v (text-to-image via diffusion), then optimize δ. AdvCLIP (Zhou et al., 2023) further disrupts higher-order topology of the joint image/text feature graph via a GAN-based universal patch. These achieve high attack success rates (0.93–0.99 ASR for CrossFire) and remain effective (>0.85 ASR) even under known defenses.
4. Cross-Modal Adversarial Alignment in Deep Metric Learning
In open-vocabulary keyword spotting and similar metric-learning tasks, adversarial modality classifiers trained with gradient reversal promote uniformity of text and audio representations in a shared space (Jung et al., 22 May 2025). Metric-learning objectives at both the phoneme and utterance level are optimized jointly with a modality-invariant adversarial loss.
Ensembles of proxy losses and class-level aggregation are combined with adaptive scaling and domain classifiers. Empirical gains include AP ↑14.8% over non-adversarial baselines on WSJ.
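The gradient-reversal mechanism underlying these modality classifiers is simple enough to sketch directly; a minimal numpy illustration (the class name and `lam` parameter are illustrative, not from the cited papers):

```python
import numpy as np

class GradReverse:
    """Identity in the forward pass; negates and scales the gradient in
    the backward pass. This implements the min-max trick used for
    modality-adversarial training: the modality discriminator is trained
    normally, while the shared encoder receives the *reversed* gradient
    and so learns embeddings the discriminator cannot separate."""
    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength, often annealed during training

    def forward(self, x):
        return x  # embeddings pass through unchanged

    def backward(self, grad_out):
        return -self.lam * grad_out  # flip the discriminator's gradient

# Gradient flowing from the modality classifier back toward the encoder:
grl = GradReverse(lam=0.5)
g = np.array([0.2, -1.0, 3.0])
reversed_g = grl.backward(g)  # encoder update direction is negated
```

In an autograd framework this would be registered as a custom backward function inserted between the shared encoder and the modality discriminator.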
5. Restoration and Robustification Mechanisms
Defensive mechanisms center on restoring cross-modal alignment after adversarial drift. COLA (Zhu et al., 28 Oct 2025) projects attacked image embeddings onto the span of class text features, then applies entropic optimal transport to enforce global and local alignment of augmented image views and text variants:
- Projection: Discards non-semantic distortions,
- Optimal transport: Regularizes local-neighborhood structure.
This training-free pipeline achieves +48.9 percentage points robustness gain under PGD attacks across multiple benchmarks, with near-zero impact on clean accuracy.
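The two ingredients this pipeline combines, subspace projection and entropic optimal transport, can be sketched in a few lines of numpy; function names and toy dimensions below are hypothetical, not COLA's actual code:

```python
import numpy as np

def project_onto_span(z, T):
    """Project embedding z onto the span of class text features T (k x d),
    discarding components orthogonal to the class-semantic subspace."""
    A = T.T  # d x k: columns span the text-feature subspace
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)  # least-squares coefficients
    return A @ coef

def sinkhorn(C, a, b, reg=0.1, iters=200):
    """Entropic optimal transport plan between histograms a, b for cost C,
    via Sinkhorn iterations (alternating marginal scaling)."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```

The projection step removes non-semantic adversarial distortion, and the transport plan then couples augmented image views with text variants under uniform marginals.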
Text-guided image inpainting leverages cross-modal alignment distillation to transfer fine-grained vision–language model correlations into GAN-based image restoration architectures (Zhou et al., 2023). Cross-modal alignment maps and in-sample distribution matching are critical for preserving semantically faithful inpainted regions, with adversarial critics at both global and local scales.
6. Safety Alignment and Adversarial Unlearning
Safety alignment in vision–language models exploits cross-modal adversarial unlearning: textual unlearning propagates robust safety responses from the language component to the entire multimodal stack (Chakraborty et al., 2024). Training only the LLM weights σ, with a harmful-generation penalty and a safe-reply matching objective, reduces attack success rates to <8% (sometimes <2%) for both text-only and multimodal threats, outperforming multimodal SFT/unlearning baselines at 1/6 the compute cost. Since all modalities fuse into the language head, text-only safety coverage is sufficient for universal cross-modal defense.
7. Evaluation Protocols and Empirical Findings
Experimental setups span retrieval, classification, and generative tasks, employing datasets such as ImageNet, MSCOCO, AudioCaps, Recipe1M, and PKU-SafeRLHF. Key quantitative results include:
| Method | Domain(s) | ASR/Robust Acc. | Key Findings |
|---|---|---|---|
| RLBind (Lu, 17 Sep 2025) | Img/Aud/Therm/Video | ↑47pp robust | Dot-product + L₂ best trade-off; fully finetuned > LoRA |
| CrossFire (Dou et al., 2024) | Img/Aud | 0.93–0.99 ASR | Modality transformation + norm yields state-of-the-art attacks |
| AdvCLIP (Zhou et al., 2023) | Img/Text | 0.80/0.62 ASR | Topology-deviation GAN achieves transferability, outpaces defenses |
| COLA (Zhu et al., 28 Oct 2025) | Img/Text | ↑48.9pp robust | Projection + OT recovers alignment post-attack |
| ACME (Wang et al., 2019) | Img/Text | MedR=1.0, R@1>50% | Distribution-alignment + cross-modal translation consistency |
| MAL (Jung et al., 22 May 2025) | Aud/Text | AP ↑14.8% | Plug-and-play adversarial learning improves open-vocab KWS |
| Text Unlearning (Chakraborty et al., 2024) | VLMs | ASR <8% | Text-only safety propagates to all modalities |
Defense strategies such as JPEG compression, input corruption, parameter pruning, adversarial retraining, and subspace projection variously succeed or fail depending on degree of embedding regularization and attack adaptation. Most current defenses do not sufficiently counter cross-modal alignment attacks.
8. Limitations, Open Problems, and Future Directions
Limitations include difficulty in defending against multi-element semantic targets (ASR drops as complexity grows), persistence of adversarial transferability across black-box encoders, and lack of exhaustive multi-modal “harmful” datasets for safety alignment (Dou et al., 2024, Chakraborty et al., 2024). Unsupervised segmentation bottlenecks remain for zero-resource speech/text alignment (Chung et al., 2018).
Future directions entail curriculum-augmented adversarial training, Wasserstein-GAN domain discriminators, explicit detection protocols via adversarial sensitivity tests, more comprehensive alignment-sensitive inductive biases, expanded multilingual/multimodal settings, and incremental pipelines for dynamic threat absorption (Lu, 17 Sep 2025, Zhu et al., 28 Oct 2025, Jung et al., 22 May 2025, Chakraborty et al., 2024). Novel defense design explicitly targeting embedding-level cross-modal alignment is an active research area.