Co-Attack: Collaborative Multimodal Adversarial Attack
- The paper introduces Co-Attack, a method that jointly optimizes adversarial perturbations in both image and text modalities to maximally disrupt vision-language models.
- Empirical evaluations show significant performance drops in tasks like image-text retrieval and visual entailment, confirming the attack's robustness and effectiveness.
- The collaborative optimization framework underscores cross-modal coupling, prompting the need for multimodal adversarial training and robust fusion defenses.
Collaborative Multimodal Adversarial Attack (Co-Attack) refers to a class of adversarial attack strategies targeting multimodal deep learning models, specifically vision-language pre-training (VLP) models that jointly process image and text inputs. Co-Attack leverages coordinated perturbations in both the visual and textual modalities to maximally disrupt the fused or aligned representations learned within these models, thereby severely degrading their performance on diverse downstream tasks such as image-text retrieval, visual entailment, visual grounding, and person re-identification.
1. Formal Threat Model and Objectives
Co-Attack is designed for models $f(x, t)$, where $x$ is an image, $t$ is a sequence of word tokens, and $f(x, t)$ is a prediction or joint embedding. The typical white-box threat model grants the attacker full access to model weights and gradients for crafting adversarial examples, though extensions exist for improving black-box transferability. The attacker seeks small, often imperceptible, perturbations:
- Image: $x' = x + \delta$, with $\|\delta\|_\infty \le \epsilon$ for a small budget $\epsilon$
- Text: $t'$, with $d(t, t') \le B$ modified tokens (typically a single-token change), with the goal of maximizing a task-specific adversarial loss enforcing misalignment or misclassification (Zhang et al., 2022).
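Taken together, the threat model can be summarized as one constrained maximization (a schematic restatement of the above, with $\mathcal{L}_{\mathrm{adv}}$ denoting the task-specific adversarial loss):

$$
\max_{\delta,\; t'} \; \mathcal{L}_{\mathrm{adv}}\big(f(x + \delta,\, t')\big)
\quad \text{s.t.} \quad \|\delta\|_\infty \le \epsilon, \qquad d(t, t') \le B,
$$

where $d(t, t')$ counts the number of substituted tokens.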
2. Mathematical Formulation and Attack Pipeline
The technical hallmark of Co-Attack is collaborative optimization, explicitly coupling perturbations on both modalities so as to maximize the net adversarial effect on the model’s multimodal representation.
Fused VLPs (e.g., ALBEF, TCL)
For models with explicit fusion, the attack proceeds in two phases:
- Text Attack: Find $t'$ that maximizes the joint embedding deviation $\|e_m(x, t') - e_m(x, t)\|_2$, typically via a discrete search such as BERT-Attack, where $e_m$ denotes the model's multimodal (fused) embedding.
- Image Attack (Collaborative): Find $\delta$ maximizing $\|e_m(x + \delta, t') - e_m(x, t')\|_2 + \alpha\,\|e_m(x + \delta, t') - e_m(x, t)\|_2$, subject to $\|\delta\|_\infty \le \epsilon$.
The collaborative term (weighted by $\alpha$) enforces concerted movement in the joint embedding space: the image perturbation is optimized against the already-perturbed text rather than in isolation.
Aligned VLPs (e.g., CLIP)
For aligned models, $t'$ is found by maximizing the deviation $\|e_t(t') - e_t(t)\|_2$ in the text embedding. The image attack then optimizes $\|e_i(x + \delta) - e_t(t)\|_2 + \alpha\,\|e_i(x + \delta) - e_t(t')\|_2$ over $\|\delta\|_\infty \le \epsilon$. Here, the second term, which pushes the image embedding away from the perturbed rather than the original text embedding, prevents the two modalities' perturbations from cancelling each other.
Generalized Algorithm (Pseudocode):
```
Inputs: image–text pair (x, t), budgets ε_img and B, joint embedding e,
        step size η, PGD steps K, collaboration weight α

# 1. Text attack (discrete search)
t_adv = BERT-Attack(t, maximize ||e(x, t') − e(x, t)||)

# 2. Image attack (projected gradient ascent)
δ = 0; x_adv = x
for k in 1…K:
    g = ∇_{x_adv} [ ||e(x_adv, t_adv) − e(x, t_adv)||_2
                    + α · ||e(x_adv, t_adv) − e(x, t)||_2 ]
    δ = clip(δ + η · sign(g), −ε_img, +ε_img)
    x_adv = x + δ

# 3. Output
return (x_adv, t_adv)
```
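The image-attack loop can be made concrete. Below is a minimal, self-contained sketch in which a random linear map stands in for the joint embedding $e(x, t)$; the model, its analytic gradient, and all shapes are toy assumptions for illustration, not a real VLP:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a joint embedding: e(x, t) = W @ x + v[t].
# W, v, and all shapes are illustrative assumptions, not a real VLP.
W = rng.normal(size=(8, 16))
v = {"clean": rng.normal(size=8), "adv": rng.normal(size=8)}

def embed(x, t):
    return W @ x + v[t]

def co_attack_image(x, t_adv="adv", t_clean="clean",
                    eps=0.05, eta=0.01, steps=10, alpha=1.0):
    """PGD ascent on the collaborative loss
    ||e(x', t_adv) - e(x, t_adv)||_2 + alpha * ||e(x', t_adv) - e(x, t_clean)||_2."""
    e_ref_adv = embed(x, t_adv)    # clean image, adversarial text
    e_ref_cln = embed(x, t_clean)  # clean image, clean text
    delta = np.zeros_like(x)
    for _ in range(steps):
        e_cur = embed(x + delta, t_adv)
        d1, d2 = e_cur - e_ref_adv, e_cur - e_ref_cln
        # Analytic gradient of the two L2 terms w.r.t. x' (valid for e = Wx + v;
        # a real attack would obtain g via autograd).
        g = W.T @ (d1 / (np.linalg.norm(d1) + 1e-12)) \
            + alpha * W.T @ (d2 / (np.linalg.norm(d2) + 1e-12))
        delta = np.clip(delta + eta * np.sign(g), -eps, eps)  # L_inf projection
    return x + delta

x = rng.normal(size=16)
x_adv = co_attack_image(x)
shift = float(np.linalg.norm(embed(x_adv, "adv") - embed(x, "adv")))
print(f"joint-embedding shift: {shift:.3f}")
```

With a real VLP, `embed` would be the model's joint encoder and the gradient would come from backpropagation; the ℓ∞ projection and the two-term collaborative loss are unchanged.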
3. Empirical Evaluation Across Vision-Language Tasks
Experimental validation demonstrates the robustness and general applicability of Co-Attack across standard VLP benchmarks:
| Task/Model | Vanilla ASR (%) | Co-Attack ASR (%) |
|---|---|---|
| ALBEF Flickr30K R@1 | 63.8 | 70.6 |
| CLIP_ViT Flickr30K | 64.0 | 73.8 |
| ALBEF SNLI-VE | 70.6 | 79.3 |
| TCL SNLI-VE | 66.5 | 76.4 |
| ALBEF RefCOCO+ | 16.5 | 19.2 |
Attack success rate (ASR) is defined as the drop in original task accuracy following adversarial perturbation.
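As a concrete reading of this definition (a minimal sketch; papers differ on whether the "drop" is absolute, in percentage points, or relative to clean accuracy, so both variants are shown):

```python
def asr_absolute(acc_clean, acc_adv):
    """ASR as the absolute accuracy drop, in percentage points."""
    return acc_clean - acc_adv

def asr_relative(acc_clean, acc_adv):
    """ASR as the fraction of original accuracy destroyed, in percent."""
    return 100.0 * (acc_clean - acc_adv) / acc_clean

# Example: a task whose accuracy falls from 80% to 24% under attack.
print(asr_absolute(80.0, 24.0))  # 56.0 percentage points
print(asr_relative(80.0, 24.0))  # 70.0 percent
```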
Ablation studies confirm that the collaborative loss term is essential; performance plateaus over a broad range of the weight $\alpha$, indicating practical robustness to tuning. Analysis of embedding geometry shows that naive bi-modal attacks produce large angles between the two modalities' perturbation-induced embedding shifts (near-cancellation), whereas Co-Attack ensures aligned, mutually reinforcing adversarial shifts.
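The geometric claim can be checked directly: given the embedding shifts induced by the image-only and text-only perturbations, a cosine near −1 indicates cancellation, while a cosine near +1 indicates the reinforcing shifts Co-Attack seeks. The shift vectors below are invented for illustration:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Embedding shifts caused by each modality's perturbation (toy vectors).
shift_img_naive = np.array([1.0, 0.5, -0.2])
shift_txt_naive = np.array([-0.9, -0.6, 0.3])  # nearly opposite: cancellation
shift_img_co = np.array([1.0, 0.5, -0.2])
shift_txt_co = np.array([0.8, 0.6, -0.1])      # aligned: reinforcing

print(cosine(shift_img_naive, shift_txt_naive))  # close to -1
print(cosine(shift_img_co, shift_txt_co))        # close to +1
```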
4. Methodological Extensions: FGA-T, CMI-Attack, JMTFA, and Mutual-Modality Frameworks
Multiple works refine or extend the collaborative multimodal attack paradigm:
- Feature Guidance with Text Attack (FGA-T) introduces text features as semantic centroids to guide image perturbations, constructing the loss to push image features away from the correct text embeddings and toward "adversarially guided" ones (Zheng et al., 2024). FGA-T sharply outperforms Co-Attack, reducing VE accuracy to 2.78% versus 19.36% for Co-Attack.
- Collaborative Multimodal Interaction Attack (CMI-Attack) attacks text directly in embedding space (using GloVe neighbors), accumulates image gradients during text attack (“Interaction Gradient Information”), and jointly refines both modalities via multimodal gradients. Black-box transfer is markedly increased, e.g., by 8.11–16.75pp over SGA (Fu et al., 2024).
- Joint Multimodal Transformer Feature Attack (JMTFA) directly targets attention-relevance scores in VLP transformers. Each update for one modality is synchronized with the most recent attention response from the other, maximizing model confusion via attention-weighted gradients. JMTFA further amplifies attack effectiveness and reveals the centrality of the text stream in VLP vulnerability (Guan et al., 2024).
- Mutual-Modality Adversarial Attack deploys a black-box/transferable strategy alternating between visual attack (universal perturbations on CLIP) and textual prompt updates. The resulting adversarial interplay produces highly transferable, semantically impactful attacks (Ye et al., 2023).
- Modality Unified Attack (MUA): In omni-modality person re-identification, MUA fits per-modality generators to collectively disrupt fused feature spaces ("Multi-Modality Collaborative Disruption" and "Cross Modality Simulated Disruption"). This approach achieves up to a 62.7% mean mAP drop rate, outperforming prior attacks and showing the significance of cross-modal coordination (Bian et al., 2025).
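CMI-Attack's embedding-space text attack replaces words with near neighbors in a pretrained embedding table. The idea can be sketched with a toy lookup; the vectors below are invented stand-ins for actual GloVe embeddings, and the single-nearest-neighbor rule is a simplification of the paper's candidate selection:

```python
import numpy as np

# Toy word-embedding table standing in for GloVe (vectors are illustrative).
emb = {
    "dog":   np.array([0.90, 0.10, 0.00]),
    "puppy": np.array([0.85, 0.15, 0.05]),
    "cat":   np.array([0.10, 0.90, 0.00]),
    "car":   np.array([0.00, 0.10, 0.90]),
}

def nearest_neighbor(word, table):
    """Closest other word in embedding space — the candidate substitution pool."""
    src = table[word]
    best, best_d = None, float("inf")
    for w, vec in table.items():
        if w == word:
            continue
        d = float(np.linalg.norm(src - vec))
        if d < best_d:
            best, best_d = w, d
    return best

print(nearest_neighbor("dog", emb))  # "puppy" — a semantically close substitute
```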
5. Comparative Table of Key Collaborative Multimodal Attack Methods
| Attack | Joint Update? | Embedding Guidance | Attention/Coupling | Transferability |
|---|---|---|---|---|
| Co-Attack | ✓ | Feature deviation | Aligned angle loss | Moderate (white-box) |
| FGA-T | ✓ | Text as centroids | Tight cross-modal tie | High |
| CMI-Attack | ✓ | GloVe, image grad | IGI multimodal grad | Highest (transfer) |
| JMTFA | ✓ | Attention relevance | Cross-attention synch | High (white-box) |
| MUA | ✓ | Modality-specific | CMSD/MMCD | State-of-the-art (re-ID) |
6. Impact, Limitations, and Defensive Strategies
Collaborative multimodal attacks reveal fundamental vulnerabilities in VLPs:
- Even tightly-coupled cross-modal fusion architectures are susceptible; joint attacks outperform separate/unimodal attacks by large margins (Zhang et al., 2022; Zheng et al., 2024; Fu et al., 2024).
- Textual perturbations frequently exert outsized impact, especially when cross-attention intertwines word tokens with visual patches (Guan et al., 2024).
- Transferability is maximized when attacks exploit multimodal gradients, semantic priors (text embeddings), and attention alignment, invalidating many unimodal defense intuitions.
- No clear relationship exists between model size and adversarial robustness in the multimodal context (Guan et al., 2024).
Defenses:
- Multimodal adversarial training with collaborative examples (Zhang et al., 2022).
- Adversarial input detection via monitoring large embedding shifts (Zhang et al., 2022).
- Certified robustness using randomization and smoothing over both modalities (Zhang et al., 2022).
- Multimodal joint denoising (e.g., JPEG for images, paraphrase consistency for text).
- Attention-masking and robust aggregation for cross-attention vulnerabilities (Guan et al., 2024).
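The embedding-shift detection idea can be sketched as a threshold test on how far an input's embedding moves under light denoising. The linear embedder and circular moving-average filter below are toy stand-ins (for a real system, the embedder would be the VLP image encoder and the filter JPEG compression):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))  # toy linear stand-in for an image embedder

def embed(x):
    return W @ x

def denoise(x):
    # Circular moving average standing in for JPEG-style denoising.
    return (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

def shift_score(x):
    """Embedding distance between an input and its denoised version.
    High-frequency adversarial noise tends to inflate this score."""
    return float(np.linalg.norm(embed(x) - embed(denoise(x))))

x_clean = np.full(16, 0.5)                         # perfectly smooth "image"
x_adv = x_clean + 0.05 * np.tile([1.0, -1.0], 8)   # alternating high-freq noise

print(shift_score(x_clean), shift_score(x_adv))    # clean score is exactly 0 here
```

A deployed detector would calibrate the flagging threshold on clean validation data rather than rely on the exact zero this smooth toy input produces.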
A plausible implication is that future multimodal model design must integrate robustness objectives at the fusion and attention levels, not just within unimodal branches.
7. Future Directions and Open Challenges
Current research trajectories suggest several advances:
- Extending collaborative attacks to additional modalities (audio, video, 3D point cloud) (Dou et al., 2024).
- Certified defense frameworks in shared embedding spaces (randomized smoothing, adversarial certificates) (Dou et al., 2024).
- End-to-end robust multimodal pretraining with adversarial negatives and regularized alignment (Zhang et al., 2022).
- Adaptive detection based on learned distributions of cross-modal similarity.
- Exploring continuous prompt learning for scalable, high-level adversarial guidance (Ye et al., 2023).
- Theoretical analysis of saddle-point dynamics between attack and defense in multimodal alignment spaces.
Collaborative Multimodal Adversarial Attack, established initially in (Zhang et al., 2022), now defines a broad methodological paradigm underlying both contemporary attack strategies and corresponding defenses for vision-language and more general multimodal models. This framework continues to guide robust model design and systematic adversarial evaluation across the expanding multimodal AI landscape.