
Co-Attack: Collaborative Multimodal Adversarial Attack

Updated 15 February 2026
  • The paper introduces Co-Attack, a method that jointly optimizes adversarial perturbations in both image and text modalities to maximally disrupt vision-language models.
  • Empirical evaluations show significant performance drops in tasks like image-text retrieval and visual entailment, confirming the attack's robustness and effectiveness.
  • The collaborative optimization framework underscores cross-modal coupling, prompting the need for multimodal adversarial training and robust fusion defenses.

Collaborative Multimodal Adversarial Attack (Co-Attack) refers to a class of adversarial attack strategies targeting multimodal deep learning models, specifically vision-language pre-training (VLP) models that jointly process image and text inputs. Co-Attack leverages coordinated perturbations in both the visual and textual modalities to maximally disrupt the fused or aligned representations learned within these models, thereby severely degrading their performance on diverse downstream tasks such as image-text retrieval, visual entailment, visual grounding, and person re-identification.

1. Formal Threat Model and Objectives

Co-Attack is designed for models f: (x, t) \mapsto y, where x \in \mathbb{R}^{H \times W \times 3} is an image, t is a sequence of n word tokens, and y is a prediction or joint embedding. The typical white-box threat model grants the attacker full access to model weights and gradients for crafting adversarial examples, though extensions exist for improving black-box transferability. The attacker seeks small, often imperceptible, perturbations:

  • Image: x' = x + \delta_{\text{img}}, with \|\delta_{\text{img}}\|_{\infty} \leq \epsilon_{\text{img}} (e.g., \epsilon_{\text{img}} = 2/255)
  • Text: t' = \text{Edit}(t, \delta_{\text{text}}), with \text{EditDist}(t, t') \leq B (typically B = 1 token change)

with the goal of maximizing a task-specific adversarial loss enforcing misalignment or misclassification (Zhang et al., 2022).
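As a concrete sketch, the two budgets can be enforced with an ℓ∞ projection and an edit-distance check. The helper names below are hypothetical, and the text check covers only the equal-length substitution case:

```python
import numpy as np

def project_linf(delta, eps):
    """Project an image perturbation onto the l-infinity ball of radius eps."""
    return np.clip(delta, -eps, eps)

def within_text_budget(tokens, tokens_adv, budget=1):
    """Count substituted tokens against the edit budget B (substitution-only case)."""
    assert len(tokens) == len(tokens_adv)
    changed = sum(a != b for a, b in zip(tokens, tokens_adv))
    return changed <= budget

eps_img = 2 / 255
delta = project_linf(np.array([0.05, -0.001, 0.003]), eps_img)  # entries clipped to ±2/255
ok = within_text_budget(["a", "red", "car"], ["a", "blue", "car"])  # one substitution: allowed
```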

2. Mathematical Formulation and Attack Pipeline

The technical hallmark of Co-Attack is collaborative optimization, explicitly coupling perturbations on both modalities so as to maximize the net adversarial effect on the model’s multimodal representation.

Fused VLPs (e.g., ALBEF, TCL)

For models with explicit fusion, the attack proceeds in two phases:

  1. Text Attack: Find t' that maximizes the joint embedding deviation:

t' = \arg\max_{\text{EditDist}(t, t') \leq B} \left\| E_m(E_i(x), E_t(t')) - E_m(E_i(x), E_t(t)) \right\|_2

  2. Image Attack (Collaborative):

\max_{\|\delta_{\text{img}}\|_\infty \leq \epsilon_{\text{img}}} \left\{ \|e_m(x + \delta_{\text{img}}, t') - e_m(x, t')\|_2 + \alpha_1 \|e_m(x + \delta_{\text{img}}, t') - e_m(x, t)\|_2 \right\}

The collaborative \alpha_1 term enforces concerted movement in the joint embedding space.

Aligned VLPs (e.g., CLIP)

Text tt' is found by maximizing the deviation in the text embedding. Image attack optimizes:

\max_{\|\delta_{\text{img}}\|_\infty \leq \epsilon_{\text{img}}} \|E_i(x + \delta_{\text{img}}) - E_i(x)\|_2 + \alpha_2 \, \|E_i(x + \delta_{\text{img}}) - E_t(t')\|_2

Here, the second term prevents cancellation between modalities.
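The aligned-VLP objective can be illustrated end to end with projected sign-gradient ascent on a toy linear encoder standing in for E_i. All shapes, constants, and the fixed vector standing in for E_t(t') are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_emb = 8, 4
Wi = rng.normal(size=(d_emb, d_in))   # toy linear stand-in for the image encoder E_i
x = rng.normal(size=d_in)             # clean "image"
et = rng.normal(size=d_emb)           # stand-in for E_t(t'), the adversarial text embedding
eps, eta, alpha, K = 0.1, 0.02, 1.0, 20

def loss(delta):
    # ||E_i(x+δ) − E_i(x)||_2 + α ||E_i(x+δ) − E_t(t')||_2
    ei_adv, ei = Wi @ (x + delta), Wi @ x
    return np.linalg.norm(ei_adv - ei) + alpha * np.linalg.norm(ei_adv - et)

def grad(delta):
    # Analytic gradient of the two 2-norm terms with respect to δ
    ei_adv, ei = Wi @ (x + delta), Wi @ x
    g = np.zeros_like(delta)
    d1, d2 = ei_adv - ei, ei_adv - et
    if np.linalg.norm(d1) > 1e-12:     # first term is non-differentiable at δ = 0
        g += Wi.T @ (d1 / np.linalg.norm(d1))
    g += alpha * Wi.T @ (d2 / np.linalg.norm(d2))
    return g

delta = np.zeros(d_in)
for _ in range(K):
    # PGD: signed ascent step, then projection back onto the l_inf ball
    delta = np.clip(delta + eta * np.sign(grad(delta)), -eps, eps)
```

Because the objective is convex in δ, each signed ascent step followed by clipping cannot decrease the loss, so the final perturbation strictly increases the objective over δ = 0.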

Generalized Algorithm (Pseudocode):

Inputs: (x, t), budgets ε_img and B, model f, step size η, PGD steps K, weight α
1. # Text attack (discrete search)
   t_adv = BERT-Attack(t, maximize ||e(·, t') − e(·, t)||)
2. # Image attack (projected gradient ascent)
   δ = 0; x_adv = x
   for k in 1..K:
       g = ∇_{x_adv} [ ||e(x_adv, t_adv) − e(x, t_adv)||_2 + α · ||e(x_adv, t_adv) − e(x, t)||_2 ]
       δ = clip(δ + η · sign(g), −ε_img, +ε_img)
       x_adv = x + δ
3. return (x_adv, t_adv)
Co-Attack consistently outperforms single-modal attacks and prior non-collaborative bi-modal methods on a variety of V+L architectures and tasks (Zhang et al., 2022).
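The discrete text phase (step 1 of the pseudocode) can be sketched as a greedy single-token substitution search. The word embeddings and synonym candidates below are toy stand-ins for the BERT-Attack machinery the method actually uses:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["a", "red", "blue", "green", "car", "truck", "dog"]
emb = {w: rng.normal(size=4) for w in vocab}               # toy word embeddings
candidates = {"red": ["blue", "green"], "car": ["truck"]}  # hypothetical synonym sets

def sent_emb(tokens):
    """Mean-pooled bag-of-words embedding, a stand-in for E_t."""
    return np.mean([emb[w] for w in tokens], axis=0)

def text_attack(tokens):
    """Greedily pick the single substitution maximizing text-embedding deviation (B = 1)."""
    base = sent_emb(tokens)
    best, best_dev = list(tokens), 0.0
    for i, w in enumerate(tokens):
        for sub in candidates.get(w, []):
            trial = list(tokens)
            trial[i] = sub
            dev = np.linalg.norm(sent_emb(trial) - base)
            if dev > best_dev:
                best, best_dev = trial, dev
    return best

t_adv = text_attack(["a", "red", "car"])  # differs from the input in at most one token
```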

3. Empirical Evaluation Across Vision-Language Tasks

Experimental validation demonstrates the robustness and general applicability of Co-Attack across standard VLP benchmarks:

Task/Model              Vanilla ASR (%)   Co-Attack ASR (%)
ALBEF, Flickr30K R@1    63.8              70.6
CLIP_ViT, Flickr30K     64.0              73.8
ALBEF, SNLI-VE          70.6              79.3
TCL, SNLI-VE            66.5              76.4
ALBEF, RefCOCO+         16.5              19.2

Attack success rate (ASR) is defined as the drop in original task accuracy following adversarial perturbation.
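Under this drop-in-accuracy definition, ASR can be computed from per-sample correctness flags; the arrays below are illustrative, not taken from the paper:

```python
import numpy as np

def attack_success_rate(correct_clean, correct_adv):
    """ASR as the drop in task accuracy (percentage points) caused by the attack."""
    return 100.0 * (np.mean(correct_clean) - np.mean(correct_adv))

correct_clean = np.array([1, 1, 1, 1, 0], dtype=bool)  # 80% accuracy on clean inputs
correct_adv = np.array([1, 0, 0, 0, 0], dtype=bool)    # 20% accuracy under attack
asr = attack_success_rate(correct_clean, correct_adv)  # accuracy drop in points
```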

Ablation studies confirm that the collaborative loss term weighted by \alpha is essential; performance plateaus for \alpha \geq 1, indicating practical robustness to \alpha tuning. Analysis of embedding geometry shows that naive bi-modal attacks produce large angles (near-cancellation) between the image- and text-induced shifts, whereas Co-Attack ensures aligned, reinforcing adversarial shifts.
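The near-cancellation effect can be made concrete with the cosine of the angle between the two modalities' shift vectors in the joint space; the vectors below are toy numbers chosen purely for illustration:

```python
import numpy as np

def cos_angle(u, v):
    """Cosine of the angle between two shift vectors in the joint embedding space."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

shift_img = np.array([1.0, 0.2])            # image-induced shift
shift_text_naive = np.array([-0.9, -0.1])   # independently crafted text shift
shift_text_collab = np.array([0.8, 0.3])    # text shift steered by the collaborative loss

naive = cos_angle(shift_img, shift_text_naive)    # negative: shifts oppose and partly cancel
collab = cos_angle(shift_img, shift_text_collab)  # positive: shifts reinforce each other
```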

4. Methodological Extensions: FGA-T, CMI-Attack, JMTFA, and Mutual-Modality Frameworks

Multiple works refine or extend the collaborative multimodal attack paradigm:

  • Feature Guidance with Text Attack (FGA-T) introduces text features as semantic centroids to guide image perturbations, constructing the loss to push image features away from correct and toward “adversarially guided” text embeddings (Zheng et al., 2024). FGA-T sharply outperforms Co-Attack, reducing visual entailment accuracy to 2.78% versus 19.36% for Co-Attack.
  • Collaborative Multimodal Interaction Attack (CMI-Attack) attacks text directly in embedding space (using GloVe neighbors), accumulates image gradients during text attack (“Interaction Gradient Information”), and jointly refines both modalities via multimodal gradients. Black-box transfer is markedly increased, e.g., by 8.11–16.75pp over SGA (Fu et al., 2024).
  • Joint Multimodal Transformer Feature Attack (JMTFA) directly targets attention-relevance scores in VLP transformers. Each update for one modality is synchronized with the most recent attention response from the other, maximizing model confusion via attention-weighted gradients. JMTFA further amplifies attack effectiveness and reveals the centrality of the text stream in VLP vulnerability (Guan et al., 2024).
  • Mutual-Modality Adversarial Attack deploys a black-box/transferable strategy alternating between visual attack (universal perturbations on CLIP) and textual prompt updates. The resulting adversarial interplay produces highly transferable, semantically impactful attacks (Ye et al., 2023).
  • Modality Unified Attack (MUA): In omni-modality person re-identification, MUA fits per-modality generators to collectively disrupt fused feature spaces (“Multi-Modality Collaborative Disruption” and “Cross Modality Simulated Disruption”). This approach achieves up to a 62.7% mean mAP drop rate, outperforming prior attacks and underscoring the significance of cross-modal coordination (Bian et al., 2025).

5. Comparative Table of Key Collaborative Multimodal Attack Methods

Attack       Joint Update?   Embedding Guidance    Attention/Coupling      Transferability
Co-Attack    Yes             Feature deviation     Aligned angle loss      Moderate (white-box)
FGA-T        Yes             Text as centroids     Tight cross-modal tie   High
CMI-Attack   Yes             GloVe, image grad     IGI multimodal grad     Highest (transfer)
JMTFA        Yes             Attention relevance   Cross-attention synch   High (white-box)
MUA          Yes             Modality-specific     CMSD/MMCD               State-of-the-art (re-ID)

6. Impact, Limitations, and Defensive Strategies

Collaborative multimodal attacks reveal fundamental vulnerabilities in VLPs:

  • Even tightly-coupled cross-modal fusion architectures are susceptible; joint attacks outperform separate/unimodal attacks by large margins (Zhang et al., 2022; Zheng et al., 2024; Fu et al., 2024).
  • Textual perturbations frequently exert outsized impact, especially when cross-attention intertwines word tokens with visual patches (Guan et al., 2024).
  • Transferability is maximized when attacks exploit multimodal gradients, semantic priors (text embeddings), and attention alignment, invalidating many unimodal defense intuitions.
  • No clear relationship exists between model size and adversarial robustness in the multimodal context (Guan et al., 2024).

Defenses:

A plausible implication is that future multimodal model design must integrate robustness objectives at the fusion and attention levels, not just within unimodal branches.

7. Future Directions and Open Challenges

Current research trajectories suggest several advances:

  • Extending collaborative attacks to additional modalities (audio, video, 3D point cloud) (Dou et al., 2024).
  • Certified defense frameworks in shared embedding spaces (randomized smoothing, adversarial certificates) (Dou et al., 2024).
  • End-to-end robust multimodal pretraining with adversarial negatives and regularized alignment (Zhang et al., 2022).
  • Adaptive detection based on learned distributions of cross-modal similarity.
  • Exploring continuous prompt learning for scalable, high-level adversarial guidance (Ye et al., 2023).
  • Theoretical analysis of saddle-point dynamics between attack and defense in multimodal alignment spaces.

Collaborative Multimodal Adversarial Attack, established initially in (Zhang et al., 2022), now defines a broad methodological paradigm underlying both contemporary attack strategies and corresponding defenses for vision-language and more general multimodal models. This framework continues to guide robust model design and systematic adversarial evaluation across the expanding multimodal AI landscape.
