Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

Published 16 May 2024 in cs.CV (arXiv:2405.09981v1)

Abstract: Multimodal LLMs (MLLMs) have recently achieved strong performance across various vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding in MLLMs remains unexplored. To fill this gap, we use referring expression comprehension (REC) as an example visual grounding task and propose three adversarial attack paradigms. First, untargeted adversarial attacks induce MLLMs to generate an incorrect bounding box for each object. Second, exclusive targeted adversarial attacks force all generated outputs to the same target bounding box. Third, permuted targeted adversarial attacks permute the bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.


Summary

  • The paper proposes three novel adversarial attack paradigms—untargeted, exclusive targeted, and permuted targeted—to assess MLLM visual grounding robustness.
  • Experiments with MiniGPT-v2 show a significant drop in Acc@0.5 (accuracy at an IoU threshold of 0.5) under untargeted attacks, demonstrating the disruptive effect of adversarial perturbations.
  • Findings underscore the need for robust defense mechanisms to enhance the reliability and security of multimodal models in object localization tasks.

Adversarial Robustness for Visual Grounding of Multimodal LLMs

Introduction

The paper "Adversarial Robustness for Visual Grounding of Multimodal LLMs" explores the limitations in the adversarial robustness of visual grounding capabilities in Multimodal LLMs (MLLMs). Despite their advancements, MLLMs remain vulnerable to adversarial attacks, which pose significant risks to their reliability and accuracy in vision-language tasks. This paper specifically addresses this gap by evaluating adversarial robustness through novel attack paradigms.

Adversarial Attack Paradigms

The authors propose three adversarial attack paradigms to assess the visual grounding vulnerabilities of MLLMs: untargeted, exclusive targeted, and permuted targeted attacks. These paradigms aim to disrupt the model's ability to accurately localize objects in an image based on given textual prompts by generating adversarial images designed to mislead the MLLM's predictions.

  • Untargeted Adversarial Attacks: This method aims to decrease the prediction accuracy of bounding boxes, causing MLLMs to mislocate objects. Two techniques are employed: the image embedding attack, which maximizes the distance between the original and adversarial image embeddings; and the textual bounding box attack, which minimizes the probability of generating the correct bounding box text.

    Figure 1: Three adversarial attack paradigms are proposed to evaluate the adversarial robustness for visual grounding of MLLMs.

  • Exclusive Targeted Adversarial Attacks: This paradigm misleads MLLMs into predicting a single target bounding box regardless of the input prompt. The target bounding box is a predetermined location, such as a box in the top-left corner of the image, that serves as the deceptive output for every query.
  • Permuted Targeted Adversarial Attacks: Unlike exclusive targeted attacks, this method permutes the bounding boxes among the different objects within the same image, so that each referring expression is grounded to another object's location; a loss-level sketch of all three paradigms follows this list.
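
To make the objectives concrete, here is a minimal sketch of how each paradigm can be written as a loss over the adversarial image. The `vision_encoder` and `mllm_loss` callables, the box-string targets, and the cyclic permutation are illustrative stand-ins rather than the paper's exact formulation; `mllm_loss` is assumed to return the language-modeling loss of the MLLM for generating a given box string from an image and prompt.

```python
# Sketch only: `vision_encoder` and `mllm_loss` stand in for the attacked MLLM
# (e.g. MiniGPT-v2); all losses are written so that LOWER is better for the attacker.
import torch

def untargeted_embedding_loss(vision_encoder, clean_image, adv_image):
    """Image embedding attack: push the adversarial embedding away from the clean one."""
    with torch.no_grad():
        clean_emb = vision_encoder(clean_image)
    adv_emb = vision_encoder(adv_image)
    return -torch.norm(adv_emb - clean_emb, p=2)  # maximize distance = minimize its negative

def untargeted_box_loss(mllm_loss, adv_image, prompts, true_box_strings):
    """Textual bounding box attack: make the correct box text unlikely."""
    return -mllm_loss(adv_image, prompts, true_box_strings)

def exclusive_targeted_loss(mllm_loss, adv_image, prompts, target_box_string):
    """Drive every referring expression to one fixed box (e.g. a top-left box)."""
    return mllm_loss(adv_image, prompts, [target_box_string] * len(prompts))

def permuted_targeted_loss(mllm_loss, adv_image, prompts, true_box_strings):
    """Assign each expression the box of another object in the same image."""
    permuted = true_box_strings[1:] + true_box_strings[:1]  # illustrative cyclic shift
    return mllm_loss(adv_image, prompts, permuted)
```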

Experimental Setup and Results

Experiments were conducted with the MiniGPT-v2 model on the RefCOCO, RefCOCO+, and RefCOCOg datasets. The adversarial perturbations were optimized with Projected Gradient Descent (PGD) under a bounded perturbation constraint. The results show a significant drop in Acc@0.5 under untargeted attacks, indicating successful degradation of the MLLM's grounding accuracy. Conversely, for the targeted attacks, Acc@0.5 measured against the altered (target) bounding box labels is high, especially in the exclusive targeted attack paradigm, indicating that the perturbations successfully steer predictions toward the attacker-chosen boxes.
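
The exact optimization settings are not reproduced in this summary, so the following is only a generic sketch of an L∞-bounded PGD loop that could minimize any of the attack losses sketched above; the budget, step size, and iteration count are placeholders rather than the paper's hyperparameters.

```python
# Generic L-infinity PGD loop; eps, alpha, and steps are illustrative placeholders.
# Images are assumed to be float tensors with values in [0, 1].
import torch

def pgd_attack(loss_fn, clean_image, eps=8/255, alpha=1/255, steps=100):
    """Minimize loss_fn(adv_image) while keeping adv_image in an L-inf ball around clean_image."""
    adv = clean_image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                           # descend the attack loss
            adv = clean_image + (adv - clean_image).clamp(-eps, eps)  # project to the L-inf ball
            adv = adv.clamp(0, 1)                                     # keep a valid image
    return adv.detach()

# Hypothetical wiring with the loss sketches above, e.g. for the exclusive targeted attack:
# adv_image = pgd_attack(
#     lambda x: exclusive_targeted_loss(mllm_loss, x, prompts, target_box_string), image)
```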

The experiments demonstrate that untargeted adversarial attacks substantially reduce MLLM grounding accuracy. In comparison, exclusive and permuted targeted attacks effectively steer the model's output toward the attacker-specified bounding boxes, although permuted attacks are harder to optimize because each referring expression must be driven to a different, image-specific target box rather than to a single fixed one.
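
Since the reported numbers are all Acc@0.5-style metrics, it may help to spell out the underlying check: a prediction counts as a hit when its IoU with the reference box (the ground-truth box for untargeted attacks, or the attacker's target box for targeted attacks) exceeds 0.5. The sketch below assumes [x1, y1, x2, y2] box coordinates.

```python
# Acc@0.5-style evaluation; boxes are assumed to be [x1, y1, x2, y2] with x2 > x1, y2 > y1.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(predicted_boxes, reference_boxes):
    """Fraction of predictions whose IoU with the reference box exceeds 0.5."""
    hits = sum(iou(p, r) > 0.5 for p, r in zip(predicted_boxes, reference_boxes))
    return hits / len(reference_boxes)
```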

This study builds upon previous research in adversarial robustness, particularly within vision-language tasks. While many adversarial scenarios in MLLMs have focused on image captioning and question answering, this work shifts focus to visual grounding—a task critical for object localization and spatial understanding. The introduction of these adversarial attack paradigms represents a meaningful step forward in assessing and understanding vulnerabilities in advanced AI systems like MLLMs.

Conclusion

The findings of this research offer valuable insights into the adversarial vulnerabilities of MLLMs and highlight the necessity for improved robustness in visual grounding tasks. These attack paradigms not only serve as benchmarks for future evaluations but also lay a foundation for developing more secure and resilient multimodal models. Further research could refine these paradigms or introduce novel defensive mechanisms to enhance the reliability of MLLMs in diverse, real-world applications.
