Adversarial Robustness for Visual Grounding of Multimodal Large Language Models

Published 16 May 2024 in cs.CV (arXiv:2405.09981v1)

Abstract: Multimodal LLMs (MLLMs) have recently achieved strong performance across various vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding in MLLMs remains unexplored. To fill this gap, we use referring expression comprehension (REC) as an example visual grounding task and propose three adversarial attack paradigms. First, untargeted adversarial attacks induce MLLMs to generate an incorrect bounding box for each object. Second, exclusive targeted adversarial attacks force all generated outputs to the same target bounding box. Third, permuted targeted adversarial attacks permute the bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.


Summary

  • The paper proposes three novel adversarial attack paradigms—untargeted, exclusive targeted, and permuted targeted—to assess MLLM visual grounding robustness.
  • Experiments with MiniGPT-v2 show a significant drop in Acc@0.5 (accuracy at an IoU threshold of 0.5) under untargeted attacks, demonstrating the disruptive effect of adversarial perturbations.
  • Findings underscore the need for robust defense mechanisms to enhance the reliability and security of multimodal models in object localization tasks.

Adversarial Robustness for Visual Grounding of Multimodal LLMs

Introduction

The paper "Adversarial Robustness for Visual Grounding of Multimodal LLMs" explores the limitations in the adversarial robustness of visual grounding capabilities in Multimodal LLMs (MLLMs). Despite their advancements, MLLMs remain vulnerable to adversarial attacks, which pose significant risks to their reliability and accuracy in vision-language tasks. This paper specifically addresses this gap by evaluating adversarial robustness through novel attack paradigms.

Adversarial Attack Paradigms

The authors propose three adversarial attack paradigms to assess the visual grounding vulnerabilities of MLLMs: untargeted, exclusive targeted, and permuted targeted attacks. These paradigms aim to disrupt the model's ability to accurately localize objects in an image based on given textual prompts by generating adversarial images designed to mislead the MLLM's predictions.

  • Untargeted Adversarial Attacks: This method aims to decrease the prediction accuracy of bounding boxes, causing MLLMs to mislocate objects. Two techniques are employed: the image embedding attack, which maximizes the distance between the original and adversarial image embeddings; and the textual bounding box attack, which minimizes the probability of generating the correct bounding box text.

    Figure 1: Three adversarial attack paradigms are proposed to evaluate the adversarial robustness for visual grounding of MLLMs.

  • Exclusive Targeted Adversarial Attacks: This paradigm misleads MLLMs into predicting a single target bounding box regardless of the input prompt. The target bounding box is a predetermined location, such as a box in the top-left corner of the image, that serves as the deceptive output for every query.
  • Permuted Targeted Adversarial Attacks: Unlike exclusive targeted attacks, this method permutes the bounding boxes among the different objects within the same image, so that each referring expression is grounded to another object's location; a loss-level sketch of all three paradigms follows this list.
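
To make the objectives concrete, here is a minimal sketch of how each paradigm can be written as a loss over the adversarial image. The `vision_encoder` and `mllm_loss` callables, the box-string targets, and the cyclic permutation are illustrative stand-ins rather than the paper's exact formulation; `mllm_loss` is assumed to return the language-modeling loss of the MLLM for generating a given box string from an image and prompt.

```python
# Sketch only: `vision_encoder` and `mllm_loss` stand in for the attacked MLLM
# (e.g. MiniGPT-v2); all losses are written so that LOWER is better for the attacker.
import torch

def untargeted_embedding_loss(vision_encoder, clean_image, adv_image):
    """Image embedding attack: push the adversarial embedding away from the clean one."""
    with torch.no_grad():
        clean_emb = vision_encoder(clean_image)
    adv_emb = vision_encoder(adv_image)
    return -torch.norm(adv_emb - clean_emb, p=2)  # maximize distance = minimize its negative

def untargeted_box_loss(mllm_loss, adv_image, prompts, true_box_strings):
    """Textual bounding box attack: make the correct box text unlikely."""
    return -mllm_loss(adv_image, prompts, true_box_strings)

def exclusive_targeted_loss(mllm_loss, adv_image, prompts, target_box_string):
    """Drive every referring expression to one fixed box (e.g. a top-left box)."""
    return mllm_loss(adv_image, prompts, [target_box_string] * len(prompts))

def permuted_targeted_loss(mllm_loss, adv_image, prompts, true_box_strings):
    """Assign each expression the box of another object in the same image."""
    permuted = true_box_strings[1:] + true_box_strings[:1]  # illustrative cyclic shift
    return mllm_loss(adv_image, prompts, permuted)
```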

Experimental Setup and Results

Experiments were conducted with the MiniGPT-v2 model on the RefCOCO, RefCOCO+, and RefCOCOg datasets. The adversarial perturbations were optimized with Projected Gradient Descent (PGD) under a bounded perturbation constraint. The results show a significant drop in Acc@0.5 under untargeted attacks, indicating successful degradation of the MLLM's grounding accuracy. Conversely, for the targeted attacks, Acc@0.5 measured against the altered (target) bounding box labels is high, especially in the exclusive targeted attack paradigm, indicating that the perturbations successfully steer predictions toward the attacker-chosen boxes.
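
The exact optimization settings are not reproduced in this summary, so the following is only a generic sketch of an L∞-bounded PGD loop that could minimize any of the attack losses sketched above; the budget, step size, and iteration count are placeholders rather than the paper's hyperparameters.

```python
# Generic L-infinity PGD loop; eps, alpha, and steps are illustrative placeholders.
# Images are assumed to be float tensors with values in [0, 1].
import torch

def pgd_attack(loss_fn, clean_image, eps=8/255, alpha=1/255, steps=100):
    """Minimize loss_fn(adv_image) while keeping adv_image in an L-inf ball around clean_image."""
    adv = clean_image.clone().detach()
    adv = (adv + torch.empty_like(adv).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                           # descend the attack loss
            adv = clean_image + (adv - clean_image).clamp(-eps, eps)  # project to the L-inf ball
            adv = adv.clamp(0, 1)                                     # keep a valid image
    return adv.detach()

# Hypothetical wiring with the loss sketches above, e.g. for the exclusive targeted attack:
# adv_image = pgd_attack(
#     lambda x: exclusive_targeted_loss(mllm_loss, x, prompts, target_box_string), image)
```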

The experiments demonstrate that untargeted adversarial attacks substantially reduce MLLM grounding accuracy. In comparison, exclusive and permuted targeted attacks effectively steer the model's output toward the attacker-specified bounding boxes, although permuted attacks are harder to optimize because each referring expression must be driven to a different, image-specific target box rather than to a single fixed one.
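
Since the reported numbers are all Acc@0.5-style metrics, it may help to spell out the underlying check: a prediction counts as a hit when its IoU with the reference box (the ground-truth box for untargeted attacks, or the attacker's target box for targeted attacks) exceeds 0.5. The sketch below assumes [x1, y1, x2, y2] box coordinates.

```python
# Acc@0.5-style evaluation; boxes are assumed to be [x1, y1, x2, y2] with x2 > x1, y2 > y1.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def acc_at_05(predicted_boxes, reference_boxes):
    """Fraction of predictions whose IoU with the reference box exceeds 0.5."""
    hits = sum(iou(p, r) > 0.5 for p, r in zip(predicted_boxes, reference_boxes))
    return hits / len(reference_boxes)
```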

This study builds upon previous research in adversarial robustness, particularly within vision-language tasks. While many adversarial scenarios in MLLMs have focused on image captioning and question answering, this work shifts focus to visual grounding—a task critical for object localization and spatial understanding. The introduction of these adversarial attack paradigms represents a meaningful step forward in assessing and understanding vulnerabilities in advanced AI systems like MLLMs.

Conclusion

The findings of this research offer valuable insights into the adversarial vulnerabilities of MLLMs and highlight the necessity for improved robustness in visual grounding tasks. These attack paradigms not only serve as benchmarks for future evaluations but also lay a foundation for developing more secure and resilient multimodal models. Further research could refine these paradigms or introduce novel defensive mechanisms to enhance the reliability of MLLMs in diverse, real-world applications.
