A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations

Published 14 Feb 2025 in cs.CR and cs.CV | (2502.14881v1)

Abstract: With the rapid advancement of Large Vision-LLMs (LVLMs), ensuring their safety has emerged as a crucial area of research. This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods. We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs and the corresponding mitigation strategies. Through an analysis of the LVLM lifecycle, we introduce a classification framework that distinguishes between inference and training phases, with further subcategories to provide deeper insights. Furthermore, we highlight limitations in existing research and outline future directions aimed at strengthening the robustness of LVLMs. As part of our research, we conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results. Our findings provide strategic recommendations for advancing LVLM safety and ensuring their secure and reliable deployment in high-stakes, real-world applications. This survey aims to serve as a cornerstone for future research, facilitating the development of models that not only push the boundaries of multimodal intelligence but also adhere to the highest standards of security and ethical integrity. Furthermore, to aid the growing research in this field, we have created a public repository to continuously compile and update the latest work on LVLM safety: https://github.com/XuankunRong/Awesome-LVLM-Safety .

Abstract PDF Upgrade to Chat

Summary

The paper presents a comprehensive framework categorizing LVLM vulnerabilities, including distinct inference and training-phase attack vectors.
It reviews attack methods such as white-box, gray-box, and black-box techniques, illustrating diverse adversarial approaches.
It outlines defense strategies and evaluation metrics designed to robustly safeguard large vision-language models in high-stakes applications.

A Survey of Safety on Large Vision-LLMs: Attacks, Defenses and Evaluations

Introduction

The paper provides a comprehensive survey of safety challenges associated with Large Vision-LLMs (LVLMs), emphasizing attacks, defenses, and evaluations. The integration of vision and language in AI modeling presents unique security vulnerabilities, necessitating a thorough examination of these models' safety. The authors propose a systematic framework to categorize and address these vulnerabilities, providing a holistic view of potential risks and defenses.

Figure 1: Overview of the survey. Best viewed in color.

LVLM Vulnerabilities

LVLMs face intrinsic vulnerabilities related to their multimodal nature. These models, by incorporating visual modalities, introduce new attack vectors distinct from those in text-only models. Visual inputs can cascade vulnerabilities throughout the system, leading to unsafe or erroneous behaviors. The paper categorizes vulnerabilities into inference-phase attacks and training-phase attacks, delineating distinct strategies based on the attacker’s knowledge of the model.

Inference-Phase Attacks

Inference-phase attacks target LVLMs during their operational use, without altering their inherent parameters. The paper classifies these attacks into:

White-Box Attacks: Requiring full access to model details, these attacks use gradients to craft adversarial inputs.
Gray-Box Attacks: Exploit partial model knowledge to craft transferable adversarial examples.
Black-Box Attacks: Operate without direct model details, relying on sophisticated prompt engineering to bypass safety protocols.
Figure 2: Illustration of Inference-Phase Attack Methods. Detailed explanations can be found in \cref{sec: White-box Attacks

Training-Phase Attacks

Training-phase attacks compromise LVLMs by manipulating training data to introduce vulnerabilities. Techniques include:

Label Poisoning Attacks: Altering labels can mislead models during training, resulting in erroneous outputs under benign inputs.
Backdoor Trigger Attacks: Embedding triggers in data can illicit specific responses only when activated by particular inputs.

Defense Strategies

The paper outlines defense mechanisms corresponding to each attack category:

Inference-Phase Defenses: Focus on input sanitation, internal optimization, and output validation, aiming to detect and neutralize attacks during model operation.
Training-Phase Defenses: Involve refining datasets and employing robust training strategies to anticipate and mitigate embedded threats.

Evaluation Approaches

Evaluation of LVLM safety involves benchmarking models under varied attack and defense scenarios. It emphasizes:

Safety Capability Evaluation: Testing models against diverse adversarial inputs to assess their robustness.
Strategy Effectivity: Measuring the success rates of various attack and defense methods under controlled conditions.

Conclusion

The survey highlights the need for continuous innovation in safeguarding LVLMs against evolving threats. Future research should focus on enhancing black-box attack strategies, improving cross-modality alignment, diversifying fine-tuning techniques for safety, and establishing comprehensive benchmarking frameworks to ensure the reliable deployment of LVLMs in high-stakes environments.

Markdown Report Issue