
Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap

Published 30 May 2025 in cs.AI | (2505.24208v1)

Abstract: Ensuring Vision-Language Models (VLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbone. Even blank or irrelevant images can trigger LVLMs to generate harmful responses to prompts that would otherwise be refused in text-only contexts. The modality gap between image and text representations has been recently hypothesized to contribute to safety degradation of LVLMs. However, if and how the amount of modality gap affects LVLMs' safety is not studied. In this work, we show that the amount of modality gap is highly inversely correlated with VLMs' safety. Then, we show that this modality gap is introduced during pretraining LVLMs and persists through fine-tuning. Inspired by this observation, we propose a regularization to reduce the modality gap during pretraining. Our extensive experiments on LLaVA v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves safety alignment of LVLMs, reducing unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.

Summary

The paper "Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap" addresses a pressing issue in the deployment of Large Vision-Language Models (LVLMs): their degraded safety performance when exposed to multimodal inputs. While LVLMs have demonstrated strong capabilities across diverse tasks, such as visual question answering and multimodal dialogue, their safety alignment has yet to match the robustness that their large language model (LLM) backbones exhibit in text-only settings.

One key issue identified in LVLMs is the modality gap between image and text representations. This gap is hypothesized to exacerbate unsafe output generation: even irrelevant visual inputs can lead these models to produce harmful responses to prompts they would refuse in text-only contexts. This work builds on that hypothesis by demonstrating that the modality gap, which arises during pretraining from differences in how image and text tokens are embedded, is strongly correlated with the safety degradation of LVLMs.

Through empirical analysis, the paper establishes that this modality gap is introduced during pretraining and persists through fine-tuning. Recognizing this, the authors propose a regularization called ReGap, aimed at minimizing the gap during pretraining. The regularizer imposes an $L_2$ norm-based loss that reduces the distance between image and text embeddings. Crucially, this requires neither extensive additional safety data nor changes to the model architecture.
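A minimal sketch of what such a regularizer could look like, assuming mean-pooled embeddings and a simple squared-$L_2$ penalty between modality centroids; the function name, pooling choice, and loss weighting below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def modality_gap_loss(image_embeds: np.ndarray, text_embeds: np.ndarray) -> float:
    """Squared L2 distance between the centroids of image-token and
    text-token embeddings, to be added to the pretraining loss.

    image_embeds: (num_image_tokens, dim) projected visual embeddings
    text_embeds:  (num_text_tokens, dim) text-token embeddings
    """
    img_centroid = image_embeds.mean(axis=0)
    txt_centroid = text_embeds.mean(axis=0)
    return float(np.sum((img_centroid - txt_centroid) ** 2))

# In pretraining, the total objective would combine this penalty with the
# usual next-token loss, e.g.:
#   loss = next_token_loss + lambda_gap * modality_gap_loss(img, txt)
# where lambda_gap is a hyperparameter balancing the two terms.
```

A sketch like this illustrates why the method is lightweight: it adds only one extra term to the loss and touches no model weights or architecture directly.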

The proposed method was tested on several open LVLMs, including LLaVA v1.5, ShareGPT4V, and MiniGPT-4, and demonstrated substantial improvements in safety alignment. In particular, ReGap reduced the unsafe rate by up to 16.3% without compromising model performance. Moreover, when combined with existing defenses, ReGap enhanced their effectiveness by up to 18.2%.

Numerical Insights and Claims

  • Correlation Between Modality Gap and Unsafe Rate: The paper demonstrates a strong inverse relationship between modality gap and safety performance. Models with larger modality gaps exhibit higher unsafe rates.
  • Impact on Safety Metric Improvements: Reductions in unsafe rates across various benchmarks attest to the efficacy of this pretraining approach, with reported improvements of up to 24.3% in eliminating unsafe outputs.
  • Efficiency of ReGap in Boosting Other Defenses: ReGap's ability to work synergistically with pre-existing defense strategies indicates its adaptability across model architectures and a variety of datasets.
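To make the correlation claim concrete, the modality gap is often quantified as the distance between the centroids of the two modalities' normalized embeddings (a common convention in prior modality-gap work; the exact normalization and distance used by this paper are not specified here, so the choices below are assumptions):

```python
import numpy as np

def modality_gap(image_embeds: np.ndarray, text_embeds: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized
    image and text embeddings: one common measure of the modality gap."""
    def centroid(x: np.ndarray) -> np.ndarray:
        x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize each row
        return x.mean(axis=0)

    return float(np.linalg.norm(centroid(image_embeds) - centroid(text_embeds)))
```

Under the paper's finding, a model with a larger value of this measure would be expected to show a higher unsafe rate, and ReGap's pretraining penalty drives this value down.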

Implications and Speculations

This research has broad implications for both the practical deployment of LVLMs and the theoretical understanding of model alignment in multimodal settings. Practically, reducing the modality gap offers a lightweight mechanism for enhancing VLM safety, which can be crucial in applications requiring high standards of model trust and reliability. Theoretically, identifying the pretraining-induced modality gap as a key safety factor opens new avenues for understanding and improving robustness in multimodal learning systems.

Looking forward, the findings from this study may influence advances in AI safety and alignment strategies, potentially spurring further research into multimodal interactions and integrated safety measures. The approach of regularizing embeddings during pretraining, so that the alignment of the LLM backbone carries over to the multimodal setting, exemplifies a strategy that others in the field may leverage or build upon.

Overall, "Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap" contributes significant insights into the realm of safe and reliable AI, emphasizing the importance of foundational pretraining settings in determining model performance in complex operational frameworks.
