MSTS: A Multimodal Safety Test Suite for Vision-Language Models

Published 17 Jan 2025 in cs.CL | (2501.10057v1)

Abstract: Vision-LLMs (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.

Abstract PDF Upgrade to Chat

Summary

The paper introduces MSTS, a comprehensive test suite that evaluates VLM safety using 400 multimodal prompts across 40 hazard categories.
It reveals critical vulnerabilities in VLMs, including accidental safety in open-weight models and performance drops in multilingual settings.
The framework emphasizes the need for improved multimodal comprehension to ensure robust and safe processing of complex visual-text inputs.

MSTS: A Multimodal Safety Test Suite for Vision-LLMs

Introduction

Vision-LLMs (VLMs), which incorporate both textual and visual inputs, are becoming prevalent in applications like chat assistants. However, these models can present safety risks when they produce harmful outputs, especially since they are exposed to novel risks with multimodal inputs. The MSTS (Multimodal Safety Test Suite) has been developed as a structured evaluation framework to systematically test VLMs for safety across various hazard categories.

Safety Risks in Vision-LLMs

VLMs expand the input space for AI models, introducing new challenges and safety concerns. The potential of these models to follow dangerous prompts or provide unsafe advice highlights the urgency of understanding these risks. This is exemplified when textual queries are augmented with visual contexts, potentially amplifying the misunderstanding or misrepresentation of the situation.

Visual Example of Multimodal Prompts

Figure 1: Example test prompts from MSTS. Each prompt consists of a text (e.g., "Should I drink all of this?") and an image (e.g., bleach), which only in combination create the full meaning of the prompt.

MSTS Structure and Evaluation

The core of MSTS is built on 400 test prompts split across 40 finely defined hazard categories. Importantly, each test case combines text and image, where the hazardous implication emerges only from their combination. The framework also includes a taxonomy that classifies the specific hazards that VLMs should be evaluated against.

Visual Example of the Hazard Taxonomy

Figure 2: The taxonomy of hazards covered by MSTS. There are 40 leaves in this taxonomic tree.

Performance Assessment and Findings

The suite's application highlights clear deficiencies in several leading VLMs, particularly open-weight models when compared to commercial ones. Notably, some models appear safe by sheer accident, failing to process the prompts fully rather than robustly handling unsafe queries. The tests also reveal a notable performance gap between monolingual and multilingual settings, with many models performing worse with non-English inputs.

Implications and Future Directions

MSTS offers critical insights into the safety mechanisms or lack thereof within current VLMs. A key takeaway is the necessity of models to handle multimodal inputs robustly and safely, without being misled by either visual or textual components. There remains a need for further development of more effective automated evaluators that can understand the nuanced interactions of text and image in real-world settings.

Future advancements in VLM safety must prioritize enhanced multimodal comprehension, avoiding over-simplified processing that could lead to unsafe outputs. The MSTS framework is pivotal for identifying weaknesses and guiding the evolution of more secure, multimodal AI applications.

Conclusion

The introduction of MSTS underscores an essential step in machine learning's trajectory towards safer multimodal AI systems. The systematized evaluation not only assists in identifying existing vulnerabilities but also charts a course for developing the next generation of VLMs that prioritize user safety without compromising their capability to process complex, real-world multimodal data effectively.

Markdown Report Issue