Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Published 9 Jan 2025 in cs.CR, cs.AI, and cs.CL | (2501.04931v2)

Abstract: Multimodal LLMs (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Then we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack's performance on three benchmarks. In particular, SI-Attack can obviously improve the attack success rate for commercial MLLMs such as GPT-4o or Claude-3.5-Sonnet.


Summary

The paper introduces the SI-Attack method that exploits shuffle inconsistency to bypass MLLM safety filters, markedly increasing toxicity scores.
The methodology randomly shuffles text words and image patches, then uses query-based black-box optimization to select the shuffled inputs that a toxic judge model rates as most harmful.
Experimental validation shows significant rises in attack success rates across benchmarks, underscoring the urgent need for enhanced model safety protocols.

Jailbreaking Multimodal LLMs via Shuffle Inconsistency

The paper "Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency" addresses a vulnerability in Multimodal LLMs (MLLMs) and introduces an attack built on it, SI-Attack. The attack exploits Shuffle Inconsistency: the gap between MLLMs' comprehension ability and their safety ability when they are given shuffled harmful instructions. The authors demonstrate that both open-source and commercial models can be led to generate harmful content when the text and image inputs are shuffled.

Shuffle Inconsistency Exploration

Shuffle Inconsistency is the observation that MLLMs can still comprehend shuffled harmful text-image pairs, while their safety mechanisms are often bypassed by the same inputs. In other words, the models understand what a shuffled instruction is asking for, yet fail to respond with the refusal they would give to the unshuffled version.

Figure 1: Illustration of Shuffle Inconsistency for shuffled harmful instruction. For the comprehension ability, MLLMs can understand both the unshuffled and shuffled harmful text-image pairs; whereas the safety mechanisms are ineffective against shuffled inputs, leading to harmful outputs.

The study compares the toxicity scores of the models' responses under two conditions, original inputs and shuffled inputs, using ChatGPT-3.5 as the scorer. Shuffling raises the toxicity score of the MLLMs' outputs, which indicates that their safety mechanisms are less effective on shuffled inputs.

Figure 2: MLLMs' response toxic score for the original and shuffled harmful inputs, showing increased toxicity in shuffled scenarios.

SI-Attack Framework and Implementation

SI-Attack leverages Shuffle Inconsistency by splitting the text input into words and the image into patches, then randomly shuffling each. The shuffled input remains comprehensible to the MLLM but is more likely to slip past its safety mechanisms. Because any single random shuffle may fail, the attack uses a query-based black-box optimization: it queries the target model repeatedly and uses feedback from a toxic judge model to select the most harmful shuffled input.

Figure 3: Framework of SI-Attack showing iterative shuffling of text and image units to bypass MLLM defenses until successful synthesis of harmful outputs is achieved.

This framework is described as particularly effective against the outer safety guardrails of closed-source MLLMs, which usually block straightforward jailbreak attempts. The iterative optimization searches for the shuffled configurations most likely to elicit harmful responses.

Experimental Validation

The paper reports experiments on three benchmarks: MM-SafetyBench, HADES, and SafeBench. SI-Attack raises both toxicity scores and attack success rates on open-source and commercial MLLMs.

For instance, on MM-SafetyBench without typography, SI-Attack increased the attack success rate from 18.21% to 37.98% on open-source models and from 8.87% to 44.82% on closed-source models.
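
To put these gains on a common footing, the quoted success rates can be restated as absolute gains (in percentage points) and as multiples of the baseline. The sketch below only does arithmetic on the four figures quoted above; the helper name `improvement` and the dictionary layout are illustrative assumptions, not anything defined in the paper.

```python
def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, after/before ratio)."""
    return after - before, after / before


# Attack success rates (%) quoted in the text for MM-SafetyBench
# without typography: (baseline, with SI-Attack).
reported = {
    "open-source": (18.21, 37.98),
    "closed-source": (8.87, 44.82),
}

for group, (before, after) in reported.items():
    gain, ratio = improvement(before, after)
    print(f"{group}: +{gain:.2f} points, {ratio:.2f}x the baseline")
```

On these figures the open-source gain is about 19.8 points (roughly 2.1 times the baseline), while the closed-source gain is about 36 points (roughly 5.1 times the baseline), so the reported improvement is larger on the commercial models.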

Discussion and Implications

The discovery of Shuffle Inconsistency highlights a significant gap in the alignment of comprehension and safety mechanisms within MLLMs. The paper suggests that while these models possess advanced reasoning capabilities, they inadvertently amplify potential safety risks when these capabilities outpace the corresponding safety mechanisms.

This insight draws attention to the need for enhancing safety protocols in AI systems, ensuring that defense mechanisms scale alongside comprehension advancements. The authors propose further refinement of safety alignments, potentially through enhanced adversarial training and the integration of more sophisticated multi-layered safety checks.

Conclusion

The paper concludes by affirming the effectiveness of SI-Attack in uncovering vulnerabilities in state-of-the-art MLLMs. It is a reminder of how difficult it is to safeguard AI systems against adversarial inputs that exploit a model's own capabilities, here its ability to comprehend shuffled input. SI-Attack offers a new way to study multimodal jailbreaks and motivates the development of more resilient safety measures.
