- The paper introduces the SI-Attack method that exploits shuffle inconsistency to bypass MLLM safety filters, markedly increasing toxicity scores.
- The methodology employs random shuffling of text and image patches via black-box optimization, challenging the models’ inherent safety mechanisms.
- Experimental validation shows significant rises in attack success rates across benchmarks, underscoring the urgent need for enhanced model safety protocols.
Jailbreaking Multimodal LLMs via Shuffle Inconsistency
The paper "Jailbreaking Multimodal LLMs via Shuffle Inconsistency" addresses a critical vulnerability in Multimodal LLMs (MLLMs) focusing on a novel attack method termed as Shuffle Inconsistency (SI-Attack). This method exploits the inconsistency between MLLMs' comprehension and safety abilities when exposed to shuffled harmful instructions. The authors demonstrate that these models, including both open-source and commercial variants, can be tricked into generating harmful content through a strategic shuffling of input data.
Shuffle Inconsistency Exploration
The concept of Shuffle Inconsistency revolves around the observation that while MLLMs can comprehend shuffled harmful text-image pairs, their safety mechanisms are often bypassed by such input variants. This discrepancy is exemplified by the models' ability to recognize harmful intent even in shuffled formats, yet failing to engage appropriate defensive responses.
Figure 1: Illustration of Shuffle Inconsistency for shuffled harmful instruction. For the comprehension ability, MLLMs can understand both the unshuffled and shuffled harmful text-image pairs; whereas the safety mechanisms are ineffective against shuffled inputs, leading to harmful outputs.
The study investigates the models' response to toxicity scores derived from two conditions: original versus shuffled inputs, utilizing ChatGPT-3.5 for scoring. The results showed that shuffling increases the toxicity score of the MLLMs' outputs, indicating their compromised safety capabilities.

Figure 2: MLLMs' response toxic score for the original and shuffled harmful inputs, showing increased toxicity in shuffled scenarios.
SI-Attack Framework and Implementation
The SI-Attack leverages the Shuffle Inconsistency by fragmenting textual inputs into words and images into patches, which are then randomly shuffled. This harnesses the comprehension capabilities of MLLMs while bypassing their safety filters. The attack employs a query-based black-box optimization strategy to maximize the efficacy of harmful instruction delivery.
Figure 3: Framework of SI-Attack showing iterative shuffling of text and image units to bypass MLLM defenses until successful synthesis of harmful outputs is achieved.
This framework is particularly adept at navigating closed-source MLLMs' outer safety guardrails, which traditionally act as robust barriers against straightforward jailbreak attempts. The iterative optimization ensures the identification of the most potent shuffled configurations for securing harmful responses.
Experimental Validation
The paper details experiments conducted on various benchmarks, including MM-safetybench, HADES, and SafeBench. SI-Attack's efficacy is illustrated through substantial improvements in toxicity scores and attack success rates across both open-source and commercial MLLMs.
For instance, the SI-Attack achieved an increase in attack success rates from 18.21% to 37.98% on open-source models and from 8.87% to 44.82% on closed-source models when using MM-safetybench without typography.
Discussion and Implications
The discovery of Shuffle Inconsistency highlights a significant gap in the alignment of comprehension and safety mechanisms within MLLMs. The paper suggests that while these models possess advanced reasoning capabilities, they inadvertently amplify potential safety risks when these capabilities outpace the corresponding safety mechanisms.
This insight draws attention to the need for enhancing safety protocols in AI systems, ensuring that defense mechanisms scale alongside comprehension advancements. The authors propose further refinement of safety alignments, potentially through enhanced adversarial training and the integration of more sophisticated multi-layered safety checks.
Conclusion
The paper concludes by affirming the robustness of the SI-Attack method in uncovering vulnerabilities within state-of-the-art MLLMs. It serves as a crucial reminder of the complex challenges involved in safeguarding AI systems against adversarial exploits that cleverly manipulate inherent model characteristics. The proposed SI-Attack method provides a novel lens to understand and mitigate multimodal jailbreak scenarios, urging the development of more resilient safety measures in the rapidly evolving landscape of AI.