Defenses for Structure-Based MLLM Jailbreaks

Develop effective defense mechanisms for structure-based jailbreak attacks on multimodal large language models (MLLMs), in which harmful content is embedded within images and paired with crafted textual instructions to bypass safety alignment. Demonstrate that such defenses reliably prevent unauthorized or harmful model outputs across relevant threat scenarios.

Background

The paper categorizes jailbreak attacks on MLLMs into perturbation-based and structure-based types. While perturbation-based attacks have been extensively studied and defended against, structure-based attacks embed harmful content within images and pair it with tailored instructions to evade safety alignment, posing a substantially harder problem for defenses.
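To make the defense surface concrete, the following is a minimal input-side screening sketch in the spirit of input-filtering defenses such as AdaShield or JailGuard, not the method of any cited work. OCR via pytesseract is an assumed preprocessing choice, and `is_harmful` with its blocklist is a hypothetical placeholder for a real safety classifier.

```python
# Minimal sketch of an input-side defense against structure-based jailbreaks:
# recover any text rendered inside the image via OCR, then screen the image
# text and the user instruction jointly before they reach the MLLM.
# Hypothetical placeholders: is_harmful() and its blocklist stand in for a
# trained safety classifier; this is illustrative, not the paper's defense.

from PIL import Image
import pytesseract


def is_harmful(text: str) -> bool:
    """Hypothetical safety check; in practice, use a trained safety classifier."""
    blocklist = {"build a bomb", "bypass security", "synthesize the agent"}
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocklist)


def screen_multimodal_input(image_path: str, instruction: str) -> bool:
    """Return True if the request should be refused.

    Structure-based attacks hide the harmful payload in the image, so the
    textual instruction alone often looks benign; OCR recovers the payload
    so both channels can be checked together.
    """
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    combined = f"{instruction}\n{ocr_text}"
    return is_harmful(combined)
```

The key design point the sketch illustrates is joint screening: checking the textual instruction in isolation misses attacks whose harmful payload lives entirely in the image.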

Within this context, the authors explicitly note that developing effective defense mechanisms for structure-based attacks remains an open research problem. They further highlight that black-box settings, which reflect more realistic and therefore stronger threat assumptions, exacerbate the difficulty, underscoring the need for robust, generalizable defenses.
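Because a black-box defender has no access to model weights, gradients, or internal activations, defenses in this setting must operate purely on inputs and outputs. A minimal sketch of an output-side guard under that assumption follows, loosely in the spirit of response-checking defenses such as ECSO; `query_mllm` and `safety_judge` are hypothetical stand-ins for a deployed model endpoint and a moderation model, not the paper's method.

```python
# Minimal sketch of a black-box, output-side defense: the MLLM's draft
# response is judged for harmfulness and replaced with a refusal if flagged.
# Hypothetical stand-ins: query_mllm is an opaque model endpoint and
# safety_judge is a text-only moderation model; neither comes from the paper.

from typing import Callable

REFUSAL = "I can't help with that request."


def guarded_generate(
    query_mllm: Callable[[str, str], str],   # (image_path, instruction) -> response
    safety_judge: Callable[[str], bool],     # response -> True if harmful
    image_path: str,
    instruction: str,
) -> str:
    draft = query_mllm(image_path, instruction)
    # The judge sees only text, so this wrapper applies to any MLLM behind an
    # API, matching the black-box threat model described above.
    return REFUSAL if safety_judge(draft) else draft
```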

References

In contrast to the relatively minor perturbations characteristic of the first category, structure-based attacks pose a more profound challenge, as the development of effective defense mechanisms remains an open research problem [dress, adashield, mllmprotector, ECSO, jailguard].

Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses (2510.21214 - Zhong et al., 24 Oct 2025) in Section 2.2 (Jailbreak Attacks on MLLMs)