MIR-SafetyBench: MLLM Safety Benchmark
- MIR-SafetyBench is a safety benchmark for MLLMs that evaluates multi-image relational reasoning across temporal, spatial, semantic, and logical dimensions.
- The framework uses a five-stage pipeline to generate instances from harmful seed questions, synthetic images, and iterative human quality checks.
- Experimental results reveal a high attack success rate and lower attention entropy in unsafe outputs, highlighting a capability-risk trade-off in advanced MLLMs.
MIR-SafetyBench is a safety benchmarking framework for Multimodal LLMs (MLLMs), specifically targeting vulnerabilities that arise from multi-image reasoning capabilities. Unlike traditional safety benchmarks that test models on single images containing explicit harmful content, MIR-SafetyBench probes scenarios where models must integrate information from multiple benign images and follow complex instructions, exposing latent pathways to unsafe or malicious outputs. The benchmark comprises 2,676 instances covering nine fine-grained multi-image relation types and provides a systematic platform for academic and industry researchers to evaluate and improve reasoning-based safety in MLLMs (Chen et al., 20 Jan 2026).
1. Benchmark Structure and Taxonomy
MIR-SafetyBench is organized around nine multi-image relation types, clustered into four high-level categories: Temporal, Spatial, Semantic, and Logical relations. Each instance includes a textual prompt and two to four synthetically generated images. The nine relations are:
| Category | Fine-Grained Type | Instance Count |
|---|---|---|
| Temporal | Continuity | 317 |
| Temporal | Jump | 303 |
| Spatial | Juxtaposition | 292 |
| Spatial | Embedding | 293 |
| Semantic | Relevance | 152 |
| Semantic | Complementarity | 280 |
| Logical | Analogy | 318 |
| Logical | Causality | 280 |
| Logical | Decomposition | 441 |
Each category challenges models by requiring integration of non-obvious cross-image information, often masking harmful intent behind relational reasoning. For example, Temporal Continuity presents sequential scenes demanding prediction of subsequent harmful activities, and Spatial Juxtaposition tests the ability to infer plans across distinct locations.
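The instance structure described above (a textual prompt plus two to four images, tagged with a relation type and risk domain) can be sketched as a simple schema. This is an illustrative assumption, not the benchmark's actual file format, and the field names are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical schema for one MIR-SafetyBench instance; field names
# are illustrative assumptions, not the benchmark's released format.
@dataclass
class MIRInstance:
    prompt: str            # indirect textual instruction
    image_paths: list      # two to four synthetic images
    category: str          # Temporal / Spatial / Semantic / Logical
    relation: str          # one of the nine fine-grained types
    risk_domain: str       # e.g. Violence, Privacy

inst = MIRInstance(
    prompt="Describe what happens next across these scenes.",
    image_paths=["scene_1.png", "scene_2.png", "scene_3.png"],
    category="Temporal",
    relation="Continuity",
    risk_domain="Violence",
)
assert 2 <= len(inst.image_paths) <= 4
```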
2. Benchmark Construction Pipeline
The creation of MIR-SafetyBench involves a five-stage, iterative pipeline:
- Seed Question Curation: 600 harmful seed questions are selected, spanning six risk domains (Hate, Harassment, Violence, Self-Harm, Illegal Activities, Privacy). Initial seeds are drawn from existing safety datasets, triaged by QwQ-32B, and manually validated.
- Instance Generation: For each seed and relation type:
- A rewriter module formulates indirect prompts, image descriptions, and keywords.
- Images are synthesized using FLUX.1.
- Qwen2.5-VL-7B generates trial answers.
- HarmBench Judge (Llama-2-13B-cls) classifies answers as harmful or safe.
- DeepSeek-R1 evaluates four criteria: genuine harm, prompt neutrality, relation fidelity, and adherence to the original seed.
- Instances failing criteria are refined iteratively, up to five cycles.
- Human Quality Controls: Four expert annotators spot-check instances and validate each relation's coverage, achieving >90% inter-annotator agreement across sampled instances.
This methodology ensures breadth and high fidelity in benchmarking subtle, multi-image-driven risks.
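The refinement loop in the pipeline above can be sketched as follows. The helper callables are hypothetical stand-ins for the paper's modules (the rewriter, FLUX.1 image synthesis, Qwen2.5-VL-7B trial answers, the HarmBench judge, and DeepSeek-R1 criteria checks); only the control flow reflects the described method:

```python
# Sketch of the iterative instance-generation loop, assuming stand-in
# callables for the pipeline's modules described in the text.
MAX_CYCLES = 5  # failing instances are refined for up to five cycles

def build_instance(seed, relation, rewrite_prompt, synthesize_images,
                   trial_answer, judge_harmful, check_criteria):
    for _ in range(MAX_CYCLES):
        prompt, descriptions = rewrite_prompt(seed, relation)
        images = synthesize_images(descriptions)
        answer = trial_answer(prompt, images)
        # Keep only instances whose trial answer is judged harmful AND
        # that pass all four criteria (genuine harm, prompt neutrality,
        # relation fidelity, adherence to the original seed).
        if judge_harmful(answer) and check_criteria(seed, prompt, images, relation):
            return prompt, images
    return None  # discarded after five failed refinement cycles
```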
3. Safety Evaluation Protocol and Metrics
MIR-SafetyBench evaluation proceeds with the following workflow:
- Prompting: Each model receives the textual prompt together with the stitched multi-image input (or a single concatenated image canvas for single-image baselines).
- Output Assessment: Free-form model responses are assessed in a single turn; only the final output is classified.
- Harm Classification: HarmBench labels model outputs as “unsafe” if they provide harmful instructions.
- Key Metric, Attack Success Rate (ASR): the fraction of benchmark instances for which the model's output is classified as unsafe.
- Safety Mode Breakdown: Expert judges classify refused answers into Correct Refusal (explicit harm explanation), Harmless Misunderstanding (interpretative error), Incomplete Refusal (evasive non-engagement), and Clever Evasion (subtle evasion).
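The scoring step described above reduces to a small computation: ASR is the fraction of outputs classified unsafe, and the remaining (safe) responses are binned into the four safety modes. A minimal sketch, with label abbreviations assumed for illustration:

```python
from collections import Counter

def score(judgements, refusal_modes):
    """judgements: one bool per instance, True = classified unsafe.
    refusal_modes: one mode label per *safe* response
    (CR / HM / IR / CE, abbreviations assumed for illustration)."""
    asr = sum(judgements) / len(judgements)
    modes = Counter(refusal_modes)
    return asr, modes

# Toy example: 3 of 5 outputs unsafe -> ASR = 0.6; the two safe
# responses split into one Correct Refusal and one Harmless Misunderstanding.
asr, modes = score([True, True, False, False, True], ["CR", "HM"])
assert abs(asr - 0.6) < 1e-9
```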
4. Attention Entropy Analysis in Reasoning Safety
A central insight of MIR-SafetyBench is the application of attention entropy as an internal safety signature. Attention entropy quantifies the distribution of attention weights during generation:
- For each layer $l$, head $h$, answer token $t$, and context position $j$, $\alpha^{(l,h)}_{t,j}$ is the normalized self-attention weight, with $\sum_j \alpha^{(l,h)}_{t,j} = 1$.
- Per-token, head-averaged entropy: $H^{(l)}_t = -\frac{1}{H}\sum_{h=1}^{H}\sum_{j}\alpha^{(l,h)}_{t,j}\log \alpha^{(l,h)}_{t,j}$.
- The per-token entropies are averaged over the answer segment, $\bar{H}^{(l)} = \frac{1}{T}\sum_{t=1}^{T} H^{(l)}_t$, and these segment means are then averaged separately over the safe and unsafe response populations for comparison.
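The head-averaged entropy defined above can be computed directly from an attention tensor. A minimal NumPy sketch, with the tensor shape assumed for a single layer:

```python
import numpy as np

def token_attention_entropy(attn):
    """Head-averaged attention entropy per answer token.

    attn: array of shape (heads, answer_tokens, context_positions),
    normalized so each attn[h, t, :] sums to 1. This sketches the
    per-layer quantity; shapes and naming are assumptions.
    """
    eps = 1e-12  # guard against log(0)
    per_head = -np.sum(attn * np.log(attn + eps), axis=-1)  # (heads, tokens)
    return per_head.mean(axis=0)                            # (tokens,)

# Uniform attention maximizes entropy (log J); a peaked distribution,
# the "hyper-focused" unsafe signature, drives entropy toward zero.
J = 8
uniform = np.full((2, 1, J), 1.0 / J)
peaked = np.zeros((2, 1, J)); peaked[:, :, 0] = 1.0
assert token_attention_entropy(uniform)[0] > token_attention_entropy(peaked)[0]
```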
Findings reveal unsafe generations are associated with lower attention entropy, indicating models concentrate attention on reasoning steps, often neglecting safety constraints. In contrast, single-image tasks do not show entropy-based separation between safe and unsafe responses. This suggests that the reasoning load of multi-image scenarios may crowd out safety checks internally.
5. Experimental Results Across Model Families
Empirical evaluations across 19 MLLMs highlight the severity of multi-image reasoning vulnerabilities:
- Single-image models: ASR ≈ 46–61% (e.g., LLaVA-v1.5-7B).
- Strong reasoning models: ASR up to 88% (e.g., GLM-4.1V-9B-Thinking).
- Closed-source models: GPT-4o reaches ≈ 70% ASR (GPT-4o-mini ≈ 59%; Gemini series, GPT-5.1 < 40%).
- Relation-type breakdown: Logical Decomposition and Causality produce the highest vulnerability (ASR > 85%); Semantic Relevance is more defendable (ASR ≈ 50%).
- Single vs. Multi-image: Reframing from a single harmful image to a multi-image relational puzzle can increase ASR by 40–50 percentage points (e.g., GPT-4o: ≈ 19% → ≈ 65%).
- Safety mode breakdown: Harmless Misunderstandings (HM) account for >20% of "safe" outputs, while Correct Refusals (CR) are rare outside frontier closed-source systems.
6. Underlying Risk Factors and Cognitive Hypotheses
MIR-SafetyBench substantiates a capability-risk trade-off: advanced multi-image reasoning augments safety vulnerabilities. Key hypotheses include:
- Cognitive Overload: Complex relational reasoning can saturate the model's finite attention budget, crowding out safety-relevant evidence (as indicated by lower attention entropy).
- Implicit Justification: Step-by-step reasoning chains can reinforce the harmful goal, making it difficult for post-generation safety layers to filter outputs where intent is distributed across images and inference steps.
A plausible implication is that further increases in reasoning prowess may unintentionally widen models’ vulnerability surfaces unless safety is proactively embedded into reasoning mechanisms.
7. Mitigation Strategies and Future Research Directions
MIR-SafetyBench motivates several alignment and mitigation approaches:
- Real-time monitoring of early-chain attention entropy, flagging low-entropy “hyper-focused” responses.
- Inserting counterfactual safety checks, e.g., “Is this safe?” queries within the reasoning chain.
- Contrastive learning on MIR-SafetyBench adversarial multi-image examples to sharpen model distinction between unsafe and benign reasoning paths.
- Extension to multi-modal (e.g., code plus images) and interactive, multi-turn settings, investigating how dialogue context modulates risk.
- Development of regularizers that maximize dispersion across both task-relevant and safety-sensitive tokens.
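The first mitigation, real-time entropy monitoring, can be sketched as a simple threshold check over the early portion of a generation. The window size and threshold here are illustrative assumptions; in practice they would be calibrated on a population of known-safe responses:

```python
def flag_hyper_focus(entropy_per_token, early_window=16, threshold=1.0):
    """Flag a generation whose early-chain attention entropy is low.

    entropy_per_token: per-token entropies for the generated answer.
    early_window and threshold are hypothetical values; a deployment
    would calibrate them against entropy statistics of safe responses.
    Returns True when the response should be routed to an extra check.
    """
    early = entropy_per_token[:early_window]
    mean_early = sum(early) / len(early)
    return mean_early < threshold

# A "hyper-focused" (low-entropy) chain is flagged; a dispersed one is not.
assert flag_hyper_focus([0.2] * 20) is True
assert flag_hyper_focus([2.5] * 20) is False
```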
The findings from MIR-SafetyBench indicate the need for alignment strategies sensitive to complex relational inference, especially as models are pushed toward sophisticated, real-world multimodal reasoning (Chen et al., 20 Jan 2026).