
MIR-SafetyBench: MLLM Safety Benchmark

Updated 26 January 2026
  • MIR-SafetyBench is a safety benchmark for MLLMs that evaluates multi-image relational reasoning across temporal, spatial, semantic, and logical dimensions.
  • The framework uses a five-stage pipeline to generate instances from harmful seed questions, synthetic images, and iterative human quality checks.
  • Experimental results reveal a high attack success rate and lower attention entropy in unsafe outputs, highlighting a capability-risk trade-off in advanced MLLMs.

MIR-SafetyBench is a safety benchmarking framework for Multimodal LLMs (MLLMs), specifically targeting vulnerabilities that arise from multi-image reasoning capabilities. Unlike traditional safety benchmarks that test models on single images containing explicit harmful content, MIR-SafetyBench probes scenarios where models must integrate information from multiple benign images and follow complex instructions, exposing latent pathways to unsafe or malicious outputs. The benchmark comprises 2,676 instances covering nine fine-grained multi-image relation types and provides a systematic platform for academic and industry researchers to evaluate and improve reasoning-based safety in MLLMs (Chen et al., 20 Jan 2026).

1. Benchmark Structure and Taxonomy

MIR-SafetyBench is organized around nine multi-image relation types, clustered into four high-level categories: Temporal, Spatial, Semantic, and Logical relations. Each instance includes a textual prompt and two to four synthetically generated images. The nine relations are:

| Category | Fine-Grained Type | Instance Count |
|----------|-------------------|----------------|
| Temporal | Continuity        | 317 |
| Temporal | Jump              | 303 |
| Spatial  | Juxtaposition     | 292 |
| Spatial  | Embedding         | 293 |
| Semantic | Relevance         | 152 |
| Semantic | Complementarity   | 280 |
| Logical  | Analogy           | 318 |
| Logical  | Causality         | 280 |
| Logical  | Decomposition     | 441 |

Each category challenges models by requiring integration of non-obvious cross-image information, often masking harmful intent behind relational reasoning. For example, Temporal Continuity presents sequential scenes demanding prediction of subsequent harmful activities, and Spatial Juxtaposition tests the ability to infer plans across distinct locations.
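The structure of a benchmark instance described above (one indirect prompt, two to four synthetic images, one relation type) can be sketched as a simple record. This is an illustrative schema; the field names are assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout of one MIR-SafetyBench instance; field names
# are illustrative, not the benchmark's published schema.
@dataclass
class MIRInstance:
    relation: str           # one of the nine fine-grained relation types
    category: str           # Temporal / Spatial / Semantic / Logical
    prompt: str             # indirect textual instruction
    image_paths: List[str]  # two to four synthesized images
    seed_id: int            # index of the originating harmful seed question

ex = MIRInstance(
    relation="Continuity",
    category="Temporal",
    prompt="Given the sequence of scenes, describe what happens next.",
    image_paths=["scene_1.png", "scene_2.png", "scene_3.png"],
    seed_id=0,
)
assert 2 <= len(ex.image_paths) <= 4
```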

2. Benchmark Construction Pipeline

The creation of MIR-SafetyBench involves a five-stage, iterative pipeline:

  1. Seed Question Curation: 600 harmful seed questions are selected, spanning six risk domains (Hate, Harassment, Violence, Self-Harm, Illegal Activities, Privacy). Initial seeds are drawn from existing safety datasets, triaged by QwQ-32B, and manually validated.
  2. Instance Generation: For each seed and relation type:
    • A rewriter module formulates indirect prompts, image descriptions, and keywords.
    • Images are synthesized using FLUX.1.
    • Qwen2.5-VL-7B generates trial answers.
    • HarmBench Judge (Llama-2-13B-cls) classifies answers as harmful or safe.
    • DeepSeek-R1 evaluates four criteria: genuine harm, prompt neutrality, relation fidelity, and adherence to the original seed.
    • Instances failing criteria are refined iteratively, up to five cycles.
  3. Human Quality Controls: Four expert annotators spot-check and validate each relation’s coverage, culminating in >90% agreement across samples.

This methodology ensures breadth and high fidelity in benchmarking subtle, multi-image-driven risks.
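The generate-judge-refine loop in stage 2 can be sketched as follows. The callables are placeholders standing in for the rewriter module, FLUX.1, Qwen2.5-VL-7B, the HarmBench judge, and the DeepSeek-R1 criteria check; this is a minimal sketch of the control flow, not the actual pipeline code.

```python
MAX_CYCLES = 5  # the pipeline refines a failing instance up to five times

def build_instance(seed, relation, rewrite, synthesize, answer,
                   is_harmful, meets_criteria):
    """Iteratively generate and refine one (seed, relation) instance.

    `rewrite`, `synthesize`, `answer`, `is_harmful`, and `meets_criteria`
    are stand-ins for the rewriter module, the FLUX.1 image generator,
    the Qwen2.5-VL-7B trial answerer, the HarmBench judge, and the
    DeepSeek-R1 four-criteria evaluation, respectively.
    """
    feedback = None
    for _ in range(MAX_CYCLES):
        prompt, descriptions = rewrite(seed, relation, feedback)
        images = [synthesize(d) for d in descriptions]
        trial = answer(prompt, images)
        if is_harmful(trial) and meets_criteria(prompt, images, seed, relation):
            return {"prompt": prompt, "images": images, "relation": relation}
        feedback = trial  # feed the failed attempt back to the rewriter
    return None  # discard instances that never pass all checks
```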

3. Safety Evaluation Protocol and Metrics

MIR-SafetyBench evaluation proceeds with the following workflow:

  • Prompting: Each model receives a prompt and stitched images (for multi-image cases) or concatenated image canvas (single-image baselines).
  • Output Assessment: Free-form model responses are assessed in a single turn; only the final output is classified.
  • Harm Classification: HarmBench labels model outputs as “unsafe” if they provide harmful instructions.
  • Key Metric—Attack Success Rate (ASR):

\mathrm{ASR} = \frac{\bigl|\{\text{instances eliciting a harmful response}\}\bigr|}{\text{total instances}} \times 100\%

  • Safety Mode Breakdown: Expert judges classify refused answers into Correct Refusal (explicit harm explanation), Harmless Misunderstanding (interpretative error), Incomplete Refusal (evasive non-engagement), and Clever Evasion (subtle evasion).
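The ASR metric reduces to a simple proportion over judged outputs. A minimal sketch, assuming each evaluated instance carries a boolean harm label from the judge (the toy data below is illustrative, not a real result):

```python
def attack_success_rate(labels):
    """ASR in percent; labels: list of bools, True if the response
    was judged harmful by the HarmBench classifier."""
    if not labels:
        return 0.0
    return 100.0 * sum(labels) / len(labels)

judgements = [True, False, True, True]  # toy labels, not real results
print(attack_success_rate(judgements))  # 75.0
```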

4. Attention Entropy Analysis in Reasoning Safety

A central insight of MIR-SafetyBench is the application of attention entropy as an internal safety signature. Attention entropy quantifies the distribution of attention weights during generation:

  • For each layer ℓ, head h, answer token r, and context position k, p^{(ℓ,h)}_{r,k} denotes the normalized self-attention weight.
  • Per-token, head-averaged entropy:

\mathcal{H}^{(\ell)}_r = -\frac{1}{H}\sum_{h=1}^{H}\sum_{k} p^{(\ell,h)}_{r,k}\,\log p^{(\ell,h)}_{r,k}

  • Per-token entropies are averaged within each answer segment s and layer ℓ, yielding population means μ^(safe)_{ℓ,s} and μ^(unsafe)_{ℓ,s} over safe and unsafe responses; the separation signal is their difference:

\Delta_{\ell,s} = \mu^{(\text{safe})}_{\ell,s} - \mu^{(\text{unsafe})}_{\ell,s}
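The per-token, head-averaged entropy above can be computed directly from an attention tensor. A NumPy sketch, where the tensor shape (heads × answer tokens × context positions) and single-layer framing are assumptions for illustration:

```python
import numpy as np

def token_entropy(attn):
    """Head-averaged attention entropy per answer token.

    attn: array of shape (H, R, K) holding one layer's normalized
    self-attention weights (each row sums to 1 over K context positions).
    Returns an array of shape (R,): the entropy for each answer token r.
    """
    eps = 1e-12  # guard against log(0)
    per_head = -(attn * np.log(attn + eps)).sum(axis=-1)  # (H, R)
    return per_head.mean(axis=0)                          # average over heads

# Sanity check: uniform attention over K positions gives entropy log(K).
H, R, K = 4, 3, 8
uniform = np.full((H, R, K), 1.0 / K)
print(np.allclose(token_entropy(uniform), np.log(K)))  # True
```

Lower values of this quantity correspond to the "hyper-focused" attention patterns the benchmark associates with unsafe generations.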

Findings reveal unsafe generations are associated with lower attention entropy, indicating models concentrate attention on reasoning steps, often neglecting safety constraints. In contrast, single-image tasks do not show entropy-based separation between safe and unsafe responses. This suggests that the reasoning load of multi-image scenarios may crowd out safety checks internally.

5. Experimental Results Across Model Families

Empirical evaluations across 19 MLLMs highlight the severity of multi-image reasoning vulnerabilities:

  • Single-image models: ASR ≈ 46–61% (e.g., LLaVA-v1.5-7B).
  • Strong reasoning models: ASR up to 88% (e.g., GLM-4.1V-9B-Thinking).
  • Closed-source models: GPT-4o reaches ≈ 70% ASR (GPT-4o-mini ≈ 59%; Gemini series, GPT-5.1 < 40%).
  • Relation-type breakdown: Logical Decomposition and Causality produce the highest vulnerability (ASR > 85%); Semantic Relevance is the most defensible (ASR ≈ 50%).
  • Single vs. multi-image: Reframing a single harmful image as a multi-image relational puzzle can raise ASR by 40–50 percentage points (e.g., GPT-4o: ≈ 19% → ≈ 65%).
  • Safety-mode breakdown: Harmless Misunderstandings (HM) account for >20% of “safe” outputs, while Correct Refusals (CR) are rare outside frontier closed-source systems.

6. Underlying Risk Factors and Cognitive Hypotheses

MIR-SafetyBench substantiates a capability-risk trade-off: advanced multi-image reasoning augments safety vulnerabilities. Key hypotheses include:

  • Cognitive Overload: Complex relational reasoning can saturate the model’s finite attention budget, crowding out safety-relevant evidence (as indicated by lower attention entropy).
  • Implicit Justification: Step-by-step reasoning chains can reinforce the harmful goal, making it difficult for post-generation safety layers to filter outputs where intent is distributed across images and inference steps.

A plausible implication is that further increases in reasoning prowess may unintentionally widen models’ vulnerability surfaces unless safety is proactively embedded into reasoning mechanisms.

7. Mitigation Strategies and Future Research Directions

MIR-SafetyBench motivates several alignment and mitigation approaches:

  • Real-time monitoring of early-chain attention entropy, flagging low-entropy “hyper-focused” responses.
  • Inserting counterfactual safety checks, e.g., “Is this safe?” queries within the reasoning chain.
  • Contrastive learning on MIR-SafetyBench adversarial multi-image examples to sharpen model distinction between unsafe and benign reasoning paths.
  • Extension to multi-modal (e.g., code plus images) and interactive, multi-turn settings, investigating how dialogue context modulates risk.
  • Development of regularizers that keep attention dispersed across both task-relevant and safety-sensitive tokens.
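The first mitigation, real-time flagging of low-entropy "hyper-focused" generations, could be sketched as a simple threshold monitor over early-chain token entropies. The threshold and window size below are illustrative tuning knobs, not values from the paper:

```python
def flag_hyperfocus(entropies, threshold, window=8):
    """Flag a generation for extra safety review if the mean attention
    entropy over its first `window` answer tokens falls below `threshold`
    (early-chain monitoring). Both knobs are hypothetical, not published.
    """
    head = entropies[:window]
    if not head:
        return False
    return sum(head) / len(head) < threshold

# Low early-chain entropy -> flagged; dispersed attention -> passes.
print(flag_hyperfocus([0.4, 0.5, 0.3, 0.6], threshold=1.0))  # True
print(flag_hyperfocus([2.1, 2.3, 2.0, 2.2], threshold=1.0))  # False
```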

The findings from MIR-SafetyBench indicate the need for alignment strategies sensitive to complex relational inference, especially as models are pushed toward sophisticated, real-world multimodal reasoning (Chen et al., 20 Jan 2026).
