
Multimodal Safety Benchmarks

Updated 14 January 2026
  • Multimodal safety benchmarks are standardized frameworks that define risk taxonomies, modality coverage, and scenario complexity to assess model safety and robustness.
  • They integrate manual and LLM-assisted annotation with synthetic data generation and adversarial augmentation to capture real-world safety challenges.
  • Quantitative metrics such as Attack Success Rate and Cross-Modal Safety Consistency guide improvements in model alignment and mitigate vulnerabilities.

Multimodal safety benchmarks provide standardized datasets, evaluation protocols, and metrics for assessing the safety, robustness, and alignment of multimodal foundation models, which process and generate content conditioned on multiple modalities such as text, images, audio, and video. These benchmarks are foundational for measuring whether such models can reliably detect and refuse unsafe requests, avoid over-cautious refusals of harmless content, and maintain alignment under adversarial manipulation, changing contexts, dynamic prompts, and complex joint-modality attacks.

1. Taxonomy and Design Principles of Multimodal Safety Benchmarks

Multimodal safety benchmarks are structured along three principal axes: risk taxonomy, modality coverage, and scenario complexity.

Risk Taxonomy: Benchmarks organize risk into multi-level taxonomies. USB defines a three-level hierarchy (primary, secondary, tertiary) with 3 primary classes, 16 secondary branches, and 61 tertiary sub-categories, spanning national, public, and ethical safety, including risks such as violent content, privacy leaks, medical misinformation, self-harm, and political hazards (Zheng et al., 26 May 2025). OutSafe-Bench structures content risks into nine explicit categories: privacy & property, prejudice & discrimination, crime & illegal activities, ethics & morality, violence & hatred, misinformation, political sensitivity, physical & mental health, and copyright (Yan et al., 13 Nov 2025). Such granular stratification enables fine-grained measurement of model blind spots.
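A multi-level taxonomy like USB's can be represented as a simple nested mapping from primary classes to secondary branches to tertiary sub-categories. The sketch below is illustrative only: it uses the risk names quoted above, but the grouping of secondary branches is an assumption, not USB's actual label hierarchy.

```python
# Minimal sketch of a three-level risk taxonomy in the style of USB
# (primary -> secondary -> tertiary). The grouping shown here is
# illustrative; only the leaf risk names come from the text above.
TAXONOMY = {
    "public_safety": {                                        # primary
        "health": ["medical_misinformation", "self_harm"],    # secondary -> tertiary
        "violence": ["violent_content"],
    },
    "national_safety": {
        "politics": ["political_hazards"],
    },
    "ethical_safety": {
        "privacy": ["privacy_leaks"],
    },
}

def tertiary_to_path(taxonomy, leaf):
    """Return the (primary, secondary, tertiary) path for a tertiary sub-category."""
    for primary, branches in taxonomy.items():
        for secondary, leaves in branches.items():
            if leaf in leaves:
                return (primary, secondary, leaf)
    raise KeyError(leaf)
```

Storing the taxonomy as data rather than code makes it easy to report attack-success and refusal rates per sub-category, which is what enables the fine-grained blind-spot measurement described above.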

Modality Combinations: Benchmarks systematically cover all possible risk loci across modalities. USB tests four configurations: RIRT (Risky-Image & Risky-Text), RIST (Risky-Image & Safe-Text), SIRT (Safe-Image & Risky-Text), and SIST (Safe-Image & Safe-Text, where risk emerges only jointly) (Zheng et al., 26 May 2025). Omni-SafetyBench extends this to 24 input format permutations, including audio-only, image+audio, video+text, and full omni-modal (text+image+audio+video), critical for modern OLLMs (Pan et al., 10 Aug 2025).
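USB's four image/text configurations are simply the cross product of a safe/risky flag over each modality; the helper below makes that explicit (Omni-SafetyBench's 24 input-format permutations arise differently, from varying the input encodings, and are not modeled here).

```python
from itertools import product

# USB's four configurations (RIRT, RIST, SIRT, SIST) fall out of a
# cross product of per-modality risk flags over (Image, Text).
def risk_configs(modalities=("Image", "Text")):
    """Enumerate every Risky/Safe assignment across the given modalities."""
    configs = []
    for flags in product(("Risky", "Safe"), repeat=len(modalities)):
        # e.g. ("Risky", "Safe") over (Image, Text) -> "RIST"
        code = "".join(flag[0] + modality[0] for flag, modality in zip(flags, modalities))
        configs.append(code)
    return configs
```

Extending the tuple of modalities (e.g. adding audio) doubles the number of configurations per added modality, which is why exhaustive coverage becomes expensive for omni-modal models.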

Scenario Complexity and Dynamics: Recent benchmarks incorporate multi-turn dialogues (SafeMT (Zhu et al., 14 Oct 2025), MTMCS-Bench (Liu et al., 11 Jan 2026)), dynamic benchmark perturbation (SDEval (Wang et al., 8 Aug 2025)), proactive risk assessment (PaSBench (Yuan et al., 23 May 2025)), and situations where safety is only determinable through joint reasoning over context (VLSU (Palaskar et al., 21 Oct 2025), MSSBench (Zhou et al., 2024)). This shift from static, single-turn, text-only benchmarks reflects the need to mirror complex, real-world use cases.

2. Dataset Construction, Annotation, and Generation Techniques

Benchmark construction involves a combination of curated manual annotation, semi-automatic data generation, and adversarial example mining.

Manual and LLM-Assisted Annotation: MMSafeAware (Wang et al., 16 Feb 2025) and VLSU (Palaskar et al., 21 Oct 2025) employ multi-stage human expert annotation, with agreement metrics (e.g., inter-annotator F1 > 91%) and policies mandating that both image and text are jointly rated for safety and for nuanced “borderline” severity cases.

Synthetic and Privacy-Preserving Data Generation: SynSHRP2 (Shi et al., 6 May 2025) uses a pipeline of super-resolution, semantic segmentation, ControlNet-guided Stable Diffusion, and IP-Adapter for de-identified dash-cam synthesis, preserving traffic semantics while eliminating PII. The synthesis process produces crash and near-crash sequences with temporally consistent, privacy-safe keyframes and kinematics.

Adversarial and Dynamic Augmentation: SDEval (Wang et al., 8 Aug 2025) dynamically perturbs safety suites at the text level (paraphrase, typo, mixed-language), image level (augmentation, style transfer, caption-guided regeneration), and cross-modal level (text-image injection, multimodal jailbreaks), yielding benchmarks that expose model vulnerabilities invisible in static datasets and mitigate training-data contamination.
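A flavor of the text-level perturbation can be sketched with a single typo-injection operator. This is a deliberately simplified stand-in: SDEval's actual operators (paraphrase, style transfer, caption-guided regeneration) are model-driven, and the function below is not part of the benchmark.

```python
import random

# Illustrative typo-injection operator in the spirit of SDEval's
# text-level perturbations; the real benchmark uses model-driven
# operators, so this is a hand-rolled approximation only.
def inject_typos(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent characters in a random fraction of words to simulate typos."""
    rng = random.Random(seed)  # fixed seed keeps the perturbation reproducible
    words = prompt.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)
```

Seeding the perturbation makes the dynamically generated suite reproducible across evaluation runs, which matters when comparing models on the same perturbed distribution.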

Multi-Turn Dialogue Expansion: SafeMT (Zhu et al., 14 Oct 2025) and MTMCS-Bench (Liu et al., 11 Jan 2026) generate, for every risky seed, paired safe and unsafe dialogues under multiple escalation and context-switch attack strategies, ensuring class parity and stress-testing dialogue-tracking resilience.

Automated Quality Control and Attacker-in-the-Loop: USB (Zheng et al., 26 May 2025) and OutSafe-Bench (Yan et al., 13 Nov 2025) implement multi-step data vetting, including LLM-based rejection for low-aggressiveness prompts, attack-boosting loops to ensure successful model compromise, and content alignment verifiers.

3. Evaluation Protocols and Quantitative Metrics

Benchmarks define multi-dimensional metrics to quantify safety robustness, over-blocking, adversarial vulnerability, and cross-modal generalization.

Core Metrics:

  • Attack Success Rate (ASR): the fraction of adversarial inputs that elicit unsafe outputs.
  • Refusal rate (ARR): the over-blocking counterpart to ASR, capturing refusals of benign content (USB).
  • Cross-Modal Safety Consistency (CMSC): agreement of safety behavior across input-format variants of the same request (Omni-SafetyBench).
  • Safety Index (SI): aggregate dialogue-level safety in multi-turn settings (SafeMT, MTMCS-Bench).
  • MCRS: multi-category cross-modal risk scoring (OutSafe-Bench); RJScore: judge-based video safety score (Video-SafetyBench); SRI: scenario-level safety rating (SafeBench).

Evaluation Protocols:

  • Jury Deliberation: SafeBench (Ying et al., 2024) evaluates each sample independently with a five-model LLM “jury,” using chain-of-thought rationales, peer broadcast/revision, and majority voting.
  • FairScore: OutSafe-Bench (Yan et al., 13 Nov 2025) aggregates reviewer model risk ratings via accuracy-weighted jury voting, ensuring robustness to single-model bias.
  • Dynamic Difficulty Adaptation: SDEval (Wang et al., 8 Aug 2025) includes alpha-tuned perturbation gates for controlled adjustment of data difficulty.
  • Multi-Agent Decomposition: MSSBench (Zhou et al., 2024) and VLSU (Palaskar et al., 21 Oct 2025) analyze error sources by explicit breakdown into vision understanding, intent reasoning, and fusion, with modular evaluation pipelines.
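Two of the protocol pieces above reduce to short computations: ASR is a simple rate over judged responses, and a FairScore-style jury aggregates votes with per-judge accuracy weights. The sketch below is a hedged illustration; the exact weighting scheme in OutSafe-Bench may differ.

```python
# Hedged sketch of two common protocol pieces: (1) Attack Success Rate
# over judged responses and (2) an accuracy-weighted jury vote in the
# spirit of OutSafe-Bench's FairScore. The weighting is illustrative.

def attack_success_rate(judgments):
    """judgments: list of booleans, True if the attack elicited unsafe output."""
    return sum(judgments) / len(judgments) if judgments else 0.0

def weighted_jury_vote(votes, weights):
    """votes: {judge: "unsafe" | "safe"}; weights: {judge: accuracy weight}."""
    tally = {}
    for judge, vote in votes.items():
        tally[vote] = tally.get(vote, 0.0) + weights.get(judge, 1.0)
    return max(tally, key=tally.get)  # label with the largest weighted mass
```

With accuracy weighting, a single high-accuracy reviewer can outvote several weak ones, which is exactly the robustness-to-single-model-bias property FairScore targets.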

4. Benchmark Tasks and Scenario Coverage

Benchmarks span a diverse set of tasks indicative of real-world, safety-critical use:

Benchmark | Modalities | Key Tasks / Safety Scenarios
SynSHRP2 | Tabular, video, image, kinematics | Event attribute classification, scene understanding, crash severity, incident/conflict detection
USB | Text, image | Cross-modal attack/oversensitivity in 61 sub-categories
OutSafe-Bench | Text, image, audio, video | Multicategory content detection (privacy, violence, etc.) with MCRS metric, cross-modal consistency
Video-SafetyBench | Video+text | Detect unsafe completions under harmful & benign prompts
Omni-SafetyBench | All (text, image, audio, video) | Multi-format attack/defense, cross-modal safety consistency
AccidentBench | Video | Temporal, spatial, intent reasoning across accident/traffic
Automotive-ENV | Image, GUI, GPS | Explicit/implicit control, region-aware safety adaptation
SafeMT, MTMCS | Image, text | Multi-turn escalation/context switch, dialogue ASR vs. SI
SDEval | Image, text | Dynamic distribution shift injection, stress-testing
MSSBench, VLSU | Image, text | Situational risk where context determines safety

Scenarios include but are not limited to:

  • Physically hazardous activities, explicit/implicit violence, weapon manufacture, self-harm, privacy and PII leaks, legal/financial/deceptive instruction, offensive/biased content, medical misinformation, political manipulation, and proactive everyday risks (PaSBench (Yuan et al., 23 May 2025)).

5. Benchmark Findings, Common Failure Modes, and Insights

Safety–Utility Trade-off: A persistent finding across studies is that models that refuse more often over-block, hurting helpfulness, while permissive models miss subtler unsafe cues (VLSU (Palaskar et al., 21 Oct 2025), USB (Zheng et al., 26 May 2025), MTMCS-Bench (Liu et al., 11 Jan 2026)). No current model achieves low ASR and low ARR simultaneously.

Blind Spots in Cross-Modal Reasoning: Models excel at unimodal safe/unsafe cues (>90% accuracy), but accuracy drops to 20–55% when cross-modal fusion is required, especially on cases where each modality appears safe in isolation but the combination is hazardous (VLSU (Palaskar et al., 21 Oct 2025)). About 34% of such misclassifications occur despite both unimodal predictions being correct, underscoring compositionality failures.

Temporal and Proactive Risk: Models are more susceptible to temporally extended, referential, or multi-turn attacks (SafeMT, Video-SafetyBench, AccidentBench), and proactive risk remains a challenge even when underlying safety knowledge is present (PaSBench (Yuan et al., 23 May 2025)).

Dialogue and Escalation: ASR increases with number of turns (SafeMT (Zhu et al., 14 Oct 2025)). Moderation modules (e.g., ChatShield) and prompt-adaptive shields reduce 8-turn ASR by up to 20 pp, yet the best models plateau at ~57% safe response rate to daily-life hazards (SaLAD (Lou et al., 7 Jan 2026)).

Over/Under-Blocking and Calibration: Alignment methods (VLGuard, MIS, SPA-VL) may raise refusal rate but still miss context-dependent hazards or issue vague warnings (SaLAD (Lou et al., 7 Jan 2026), MMSafeAware (Wang et al., 16 Feb 2025)). Improvements in explicit chain-of-thought or modular agent pipelines bolster safety-awareness but yield diminishing returns above 75-80% on best-in-class MLLMs (MSSBench (Zhou et al., 2024), SafeBench (Ying et al., 2024)).

Complex Modality Attacks: Safety rapidly degrades in full multimodal settings (Omni-SafetyBench), with multi-modal attacks (e.g., audio+image+text) bypassing single-modality filters and yielding lowest CMSC-scores (median ~0.6–0.8 vs. unimodal ~0.9) (Pan et al., 10 Aug 2025).
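A consistency score of the kind reported above can be illustrated as the fraction of prompts on which the model's safety behavior is identical across every input-format variant. This is a simplified reading of the idea; Omni-SafetyBench's actual CMSC definition may be weighted or graded differently.

```python
# Illustrative cross-modal safety consistency: the fraction of prompts
# whose safety behavior (refuse vs. comply) agrees across all input
# formats. A simplified stand-in for Omni-SafetyBench's CMSC metric.

def cross_modal_consistency(behavior_by_format):
    """behavior_by_format: {prompt_id: {format: "refuse" | "comply"}}."""
    if not behavior_by_format:
        return 0.0
    consistent = sum(
        1 for formats in behavior_by_format.values()
        if len(set(formats.values())) == 1  # same behavior in every format
    )
    return consistent / len(behavior_by_format)
```

Under this reading, a median score of ~0.6–0.8 for full multimodal inputs means that on a fifth to two-fifths of prompts, at least one input format flips the model's safety decision.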

6. Benchmark Development, Best Practices, and Outlook

Principled Design Strategies:

  • Multi-level risk taxonomies with fine-grained sub-categories to localize blind spots (USB, OutSafe-Bench).
  • Systematic coverage of modality combinations, including risks that emerge only from joint inputs.
  • Dynamic and adversarial augmentation to resist training-data contamination and static overfitting (SDEval).
  • Multi-model jury evaluation with accuracy weighting to limit single-judge bias (SafeBench, OutSafe-Bench).
  • Paired safe/unsafe instances so that over-blocking is measured alongside vulnerability.

Limitations and Future Directions:

  • Scaling scenario diversity and data size (USB’s synthetic augmentation, SafeBench’s LLM-guided risk taxonomy updates).
  • Extension to longer, real-world or open-ended sequences (AccidentBench, PaSBench).
  • Integrated evaluation for aligned, context-aware, and cross-modally robust models.
  • Real-time adjudication with dynamic multi-agent/jury composition and human-in-the-loop for ambiguous or high-stakes output.
  • Incorporation of fine-grained, severity-weighted metrics and adversarial challenge suites to address rare or emerging risks.

Summary Table: Representative Benchmarks and Their Safety Axes

Benchmark | Modalities | Key Safety Axes | Evaluation Highlights
USB | Text, image | Vulnerability & oversensitivity; 61 categories | Dual metrics (ASR, ARR), coverage gap analysis
OutSafe-Bench | Text, image, audio, video | 9 risk categories, cross-risk scoring | MCRS, FairScore multi-LLM reviewer
Video-SafetyBench | Video+text | Benign referential, temporal | RJScore, domain calibration
SafeBench | Text, image, audio | 23-scenario taxonomy, jury protocol | LLM jury, SRI/ASR
SafeMT/MTMCS | Image, text | Multi-turn escalation/context | Safety Index (SI), dialogue spectrum
SDEval | Image, text | Dynamic perturbation | Distributional drift, leakage stress
MSSBench/VLSU | Image, text | Situational/joint reasoning | 17-pattern taxonomy, over/under-blocking

In current practice, no single benchmark is sufficient for exhaustive safety assessment. Comprehensive evaluation suites must jointly address modality, risk, scenario, and real-world compositionality, continuously refreshed to match the evolving risk landscape and model capabilities (Zheng et al., 26 May 2025, Palaskar et al., 21 Oct 2025, Yan et al., 13 Nov 2025, Wang et al., 8 Aug 2025).
