Multimodal Safety Benchmarks
- Multimodal safety benchmarks are standardized frameworks that define risk taxonomies, modality coverage, and scenario complexity to assess model safety and robustness.
- They integrate manual and LLM-assisted annotation with synthetic data generation and adversarial augmentation to capture real-world safety challenges.
- Quantitative metrics such as Attack Success Rate and Cross-Modal Safety Consistency guide improvements in model alignment and mitigate vulnerabilities.
Multimodal safety benchmarks provide standardized datasets, evaluation protocols, and metrics for assessing the safety, robustness, and alignment of multimodal foundation models, which process and generate content conditioned on multiple modalities such as text, images, audio, and video. These benchmarks are foundational for measuring whether such models can reliably detect and refuse unsafe requests, avoid over-cautious refusals of harmless content, and maintain alignment under adversarial manipulation, changing contexts, dynamic prompts, and complex joint-modality attacks.
1. Taxonomy and Design Principles of Multimodal Safety Benchmarks
Multimodal safety benchmarks are structured along three principal axes: risk taxonomy, modality coverage, and scenario complexity.
Risk Taxonomy: Benchmarks organize risk into multi-level taxonomies. USB defines a three-level hierarchy (primary, secondary, tertiary) with 3 primary classes, 16 secondary branches, and 61 tertiary sub-categories, spanning national, public, and ethical safety, including risks such as violent content, privacy leaks, medical misinformation, self-harm, and political hazards (Zheng et al., 26 May 2025). OutSafe-Bench structures content risks into nine explicit categories: privacy & property, prejudice & discrimination, crime & illegal activities, ethics & morality, violence & hatred, misinformation, political sensitivity, physical & mental health, and copyright (Yan et al., 13 Nov 2025). Such granular stratification enables fine-grained measurement of model blind spots.
Modality Combinations: Benchmarks systematically cover all possible risk loci across modalities. USB tests four configurations: RIRT (Risky-Image & Risky-Text), RIST (Risky-Image & Safe-Text), SIRT (Safe-Image & Risky-Text), and SIST (Safe-Image & Safe-Text, where risk emerges only jointly) (Zheng et al., 26 May 2025). Omni-SafetyBench extends this to 24 input format permutations, including audio-only, image+audio, video+text, and full omni-modal (text+image+audio+video), critical for modern OLLMs (Pan et al., 10 Aug 2025).
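The four USB risk-locus configurations can be enumerated mechanically once per-modality risk flags exist. A minimal sketch (function and flag names are illustrative, not from USB's release; the flags would come from upstream annotation):

```python
# Label an (image, text) pair by where the risk resides, following USB's
# four configurations: RIRT, RIST, SIRT, SIST (Zheng et al., 26 May 2025).

def risk_locus(risky_image: bool, risky_text: bool) -> str:
    """Map per-modality risk flags to USB's configuration label."""
    return {
        (True, True): "RIRT",    # Risky-Image & Risky-Text
        (True, False): "RIST",   # Risky-Image & Safe-Text
        (False, True): "SIRT",   # Safe-Image & Risky-Text
        (False, False): "SIST",  # Safe-Image & Safe-Text (risk only jointly)
    }[(risky_image, risky_text)]

print(risk_locus(False, False))  # SIST: each modality safe in isolation
```

The SIST cell is the interesting one: neither modality is risky alone, so the benchmark must pair such items with joint-safety labels rather than per-modality ones.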
Scenario Complexity and Dynamics: Recent benchmarks incorporate multi-turn dialogues (SafeMT (Zhu et al., 14 Oct 2025), MTMCS-Bench (Liu et al., 11 Jan 2026)), dynamic benchmark perturbation (SDEval (Wang et al., 8 Aug 2025)), proactive risk assessment (PaSBench (Yuan et al., 23 May 2025)), and situations where safety is only determinable through joint reasoning over context (VLSU (Palaskar et al., 21 Oct 2025), MSSBench (Zhou et al., 2024)). This shift from static, single-turn, text-only benchmarks reflects the need to mirror complex, real-world use cases.
2. Dataset Construction, Annotation, and Generation Techniques
Benchmark construction involves a combination of curated manual annotation, semi-automatic data generation, and adversarial example mining.
Manual and LLM-Assisted Annotation: MMSafeAware (Wang et al., 16 Feb 2025) and VLSU (Palaskar et al., 21 Oct 2025) employ multi-stage human expert annotation, with agreement metrics (e.g., inter-annotator F1 > 91%) and policies mandating that both image and text are jointly rated for safety and for nuanced “borderline” severity cases.
Synthetic and Privacy-Preserving Data Generation: SynSHRP2 (Shi et al., 6 May 2025) uses a pipeline of super-resolution, semantic segmentation, ControlNet-guided Stable Diffusion, and IP-Adapter for de-identified dash-cam synthesis, preserving traffic semantics while eliminating PII. The synthesis process produces crash and near-crash sequences with temporally consistent, privacy-safe keyframes and kinematics.
Adversarial and Dynamic Augmentation: SDEval (Wang et al., 8 Aug 2025) dynamically perturbs safety suites at the text (paraphrase, typo, mixed-language), image (augmentation, style transfer, caption-guided regeneration), and cross-modal (text-image injection, multimodal jailbreaks) levels, yielding benchmarks that expose model vulnerabilities invisible to static datasets and mitigate training-data contamination.
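Text-level perturbation of an existing safety suite can be sketched as a simple seeded transform. The typo-injection function below is an illustrative stand-in, not SDEval's actual implementation:

```python
import random

def inject_typos(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent alphabetic characters at a fixed rate to simulate typo noise."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturb_suite(prompts: list[str], rate: float) -> list[str]:
    """Produce a dynamically perturbed copy of a static safety suite."""
    return [inject_typos(p, rate=rate, seed=i) for i, p in enumerate(prompts)]

suite = ["Describe how to pick a lock.", "Summarize this image."]
perturbed = perturb_suite(suite, rate=0.3)
```

Because each item is perturbed with a distinct seed, the same suite regenerates deterministically, which matters when comparing models across evaluation runs.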
Multi-Turn Dialogue Expansion: SafeMT (Zhu et al., 14 Oct 2025) and MTMCS-Bench (Liu et al., 11 Jan 2026) generate, for every risky seed, paired motifs of safe and unsafe dialogues under multiple escalation and context-switch attack strategies, ensuring parity and stress-testing dialogue-tracking resilience.
Automated Quality Control and Attacker-in-the-Loop: USB (Zheng et al., 26 May 2025) and OutSafe-Bench (Yan et al., 13 Nov 2025) implement multi-step data vetting, including LLM-based rejection for low-aggressiveness prompts, attack-boosting loops to ensure successful model compromise, and content alignment verifiers.
3. Evaluation Protocols and Quantitative Metrics
Benchmarks define multi-dimensional metrics to quantify safety robustness, over-blocking, adversarial vulnerability, and cross-modal generalization.
Core Metrics:
- Attack Success Rate (ASR): Fraction of harmful test cases that elicit unsafe outputs under evaluation; lower is safer (Zheng et al., 26 May 2025, Ying et al., 2024, Yan et al., 13 Nov 2025). Some variants factor out comprehension failures (C-ASR in Omni-SafetyBench) (Pan et al., 10 Aug 2025).
- Oversensitivity/Refusal Rate (ARR, FPR): Fraction of harmless cases wrongly refused; measures over-blocking (Zheng et al., 26 May 2025, Wang et al., 16 Feb 2025).
- Safety Risk Index (SRI): Aggregates threat severity over all test items, averaged across jury LLMs (Ying et al., 2024).
- Multidimensional Cross Risk Score (MCRS): Combines per-category risk severity using a cross-risk influence matrix to account for correlated failure modes (Yan et al., 13 Nov 2025).
- Cross-Modal Safety Consistency Score (CMSC): Penalizes the standard deviation of modality-specific safety scores, rewarding defenses that remain stable across input-format shifts (Pan et al., 10 Aug 2025).
- Proactive Safety Index (SI): In SafeMT, SI penalizes early-turn model failures and rewards consistency over the course of a dialogue (Zhu et al., 14 Oct 2025).
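ASR and the refusal rate are straightforward fractions; CMSC's exact form is paper-specific, so the version below (one minus the standard deviation of per-modality safety scores) is one plausible instantiation, not the published definition:

```python
from statistics import pstdev

def attack_success_rate(unsafe_outputs: int, harmful_cases: int) -> float:
    """ASR: fraction of harmful test cases that elicit unsafe outputs (lower is safer)."""
    return unsafe_outputs / harmful_cases

def refusal_rate(wrong_refusals: int, harmless_cases: int) -> float:
    """ARR/FPR: fraction of harmless cases the model wrongly refuses (over-blocking)."""
    return wrong_refusals / harmless_cases

def cmsc(modality_safety_scores: list[float]) -> float:
    """Illustrative consistency score: penalize dispersion across
    modality-specific safety scores (higher = more stable defense)."""
    return 1.0 - pstdev(modality_safety_scores)

print(attack_success_rate(12, 200))    # 0.06
print(cmsc([0.92, 0.90, 0.88, 0.91]))  # close to 1.0 -> consistent across modalities
```

A model with a low ASR in one modality but a high ASR in another would score well on per-modality metrics yet poorly on consistency, which is precisely the failure mode CMSC is meant to surface.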
Evaluation Protocols:
- Jury Deliberation: SafeBench (Ying et al., 2024) evaluates each sample independently with a five-model LLM “jury,” using chain-of-thought rationales, peer broadcast/revision, and majority voting.
- FairScore: OutSafe-Bench (Yan et al., 13 Nov 2025) aggregates reviewer model risk ratings via accuracy-weighted jury voting, ensuring robustness to single-model bias.
- Dynamic Difficulty Adaptation: SDEval (Wang et al., 8 Aug 2025) includes alpha-tuned perturbation gates for controlled adjustment of data difficulty.
- Multi-Agent Decomposition: MSSBench (Zhou et al., 2024) and VLSU (Palaskar et al., 21 Oct 2025) analyze error sources by explicit breakdown into vision understanding, intent reasoning, and fusion, with modular evaluation pipelines.
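An accuracy-weighted jury vote in the spirit of FairScore can be sketched as follows; the weighting scheme here is a generic stand-in, and OutSafe-Bench's exact aggregation may differ:

```python
def weighted_jury_verdict(ratings: dict, accuracies: dict) -> bool:
    """Aggregate per-reviewer 'unsafe?' votes, weighting each reviewer model
    by its measured accuracy to dampen single-model bias."""
    unsafe_mass = sum(accuracies[m] for m, vote in ratings.items() if vote)
    safe_mass = sum(accuracies[m] for m, vote in ratings.items() if not vote)
    return unsafe_mass > safe_mass

votes = {"judge_a": True, "judge_b": False, "judge_c": True}
acc = {"judge_a": 0.91, "judge_b": 0.85, "judge_c": 0.78}
print(weighted_jury_verdict(votes, acc))  # True: weighted unsafe mass wins
```

Weighting by held-out accuracy means a single systematically lenient or harsh judge cannot dominate the verdict, which is the robustness property the multi-reviewer protocols above target.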
4. Benchmark Tasks and Scenario Coverage
Benchmarks span a diverse set of tasks indicative of real-world, safety-critical use:
| Benchmark | Modalities | Key Tasks / Safety Scenarios |
|---|---|---|
| SynSHRP2 | Tabular, video, image, kinematics | Event attribute classification, scene understanding, crash severity, incident/conflict detection |
| USB | Text, image | Cross-modal attack/oversensitivity in 61 sub-categories |
| OutSafe-Bench | Text, image, audio, video | Multicategory content detection (privacy, violence, etc.) with MCRS metric, cross-modal consistency |
| Video-SafetyBench | Video+text | Detect unsafe completions under harmful & benign prompts |
| Omni-SafetyBench | All (text, image, audio, video) | Multi-format attack/defense, cross-modal safety consistency |
| AccidentBench | Video | Temporal, spatial, intent reasoning across accident/traffic scenarios |
| Automotive-ENV | Image, GUI, GPS | Explicit/implicit control, region-aware safety adaptation |
| SafeMT, MTMCS | Image, text | Multi-turn escalation/context switch, dialogue ASR vs. SI |
| SDEval | Image, text | Dynamic distribution shift injection, stress-testing |
| MSSBench, VLSU | Image, text | Situational risk where context determines safety |
Scenarios include but are not limited to:
- Physically hazardous activities, explicit/implicit violence, weapon manufacture, self-harm, privacy and PII leaks, legal/financial/deceptive instruction, offensive/biased content, medical misinformation, political manipulation, and proactive everyday risks (PaSBench (Yuan et al., 23 May 2025)).
5. Benchmark Findings, Common Failure Modes, and Insights
Safety–Utility Trade-off: Across studies, models that refuse more often over-block harmless content, hurting helpfulness, while more permissive models miss subtler unsafe cues (VLSU (Palaskar et al., 21 Oct 2025), USB (Zheng et al., 26 May 2025), MTMCS-Bench (Liu et al., 11 Jan 2026)). No current model achieves low ASR and low ARR simultaneously.
Blind Spots in Cross-Modal Reasoning: Models excel at unimodal safe/unsafe cues (>90%), but accuracy drops to 20–55% when fusion is required, especially on cases where each modality appears safe in isolation but the combination is hazardous (VLSU (Palaskar et al., 21 Oct 2025)). About 34% of such misclassifications occur despite both unimodal predictions being correct, underscoring compositionality failures.
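The compositionality-failure statistic reported above can be computed by comparing unimodal and joint predictions on the same items; this is a generic sketch of that analysis, with record field names that are illustrative rather than from VLSU's release:

```python
def compositionality_failure_rate(records: list) -> float:
    """Among joint-prediction errors, return the fraction where BOTH
    unimodal predictions were nonetheless correct (VLSU-style analysis)."""
    joint_errors = [r for r in records if r["joint_pred"] != r["label"]]
    if not joint_errors:
        return 0.0
    both_unimodal_ok = [
        r for r in joint_errors
        if r["image_pred"] == r["image_label"]
        and r["text_pred"] == r["text_label"]
    ]
    return len(both_unimodal_ok) / len(joint_errors)
```

A high value of this rate indicates that errors stem from the fusion step itself rather than from misreading either modality, matching the ~34% figure cited for VLSU.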
Temporal and Proactive Risk: Models are more susceptible to temporally extended, referential, or multi-turn attacks (SafeMT, Video-SafetyBench, AccidentBench), and proactive risk remains a challenge even when underlying safety knowledge is present (PaSBench (Yuan et al., 23 May 2025)).
Dialogue and Escalation: ASR increases with number of turns (SafeMT (Zhu et al., 14 Oct 2025)). Moderation modules (e.g., ChatShield) and prompt-adaptive shields reduce 8-turn ASR by up to 20 pp, yet the best models plateau at ~57% safe response rate to daily-life hazards (SaLAD (Lou et al., 7 Jan 2026)).
Over/Under-Blocking and Calibration: Alignment methods (VLGuard, MIS, SPA-VL) may raise refusal rate but still miss context-dependent hazards or issue vague warnings (SaLAD (Lou et al., 7 Jan 2026), MMSafeAware (Wang et al., 16 Feb 2025)). Improvements in explicit chain-of-thought or modular agent pipelines bolster safety-awareness but yield diminishing returns above 75-80% on best-in-class MLLMs (MSSBench (Zhou et al., 2024), SafeBench (Ying et al., 2024)).
Complex Modality Attacks: Safety rapidly degrades in full multimodal settings (Omni-SafetyBench), with multi-modal attacks (e.g., audio+image+text) bypassing single-modality filters and yielding lowest CMSC-scores (median ~0.6–0.8 vs. unimodal ~0.9) (Pan et al., 10 Aug 2025).
6. Benchmark Development, Best Practices, and Outlook
Principled Design Strategies:
- Simultaneous evaluation of vulnerability and oversensitivity (Zheng et al., 26 May 2025).
- Dynamic or adversarial augmentation to keep pace with model advances (Wang et al., 8 Aug 2025).
- Use of chain-of-thought inspection, structured reward, and explicit cross-modal cues in reward shaping (Yi et al., 8 Oct 2025, Lou et al., 10 May 2025).
- Emphasis on proactive, not just reactive, risk recognition (Yuan et al., 23 May 2025).
- Multimodal jury or multi-reviewer protocols to assure robustness against judgment bias (Ying et al., 2024, Yan et al., 13 Nov 2025).
- Granular scenario coverage (proactive, multi-turn, time-series, situational, GUI/context-adaptive).
Limitations and Future Directions:
- Scaling scenario diversity and data size (USB’s synthetic augmentation, SafeBench’s LLM-guided risk taxonomy updates).
- Extension to longer, real-world or open-ended sequences (AccidentBench, PaSBench).
- Integrated evaluation for aligned, context-aware, and cross-modally robust models.
- Real-time adjudication with dynamic multi-agent/jury composition and human-in-the-loop for ambiguous or high-stakes output.
- Incorporation of fine-grained, severity-weighted metrics and adversarial challenge suites to address rare or emerging risks.
Summary Table: Representative Benchmarks and Their Safety Axes
| Benchmark | Modalities | Key Safety Axes | Evaluation Highlights |
|---|---|---|---|
| USB | Text, image | Vulnerability & oversensitivity; 61 categories | Dual metrics (ASR, ARR), coverage gap analysis |
| OutSafe-Bench | Text, image, audio, video | 9 risk categories, cross-risk scoring | MCRS, FairScore multi-LLM reviewer |
| Video-SafetyBench | Video+text | Benign referential, temporal | RJScore, domain calibration |
| SafeBench | Text, image, audio | 23 scenario taxonomy, jury protocol | LLM jury, SRI/ASR |
| SafeMT/MTMCS | Image, text | Multi-turn escalation/context | Safety Index (SI), dialogue spectrum |
| SDEval | Image, text | Dynamic perturbation | Distributional drift, leakage stress |
| MSSBench/VLSU | Image, text | Situational/joint reasoning | 17 pattern taxonomy, over/under-blocking |
In current practice, no single benchmark is sufficient for exhaustive safety assessment. Comprehensive evaluation suites must jointly address modality, risk, scenario, and real-world compositionality, continuously refreshed to match the evolving risk landscape and model capabilities (Zheng et al., 26 May 2025, Palaskar et al., 21 Oct 2025, Yan et al., 13 Nov 2025, Wang et al., 8 Aug 2025).