Multimodal Safety Benchmarks
- Multimodal safety benchmarks are standardized frameworks that define risk taxonomies, modality coverage, and scenario complexity to assess model safety and robustness.
- They integrate manual and LLM-assisted annotation with synthetic data generation and adversarial augmentation to capture real-world safety challenges.
- Quantitative metrics such as Attack Success Rate and Cross-Modal Safety Consistency guide improvements in model alignment and mitigate vulnerabilities.
Multimodal safety benchmarks provide standardized datasets, evaluation protocols, and metrics for assessing the safety, robustness, and alignment of multimodal foundation models, which process and generate content conditioned on multiple modalities such as text, images, audio, and video. These benchmarks are foundational for measuring whether such models can reliably detect and refuse unsafe requests, avoid over-cautious refusals of harmless content, and maintain alignment under adversarial manipulation, changing contexts, dynamic prompts, and complex joint-modality attacks.
1. Taxonomy and Design Principles of Multimodal Safety Benchmarks
Multimodal safety benchmarks are structured along three principal axes: risk taxonomy, modality coverage, and scenario complexity.
Risk Taxonomy: Benchmarks organize risk into multi-level taxonomies. USB defines a three-level hierarchy (primary, secondary, tertiary) with 3 primary classes, 16 secondary branches, and 61 tertiary sub-categories, spanning national, public, and ethical safety, including risks such as violent content, privacy leaks, medical misinformation, self-harm, and political hazards (Zheng et al., 26 May 2025). OutSafe-Bench structures content risks into nine explicit categories: privacy & property, prejudice & discrimination, crime & illegal activities, ethics & morality, violence & hatred, misinformation, political sensitivity, physical & mental health, and copyright (Yan et al., 13 Nov 2025). Such granular stratification enables fine-grained measurement of model blind spots.
Modality Combinations: Benchmarks systematically cover all possible risk loci across modalities. USB tests four configurations: RIRT (Risky-Image & Risky-Text), RIST (Risky-Image & Safe-Text), SIRT (Safe-Image & Risky-Text), and SIST (Safe-Image & Safe-Text, where risk emerges only jointly) (Zheng et al., 26 May 2025). Omni-SafetyBench extends this to 24 input format permutations, including audio-only, image+audio, video+text, and full omni-modal (text+image+audio+video), critical for modern OLLMs (Pan et al., 10 Aug 2025).
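The four USB risk-locus configurations can be enumerated mechanically once per-modality risk flags exist. A minimal sketch (function and flag names are illustrative, not from USB's release; the flags would come from upstream annotation):

```python
# Label an (image, text) pair by where the risk resides, following USB's
# four configurations: RIRT, RIST, SIRT, SIST (Zheng et al., 26 May 2025).

def risk_locus(risky_image: bool, risky_text: bool) -> str:
    """Map per-modality risk flags to USB's configuration label."""
    return {
        (True, True): "RIRT",    # Risky-Image & Risky-Text
        (True, False): "RIST",   # Risky-Image & Safe-Text
        (False, True): "SIRT",   # Safe-Image & Risky-Text
        (False, False): "SIST",  # Safe-Image & Safe-Text (risk only jointly)
    }[(risky_image, risky_text)]

print(risk_locus(False, False))  # SIST: each modality safe in isolation
```

The SIST cell is the interesting one: neither modality is risky alone, so the benchmark must pair such items with joint-safety labels rather than per-modality ones.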
Scenario Complexity and Dynamics: Recent benchmarks incorporate multi-turn dialogues (SafeMT (Zhu et al., 14 Oct 2025), MTMCS-Bench (Liu et al., 11 Jan 2026)), dynamic benchmark perturbation (SDEval (Wang et al., 8 Aug 2025)), proactive risk assessment (PaSBench (Yuan et al., 23 May 2025)), and situations where safety is only determinable through joint reasoning over context (VLSU (Palaskar et al., 21 Oct 2025), MSSBench (Zhou et al., 2024)). This shift from static, single-turn, text-only benchmarks reflects the need to mirror complex, real-world use cases.
2. Dataset Construction, Annotation, and Generation Techniques
Benchmark construction involves a combination of curated manual annotation, semi-automatic data generation, and adversarial example mining.
Manual and LLM-Assisted Annotation: MMSafeAware (Wang et al., 16 Feb 2025) and VLSU (Palaskar et al., 21 Oct 2025) employ multi-stage human expert annotation, with agreement metrics (e.g., inter-annotator F1 > 91%) and policies mandating that both image and text are jointly rated for safety and for nuanced “borderline” severity cases.
Synthetic and Privacy-Preserving Data Generation: SynSHRP2 (Shi et al., 6 May 2025) uses a pipeline of super-resolution, semantic segmentation, ControlNet-guided Stable Diffusion, and IP-Adapter for de-identified dash-cam synthesis, preserving traffic semantics while eliminating PII. The synthesis process produces crash and near-crash sequences with temporally consistent, privacy-safe keyframes and kinematics.
Adversarial and Dynamic Augmentation: SDEval (Wang et al., 8 Aug 2025) dynamically perturbs safety suites at the text (paraphrase, typo, mixed-language), image (augmentation, style transfer, caption-guided regeneration), and cross-modal (text-image injection, multimodal jailbreaks) levels, yielding benchmarks that expose model vulnerabilities invisible to static datasets and mitigate training-data contamination.
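Text-level perturbation of an existing safety suite can be sketched as a simple seeded transform. The typo-injection function below is an illustrative stand-in, not SDEval's actual implementation:

```python
import random

def inject_typos(prompt: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent alphabetic characters at a fixed rate to simulate typo noise."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturb_suite(prompts: list[str], rate: float) -> list[str]:
    """Produce a dynamically perturbed copy of a static safety suite."""
    return [inject_typos(p, rate=rate, seed=i) for i, p in enumerate(prompts)]

suite = ["Describe how to pick a lock.", "Summarize this image."]
perturbed = perturb_suite(suite, rate=0.3)
```

Because each item is perturbed with a distinct seed, the same suite regenerates deterministically, which matters when comparing models across evaluation runs.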
Multi-Turn Dialogue Expansion: SafeMT (Zhu et al., 14 Oct 2025) and MTMCS-Bench (Liu et al., 11 Jan 2026) generate, for every risky seed, paired motifs of safe and unsafe dialogues under multiple escalation and context-switch attack strategies, ensuring parity and stress-testing dialogue-tracking resilience.
Automated Quality Control and Attacker-in-the-Loop: USB (Zheng et al., 26 May 2025) and OutSafe-Bench (Yan et al., 13 Nov 2025) implement multi-step data vetting, including LLM-based rejection for low-aggressiveness prompts, attack-boosting loops to ensure successful model compromise, and content alignment verifiers.
3. Evaluation Protocols and Quantitative Metrics
Benchmarks define multi-dimensional metrics to quantify safety robustness, over-blocking, adversarial vulnerability, and cross-modal generalization.
Core Metrics:
- Attack Success Rate (ASR): Fraction of harmful test cases that elicit unsafe outputs under evaluation; lower is safer (Zheng et al., 26 May 2025, Ying et al., 2024, Yan et al., 13 Nov 2025). Some variants factor out comprehension failures (C-ASR in Omni-SafetyBench) (Pan et al., 10 Aug 2025).
- Oversensitivity/Refusal Rate (ARR, FPR): Fraction of harmless cases wrongly refused; measures over-blocking (Zheng et al., 26 May 2025, Wang et al., 16 Feb 2025).
- Safety Risk Index (SRI): Aggregates threat severity over all test items, averaged across jury LLMs (Ying et al., 2024).
- Multidimensional Cross Risk Score (MCRS): Combines per-category risk severity using a cross-risk influence matrix to account for correlated failure modes (Yan et al., 13 Nov 2025).
- Cross-Modal Safety Consistency Score (CMSC): Penalizes the standard deviation of modality-specific safety scores, rewarding defenses that remain stable across input-format shifts (Pan et al., 10 Aug 2025).
- Proactive Safety Index (SI): In SafeMT, SI penalizes early-turn model failures and rewards consistency over the course of a dialogue (Zhu et al., 14 Oct 2025).
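ASR and the refusal rate are straightforward fractions; CMSC's exact form is paper-specific, so the version below (one minus the standard deviation of per-modality safety scores) is one plausible instantiation, not the published definition:

```python
from statistics import pstdev

def attack_success_rate(unsafe_outputs: int, harmful_cases: int) -> float:
    """ASR: fraction of harmful test cases that elicit unsafe outputs (lower is safer)."""
    return unsafe_outputs / harmful_cases

def refusal_rate(wrong_refusals: int, harmless_cases: int) -> float:
    """ARR/FPR: fraction of harmless cases the model wrongly refuses (over-blocking)."""
    return wrong_refusals / harmless_cases

def cmsc(modality_safety_scores: list[float]) -> float:
    """Illustrative consistency score: penalize dispersion across
    modality-specific safety scores (higher = more stable defense)."""
    return 1.0 - pstdev(modality_safety_scores)

print(attack_success_rate(12, 200))    # 0.06
print(cmsc([0.92, 0.90, 0.88, 0.91]))  # close to 1.0 -> consistent across modalities
```

A model with a low ASR in one modality but a high ASR in another would score well on per-modality metrics yet poorly on consistency, which is precisely the failure mode CMSC is meant to surface.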
Evaluation Protocols:
- Jury Deliberation: SafeBench (Ying et al., 2024) evaluates each sample independently with a five-model LLM “jury,” using chain-of-thought rationales, peer broadcast/revision, and majority voting.
- FairScore: OutSafe-Bench (Yan et al., 13 Nov 2025) aggregates reviewer model risk ratings via accuracy-weighted jury voting, ensuring robustness to single-model bias.
- Dynamic Difficulty Adaptation: SDEval (Wang et al., 8 Aug 2025) includes alpha-tuned perturbation gates for controlled adjustment of data difficulty.
- Multi-Agent Decomposition: MSSBench (Zhou et al., 2024) and VLSU (Palaskar et al., 21 Oct 2025) analyze error sources by explicit breakdown into vision understanding, intent reasoning, and fusion, with modular evaluation pipelines.
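An accuracy-weighted jury vote in the spirit of FairScore can be sketched as follows; the weighting scheme here is a generic stand-in, and OutSafe-Bench's exact aggregation may differ:

```python
def weighted_jury_verdict(ratings: dict, accuracies: dict) -> bool:
    """Aggregate per-reviewer 'unsafe?' votes, weighting each reviewer model
    by its measured accuracy to dampen single-model bias."""
    unsafe_mass = sum(accuracies[m] for m, vote in ratings.items() if vote)
    safe_mass = sum(accuracies[m] for m, vote in ratings.items() if not vote)
    return unsafe_mass > safe_mass

votes = {"judge_a": True, "judge_b": False, "judge_c": True}
acc = {"judge_a": 0.91, "judge_b": 0.85, "judge_c": 0.78}
print(weighted_jury_verdict(votes, acc))  # True: weighted unsafe mass wins
```

Weighting by held-out accuracy means a single systematically lenient or harsh judge cannot dominate the verdict, which is the robustness property the multi-reviewer protocols above target.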
4. Benchmark Tasks and Scenario Coverage
Benchmarks span a diverse set of tasks indicative of real-world, safety-critical use:
| Benchmark | Modalities | Key Tasks / Safety Scenarios |
|---|---|---|
| SynSHRP2 | Tabular, video, image, kinematics | Event attribute classification, scene understanding, crash severity, incident/conflict detection |
| USB | Text, image | Cross-modal attack/oversensitivity in 61 sub-categories |
| OutSafe-Bench | Text, image, audio, video | Multicategory content detection (privacy, violence, etc.) with MCRS metric, cross-modal consistency |
| Video-SafetyBench | Video+text | Detect unsafe completions under harmful & benign prompts |
| Omni-SafetyBench | All (text, image, audio, video) | Multi-format attack/defense, cross-modal safety consistency |
| AccidentBench | Video | Temporal, spatial, intent reasoning across accident/traffic scenarios |
| Automotive-ENV | Image, GUI, GPS | Explicit/implicit control, region-aware safety adaptation |
| SafeMT, MTMCS | Image, text | Multi-turn escalation/context switch, dialogue ASR vs. SI |
| SDEval | Image, text | Dynamic distribution shift injection, stress-testing |
| MSSBench, VLSU | Image, text | Situational risk where context determines safety |
Scenarios include but are not limited to:
- Physically hazardous activities, explicit/implicit violence, weapon manufacture, self-harm, privacy and PII leaks, legal/financial/deceptive instruction, offensive/biased content, medical misinformation, political manipulation, and proactive everyday risks (PaSBench (Yuan et al., 23 May 2025)).
5. Benchmark Findings, Common Failure Modes, and Insights
Safety–Utility Trade-off: Across studies, models that refuse more often over-block harmless content, hurting helpfulness, while more permissive models miss subtler unsafe cues (VLSU (Palaskar et al., 21 Oct 2025), USB (Zheng et al., 26 May 2025), MTMCS-Bench (Liu et al., 11 Jan 2026)). No current model achieves low ASR and low ARR simultaneously.
Blind Spots in Cross-Modal Reasoning: Models excel at unimodal safe/unsafe cues (>90%), but accuracy drops to 20–55% when fusion is required, especially on cases where each modality appears safe in isolation but the combination is hazardous (VLSU (Palaskar et al., 21 Oct 2025)). About 34% of such misclassifications occur despite both unimodal predictions being correct, underscoring compositionality failures.
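The compositionality-failure statistic reported above can be computed by comparing unimodal and joint predictions on the same items; this is a generic sketch of that analysis, with record field names that are illustrative rather than from VLSU's release:

```python
def compositionality_failure_rate(records: list) -> float:
    """Among joint-prediction errors, return the fraction where BOTH
    unimodal predictions were nonetheless correct (VLSU-style analysis)."""
    joint_errors = [r for r in records if r["joint_pred"] != r["label"]]
    if not joint_errors:
        return 0.0
    both_unimodal_ok = [
        r for r in joint_errors
        if r["image_pred"] == r["image_label"]
        and r["text_pred"] == r["text_label"]
    ]
    return len(both_unimodal_ok) / len(joint_errors)
```

A high value of this rate indicates that errors stem from the fusion step itself rather than from misreading either modality, matching the ~34% figure cited for VLSU.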
Temporal and Proactive Risk: Models are more susceptible to temporally extended, referential, or multi-turn attacks (SafeMT, Video-SafetyBench, AccidentBench), and proactive risk remains a challenge even when underlying safety knowledge is present (PaSBench (Yuan et al., 23 May 2025)).
Dialogue and Escalation: ASR increases with number of turns (SafeMT (Zhu et al., 14 Oct 2025)). Moderation modules (e.g., ChatShield) and prompt-adaptive shields reduce 8-turn ASR by up to 20 pp, yet the best models plateau at ~57% safe response rate to daily-life hazards (SaLAD (Lou et al., 7 Jan 2026)).
Over/Under-Blocking and Calibration: Alignment methods (VLGuard, MIS, SPA-VL) may raise refusal rate but still miss context-dependent hazards or issue vague warnings (SaLAD (Lou et al., 7 Jan 2026), MMSafeAware (Wang et al., 16 Feb 2025)). Improvements in explicit chain-of-thought or modular agent pipelines bolster safety-awareness but yield diminishing returns above 75-80% on best-in-class MLLMs (MSSBench (Zhou et al., 2024), SafeBench (Ying et al., 2024)).
Complex Modality Attacks: Safety rapidly degrades in full multimodal settings (Omni-SafetyBench), with multi-modal attacks (e.g., audio+image+text) bypassing single-modality filters and yielding lowest CMSC-scores (median ~0.6–0.8 vs. unimodal ~0.9) (Pan et al., 10 Aug 2025).
6. Benchmark Development, Best Practices, and Outlook
Principled Design Strategies:
- Simultaneous evaluation of vulnerability and oversensitivity (Zheng et al., 26 May 2025).
- Dynamic or adversarial augmentation to keep pace with model advances (Wang et al., 8 Aug 2025).
- Use of chain-of-thought inspection, structured reward, and explicit cross-modal cues in reward shaping (Yi et al., 8 Oct 2025, Lou et al., 10 May 2025).
- Emphasis on proactive, not just reactive, risk recognition (Yuan et al., 23 May 2025).
- Multimodal jury or multi-reviewer protocols to assure robustness against judgment bias (Ying et al., 2024, Yan et al., 13 Nov 2025).
- Granular scenario coverage (proactive, multi-turn, time-series, situational, GUI/context-adaptive).
Limitations and Future Directions:
- Scaling scenario diversity and data size (USB’s synthetic augmentation, SafeBench’s LLM-guided risk taxonomy updates).
- Extension to longer, real-world or open-ended sequences (AccidentBench, PaSBench).
- Integrated evaluation for aligned, context-aware, and cross-modally robust models.
- Real-time adjudication with dynamic multi-agent/jury composition and human-in-the-loop for ambiguous or high-stakes output.
- Incorporation of fine-grained, severity-weighted metrics and adversarial challenge suites to address rare or emerging risks.
Summary Table: Representative Benchmarks and Their Safety Axes
| Benchmark | Modalities | Key Safety Axes | Evaluation Highlights |
|---|---|---|---|
| USB | Text, image | Vulnerability & oversensitivity; 61 categories | Dual metrics (ASR, ARR), coverage gap analysis |
| OutSafe-Bench | Text, image, audio, video | 9 risk categories, cross-risk scoring | MCRS, FairScore multi-LLM reviewer |
| Video-SafetyBench | Video+text | Benign referential, temporal | RJScore, domain calibration |
| SafeBench | Text, image, audio | 23 scenario taxonomy, jury protocol | LLM jury, SRI/ASR |
| SafeMT/MTMCS | Image, text | Multi-turn escalation/context | Safety Index (SI), dialogue spectrum |
| SDEval | Image, text | Dynamic perturbation | Distributional drift, leakage stress |
| MSSBench/VLSU | Image, text | Situational/joint reasoning | 17 pattern taxonomy, over/under-blocking |
In current practice, no single benchmark is sufficient for exhaustive safety assessment. Comprehensive evaluation suites must jointly address modality, risk, scenario, and real-world compositionality, continuously refreshed to match the evolving risk landscape and model capabilities (Zheng et al., 26 May 2025, Palaskar et al., 21 Oct 2025, Yan et al., 13 Nov 2025, Wang et al., 8 Aug 2025).