
Omni-Modal Safety Dataset Overview

Updated 28 January 2026
  • Omni-Modal Safety Dataset is a comprehensive multi-modality corpus integrating text, image, video, audio, and cross-modal samples to enhance safety in LLMs.
  • It features binary safety labels along with detailed multi-label annotations for risks like toxicity, harassment, and misinformation.
  • Structured, model-generated critiques via expert distillation enable interpretable guardrail models and reproducible benchmarking across 15 public benchmarks.

The Omni-Modal Safety Dataset (also referred to as the OmniGuard safety dataset) is a large-scale, multi-modality, multi-label corpus designed to support research and deployment of guardrail models for safety in omni-modal LLMs (OLLMs). It comprises approximately 250,000 samples spanning text, image, video, audio, and cross-modal inputs, with each sample annotated for safety and accompanied by structured model-generated critiques. The dataset is intended to enable unified, interpretable safety judgments across a diverse range of content modalities and risk types, explicitly addressing limitations of prior unimodal safety datasets and binary-classification frameworks (Zhu et al., 2 Dec 2025).

1. Scope and Composition

The Omni-Modal Safety Dataset is curated for comprehensive coverage of the modalities and risk types relevant to OLLM safety. After deduplication it contains 249,530 samples, distributed as follows:

Modality      Samples   Major Source Datasets
Text          149,034   WildGuardMix, Aegis2.0, BeaverTails, ToxicChat
Image          14,679   UnsafeBench, LlavaGuard, VLGuard
Video          59,960   SafeSora, LSPD, TikHarm, DCSASS
Audio          23,617   MuTox-English, WildGuardMix-TTS
Cross-modal     2,240   VLSBench

The dataset consists predominantly of unimodal examples (≈99.1%), with a small proportion of cross-modal (image–text) samples (≈0.9%). Thematic groupings encompass a comprehensive taxonomy covering textual risks (e.g., toxicity, harassment, self-harm, criminal advice, misinformation), visual risks (e.g., nudity, violence, privacy leaks), video risks (e.g., dangerous behaviors, visual misinformation), audio risks (e.g., hate speech, privacy leakage), and cross-modal risks (e.g., concealed intent, multimodal misinformation) (Zhu et al., 2 Dec 2025).
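
As a rough illustration, the taxonomy above can be organized as a modality-to-risk mapping. The snake_case category names below are illustrative groupings drawn from the text, not the dataset's official label strings:

```python
# Illustrative per-modality risk taxonomy, paraphrasing the groupings above.
# Category identifiers are assumptions, not the released dataset's exact labels.
RISK_TAXONOMY = {
    "text": ["toxicity", "harassment", "self_harm", "criminal_advice", "misinformation"],
    "image": ["nudity", "violence", "privacy_leak"],
    "video": ["dangerous_behaviors", "visual_misinformation"],
    "audio": ["hate_speech", "privacy_leakage"],
    "cross_modal": ["concealed_intent", "multimodal_misinformation"],
}

def risks_for(modality: str) -> list[str]:
    """Return the risk categories tracked for a given modality (empty if unknown)."""
    return RISK_TAXONOMY.get(modality, [])
```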

2. Annotation Schema and Structured Labels

Annotations follow a two-stage structured schema:

  1. Binary Safety Label: Every sample receives a primary “safe” or “unsafe” label, y ∈ {safe, unsafe}.
  2. Multi-Label Violation Categories: If unsafe, each sample is further labeled with one or more violation categories C = {c_1, ..., c_m} from a unified policy guideline set. Categories include (but are not limited to): Self-Harm, Violence, Hate Speech, Criminal Advice, Privacy Leak, Misinformation, and Sexual Content.

No continuous severity ratings are assigned; all violation categories are treated as equally weighted multi-labels. Samples may be annotated with multiple risk categories.
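
A minimal sketch of this two-stage record, assuming hypothetical field names (the released dataset's exact keys may differ):

```python
from dataclasses import dataclass, field

# Hypothetical record type mirroring the two-stage schema: a binary safety
# label, plus equally weighted multi-label violation categories when unsafe.
@dataclass
class SafetyAnnotation:
    label: str                                            # "safe" or "unsafe"
    categories: list[str] = field(default_factory=list)   # non-empty only if unsafe

    def __post_init__(self):
        if self.label not in ("safe", "unsafe"):
            raise ValueError(f"invalid label: {self.label}")
        if self.label == "safe" and self.categories:
            raise ValueError("safe samples carry no violation categories")

safe_sample = SafetyAnnotation("safe")
unsafe_sample = SafetyAnnotation("unsafe", ["Hate Speech", "Harassment"])  # multi-label
```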

3. Critique Generation and Expert Distillation

For each data instance, a detailed natural-language critique is generated via targeted expert-model distillation. The process leverages domain-appropriate large “teacher” models:

  • Text: gpt-oss-120B,
  • Visual (images, videos, image–text): Qwen3-VL-235B-A22B-Instruct,
  • Audio: Kimi-Audio-7B-Instruct.

Each teacher model is prompted in a structured fashion to deliver:

  • the assigned safety label,
  • relevant categories,
  • and a detailed reasoning explanation.

The resulting dataset contains a quadruple (x_i, y_i, C_i, e_i) for each sample, supporting instruction-tuning and deliberate reasoning in downstream guardrail models (Zhu et al., 2 Dec 2025). No traditional human inter-annotator agreement is reported; consistency is enforced by the teacher-model generation process.
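
The distillation step can be sketched as follows. The `query_teacher` function here is a placeholder standing in for calls to the actual teacher models named above; its structured output format is an assumption:

```python
# Sketch of assembling (x_i, y_i, C_i, e_i) quadruples via teacher distillation.
# Modality routing follows the source text; the query function is hypothetical.
TEACHER_BY_MODALITY = {
    "text": "gpt-oss-120B",
    "image": "Qwen3-VL-235B-A22B-Instruct",
    "video": "Qwen3-VL-235B-A22B-Instruct",
    "cross_modal": "Qwen3-VL-235B-A22B-Instruct",
    "audio": "Kimi-Audio-7B-Instruct",
}

def query_teacher(model: str, sample) -> dict:
    # Placeholder: in practice this prompts the teacher model in a structured
    # fashion for a safety label, violation categories, and a reasoning trace.
    return {"label": "unsafe", "categories": ["Harassment"], "explanation": "..."}

def distill(sample, modality: str) -> tuple:
    """Produce one (x, y, C, e) quadruple for a sample of the given modality."""
    out = query_teacher(TEACHER_BY_MODALITY[modality], sample)
    return (sample, out["label"], out["categories"], out["explanation"])
```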

4. Data Collection and Curation Methodology

All primary data sources are public. The dataset composition includes both original and synthetic splits (for example, WildGuardMix-TTS, audio generated by text-to-speech conversion of WildGuardMix text). The integration strategy avoids manual re-labeling, instead unifying taxonomies (e.g., mapping video risk classes onto SafeWatch-Bench labels) and discarding out-of-scope samples according to predefined policy guidelines.

Quality control relies on the consistency of teacher-model critiques rather than conventional human adjudication. Samples are excluded if their content is outside the predefined risk taxonomy or policy scope (Zhu et al., 2 Dec 2025).
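
The curation logic described above (taxonomy unification, scope filtering, deduplication) can be sketched as a single pass over the pooled sources. The mapping tables below are illustrative assumptions, not the paper's actual label maps:

```python
# Hedged sketch of curation: map source labels onto a unified taxonomy, drop
# out-of-scope samples, and deduplicate. All label names here are examples.
UNIFIED_CATEGORIES = {
    "toxicity", "harassment", "self_harm", "criminal_advice",
    "privacy_leak", "misinformation", "sexual_content", "violence", "hate_speech",
}
SOURCE_TO_UNIFIED = {"NSFW": "sexual_content", "toxic": "toxicity"}  # illustrative

def curate(samples):
    """samples: iterable of (content, source_label) pairs."""
    seen, kept = set(), []
    for content, source_label in samples:
        label = SOURCE_TO_UNIFIED.get(source_label, source_label)
        if label not in UNIFIED_CATEGORIES:
            continue  # out of scope under the predefined policy guidelines
        if content in seen:
            continue  # deduplication
        seen.add(content)
        kept.append((content, label))
    return kept
```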

5. Distributional Statistics

Key statistics are as follows:

Modality      Count     Fraction of Total
Text          149,034   ≈ 59.7%
Image          14,679   ≈ 5.9%
Video          59,960   ≈ 24.0%
Audio          23,617   ≈ 9.5%
Cross-modal     2,240   ≈ 0.9%
Total         249,530   100%
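
The fractions in the table follow directly from the raw counts:

```python
# Recompute the per-modality fractions from the counts reported above.
counts = {"Text": 149_034, "Image": 14_679, "Video": 59_960,
          "Audio": 23_617, "Cross-modal": 2_240}
total = sum(counts.values())                      # 249,530
fractions = {m: n / total for m, n in counts.items()}

assert total == 249_530
assert round(fractions["Text"] * 100, 1) == 59.7
assert round(fractions["Cross-modal"] * 100, 1) == 0.9
```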

Breakdown by source datasets for each modality is explicit in the released dataset documentation (e.g., WildGuardMix for text, UnsafeBench for images, SafeSora for video, MuTox-English for audio).

6. Downstream Benchmarking and Use Cases

Evaluation covers 15 public benchmarks, selected to provide broad coverage across modalities and task types:

  • Text: BeaverTails, ToxicChat, WildGuardMix, Aegis2.0, OpenAI Moderation.
  • Image: UnsafeBench, VLGuard, LlavaGuard.
  • Video: SafeSora, SafeWatch-Bench.
  • Audio: MuTox-English, WildGuardMix-TTS.
  • Cross-modal: MM-SafetyBench (image+text), Video-SafetyBench (video+text), AIAH (audio+text).

Metrics are Accuracy and F1, computed for both binary safety prediction and multi-label violation classification. In reported experiments, OMNIGUARD-7B outperforms leading baselines (GPT-4o, Qwen3-235B, etc.) by over 10 percentage points in average F1, and OMNIGUARD-3B matches or exceeds larger models. Notably, reasoning-augmented supervision (i.e., inclusion of distilled critiques during training) yields measurable gains over label-only fine-tuning, and cross-modal generalization is strong: unseen-modality F1 (≈79.4%) approaches seen-modality F1 (≈81.8%) (Zhu et al., 2 Dec 2025).

7. Reproducibility, Access, and Licensing

The Omni-Modal Safety Dataset and associated models are assembled from publicly available datasets, primarily released via HuggingFace or GitHub links. Licensing adheres to the original terms (Apache 2.0, CC BY, MIT, etc.), and practitioners should check each component’s restrictions before use.

Reproducibility is facilitated by detailed reporting of dataset composition, teacher models, prompt templates, and source-level sample counts. No new human annotation protocol or inter-annotator metrics are introduced; annotation inherits from upstream datasets. A small synthetic segment (notably WildGuardMix-TTS) is included to expand the audio modality (Zhu et al., 2 Dec 2025).


The Omni-Modal Safety Dataset constitutes a foundational resource for developing, training, and benchmarking unified guardrails in OLLMs, with a focus on scalable, interpretable, and policy-driven safety interventions across diverse data modalities.
