
Multimodal Toxicity Covertness in LMMs

Updated 10 February 2026
  • Multimodal Toxicity Covertness (MTC) is defined as hidden toxicity that emerges only when text and image modalities are jointly analyzed.
  • It incorporates metrics like the Hidden Toxicity (HT) metric and TAG-based MTC scores to quantify cross-modal toxicity concealed in individual benign modalities.
  • Detection and mitigation methods, including TAG-assisted analysis, neuron-level detoxification, and adversarial red-teaming, are actively shaping research in safeguarding LMMs.

Multimodal Toxicity Covertness (MTC) captures the phenomenon whereby toxic, prejudicial, or harmful implications in large multimodal models (LMMs) are concealed unless both text and image (or other modalities) are jointly considered. In MTC scenarios, neither the textual nor the visual channel alone contains overtly toxic cues; only their combination reveals discrimination, prejudice, or offensive interpretations, eluding standard uni-modal safety tools. Recent research establishes MTC as a critical challenge for model safety, motivating quantitative metrics, dedicated benchmarks, and both offensive and defensive methodologies to probe and manage this hidden toxicity.

1. Formal Definition and Taxonomy

MTC is defined as the degree to which a potentially harmful multimodal meaning is concealed, manifesting only through cross-modal interpretation. In MDIT-Bench, this is termed “dual-implicit toxicity”: a scenario where both the text T and the image V are individually non-toxic (s_\mathrm{text}(T) = 1, s_\mathrm{image}(V) = 1), yet their combination is toxic, f((T, V)) = 1 (Jin et al., 22 May 2025, Cui et al., 20 May 2025).

The ShieldVLM taxonomy organizes multimodal implicit toxicity (MTC in this context) into seven risk categories: Offensive, Discrimination & Stereotype, Physical Harm, Illegal Activities, Morality Violation, Private Property & Privacy, and Misinformation. Each category is further split into sub-categories; for example, “Discrimination & Stereotype” includes Race, Gender, Religion, and Orientation Discrimination, among others (Cui et al., 20 May 2025).

MTC thus encapsulates linguistic, contextual, and semantic scenarios in which toxicity is unlocked only by cross-modal reasoning, through mechanisms such as:

  • Semantic drift
  • Contextualization (e.g., “jokes” told at a funeral)
  • Metaphor (visual or textual)
  • Implication (action presupposing malice)
  • Knowledge-dependent cues (cultural, common-sense)

2. Measurement and Metrics

Recent advances provide formal metrics for quantifying MTC. Two principal strands emerge:

A. Hidden Toxicity (HT) Metric

Introduced in MDIT-Bench, the HT metric directly quantifies latent model toxicity unmasked by adversarial context injection. For a model \mathcal{G}, let Acc_{n=0} be the accuracy on standard dual-implicit tasks and Acc_{n=i} the accuracy under i-shot toxic demonstrations. HT is then:

HT(\mathcal{G}) = \sum_{i \in N} \left(1 - \frac{Acc_{n=i}}{Acc_{n=0}}\right) \mathrm{Norm}_N(i)

\mathrm{Norm}_N(i) = \frac{1/\log_2 i}{\sum_{j \in N} 1/\log_2 j}

with N = \{32, 64, 128\} (Jin et al., 22 May 2025). HT reflects the fraction of hidden toxicity activated by extensive context priming.
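As a concrete illustration, the HT computation can be sketched in a few lines; the accuracy values in the example are hypothetical placeholders, not results from the paper:

```python
# Sketch of the MDIT-Bench Hidden Toxicity (HT) metric.
import math

def ht_score(acc_base, acc_by_shots):
    """acc_base: Acc_{n=0}; acc_by_shots: {i: Acc_{n=i}} for i in N."""
    N = sorted(acc_by_shots)
    # Norm_N(i) = (1/log2 i) / sum_{j in N} 1/log2 j: the weights sum to 1,
    # with smaller shot counts weighted more heavily.
    denom = sum(1.0 / math.log2(j) for j in N)
    return sum(
        (1.0 - acc_by_shots[i] / acc_base) * (1.0 / math.log2(i)) / denom
        for i in N
    )

# Hypothetical accuracies under 32/64/128-shot toxic priming:
ht = ht_score(0.80, {32: 0.60, 64: 0.50, 128: 0.40})
```

A model whose accuracy is unchanged by priming yields HT = 0; the more its accuracy collapses under context injection, the closer HT moves toward 1.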

B. Multimodal Toxicity Covertness (MTC) Score via Toxicity Association Graphs (TAGs)

As a graph-theoretic metric, MTC is computed as follows (Wu et al., 3 Feb 2026):

  • Construct association trees T^{\mathbf{x}} (image) and T^{\mathbf{t}} (text), and connect them in a cross-modal bipartite graph B^{\mathbf{x},\mathbf{t}}.
  • Identify the shallowest toxic node pair (v^x_i, v^t_j) \in \mathcal{S}, where \mathcal{S} is an oracle or mined toxic association set.
  • Compute the joint probability of traversing from the roots to this toxic pair: \hat{p}_{ij} = p_x(i) \cdot p_t(j), where each path probability is the product of edge transition probabilities.
  • Define the covertness as c = 1 - \hat{p}_{ij} \in [0, 1].
  • Empirically, c is partitioned into low [0, 0.2), medium [0.2, 0.8), and high [0.8, 1] covertness.

This value quantifies how deeply the toxic association is concealed: c = 1 implies high covertness (hard to flag), while c = 0 is overt (trivial to detect).
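The path-probability arithmetic above can be sketched directly; the graph construction itself is elided here, and the edge probabilities in the example are invented for illustration:

```python
# Sketch of the TAG-based covertness score c = 1 - p_hat_ij, where each
# path probability is the product of edge transition probabilities from
# a root to the shallowest toxic node.
from math import prod

def covertness(image_path_edges, text_path_edges):
    p_hat = prod(image_path_edges) * prod(text_path_edges)
    return 1.0 - p_hat

def covertness_band(c):
    # Empirical partition from the text: low / medium / high covertness.
    if c < 0.2:
        return "low"
    if c < 0.8:
        return "medium"
    return "high"

# A toxic pair reachable only through low-probability associations
# yields a score near 1, i.e. a deeply concealed association:
c = covertness([0.5, 0.3], [0.4, 0.2])  # p_hat = 0.15 * 0.08 = 0.012
```

The score is high exactly when the toxic pair sits at the end of unlikely association paths in both modalities, which is the regime where uni-modal detectors see nothing.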

3. Datasets and Benchmarks

Multiple resources have been introduced to surface MTC:

| Benchmark | Scope | Structure | Sample Size |
|---|---|---|---|
| MDIT-Bench | Dual-implicit toxicity | 5-way MCQ; 12 categories, 780 topics | 317,638 items |
| MMIT-Dataset | Multimodal implicit toxicity | 7 categories, 31 sub-categories | 2,100 instances |
| Covert Toxic Dataset | High-covertness toxic samples | Designed for high-MTC evaluation | size reported in (Wu et al., 3 Feb 2026) |

MDIT-Bench uses a multistage human-in-the-loop pipeline: seeding with placeholder-based toxic question templates expanded via LLMs, web-image retrieval, distractor generation to test cross-modal alignment, and multiple-choice construction. Three graded tiers probe explicit (easy), dual-implicit (medium), and jailbroken (hard) MTC. Model accuracy typically drops from 85–90 % (easy) to 40–70 % (medium); under extensive adversarial context, the hidden toxicity metric HT reaches ≈ 0.48 for leading models (Jin et al., 22 May 2025).

Covert Toxic Dataset encodes nuanced image-text associations and is used to empirically validate the statistical range and interpretability of the graph-based MTC score (Wu et al., 3 Feb 2026).

MMIT-Dataset leverages scenario decomposition and adversarial synthesis to curate balanced, category-rich multimodal implicit toxicity samples, challenging state-of-the-art moderation APIs and LMMs (Cui et al., 20 May 2025).

4. Methodologies for Detection and Red-Teaming

Approaches for probing and elucidating MTC fall into interpretability, adversarial, and reasoning-centered paradigms:

  • TAG-Assisted Detection (TA-CTD): Employs semantic graphs per modality. These graphs systematically encode semantic associations from benign entities to toxic implications, enabling computation of the MTC score and generation of natural-language justifications via semantic path tracing (Wu et al., 3 Feb 2026).
  • Deliberative Reasoning (ShieldVLM): ShieldVLM explicitly models step-wise reasoning—first parsing each modality, then synthesizing cross-modal scenarios and scoring against safety guidelines to determine toxicity and risk category. Reasoning is supervised, token-by-token, enabling the model to explicate which cross-modal mechanism (drift, metaphor, implication, knowledge, contextualization) unlocks hidden toxicity (Cui et al., 20 May 2025).
  • Neuron-Level Detoxification (SGM): White-box, neuron-suppression strategies identify multimodal “toxic expert” neurons (via AUROC across labeled examples), then apply soft suppression during inference, drastically reducing covert trigger success rates while minimally impacting reasoning (Wang et al., 17 Dec 2025).
  • Adversarial and Contextual Image Attacks (CIA, RTD):
    • CIA utilizes multi-agent systems to embed subtle harmful queries into visual contexts, refining images via semantic-drift minimization and automatic toxicity obfuscation. Attacks leverage instructional, metaphorical, and dialogic visual layouts, with toxicity scores reaching ≥ 4.7/5 and attack success rates (ASR) exceeding 90 % against leading LMMs (Xiong et al., 2 Dec 2025).
    • Red Team Diffuser (RTD) applies RL-based diffusion model adaptation, balancing toxicity amplification and semantic alignment, demonstrating that contextually adversarial images induce up to 10.7 % higher toxicity rates than text-only prompts, including on held-out and cross-model targets (Wang et al., 8 Mar 2025).
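Of the methods above, the neuron-suppression idea admits a compact sketch: rank hidden units by how well their activations separate toxic from benign inputs (via AUROC), then softly scale the top-ranked units at inference. All names, the pairwise AUROC implementation, and the suppression factor below are illustrative assumptions, not the SGM implementation:

```python
# Hedged sketch of neuron-level detoxification in the spirit of SGM.
import numpy as np

def auroc(pos, neg):
    """Probability that a toxic-input activation outranks a benign one."""
    pos = np.asarray(pos)[:, None]
    neg = np.asarray(neg)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.shape[0] * neg.shape[1])

def toxic_expert_indices(acts_toxic, acts_benign, top_k):
    """acts_*: (samples, neurons) activations; returns top-k 'expert' neurons."""
    scores = np.array([
        auroc(acts_toxic[:, j], acts_benign[:, j])
        for j in range(acts_toxic.shape[1])
    ])
    return np.argsort(scores)[-top_k:]

def soft_suppress(hidden, expert_idx, alpha=0.2):
    """Scale the selected neurons by alpha instead of zeroing them."""
    out = np.array(hidden, dtype=float, copy=True)
    out[..., expert_idx] *= alpha
    return out
```

Soft scaling (alpha > 0) rather than hard ablation is the design choice that lets such interventions cut covert-trigger success while leaving general reasoning largely intact.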

5. Empirical Findings and Limitations

Empirical investigations consistently reveal that state-of-the-art LMMs remain vulnerable to MTC:

  • Detection accuracy degrades sharply on dual-implicit or covert samples (e.g., Qwen2-VL-7B falls from >85 % on easy items to ≈41 % on medium items in MDIT-Bench (Jin et al., 22 May 2025)).
  • Covert context injection (“jailbreaking”) exposes a substantial reservoir of latent biases, even as models perform well on explicit toxicity.
  • Strictly uni-modal tools (including lexicon-based detectors) exhibit near-random performance on MTC benchmarks (Cui et al., 20 May 2025).
  • ShieldVLM, incorporating explicit multimodal reasoning, outperforms closed-source and prompt-engineered competitors on implicit cases, improving F1-score and interpretability (Cui et al., 20 May 2025).
  • Stealth-aware RL diffusion (RTD) and multi-stage visual attacks (CIA) both demonstrate that adversarial context can dramatically reduce separability between benign and toxic regions in embedding space, underscoring the risk of latent “boundary collapse” (Xiong et al., 2 Dec 2025, Wang et al., 8 Mar 2025).
  • Neuron-level interventions (SGM) can decrease harmful rates under covert triggers from ≈ 50.9 % to 10.5 % or even 4.4 % when integrated with further defenses (Wang et al., 17 Dec 2025).
  • Model reasoning chains are not always required or surfaced in current benchmarks, limiting introspection and traceability (Jin et al., 22 May 2025).

Documented limitations include:

  1. Current benchmarks primarily target prejudice and discrimination, not privacy or other axes of harm.
  2. Many datasets are synthetic or English-only, potentially omitting real-world variation (Jin et al., 22 May 2025, Cui et al., 20 May 2025).
  3. Model and benchmark approaches often rely on LLM-generated content and human-in-the-loop curation, inheriting upstream biases (Jin et al., 22 May 2025).
  4. Existing defenses may be circumvented by adaptive attacks or modalities not yet stress-tested (e.g., video, audio beyond OCR).

6. Detection, Mitigation, and Research Directions

Detection methods increasingly prioritize interpretability and quantification. TAG-based MTC scores allow for explicit thresholding and human-in-the-loop escalation (e.g., c>0.8c>0.8), and provide a basis for system-level moderation policy (Wu et al., 3 Feb 2026). Chain-of-thought reasoning, as in ShieldVLM, operationalizes transparent step-by-step diagnostics and unlocks higher implicit toxicity detection performance (Cui et al., 20 May 2025).
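Such thresholding can be wired into a moderation pipeline directly. In the sketch below, only the c > 0.8 human-escalation cutoff comes from the text; the action names and the handling of the medium band are illustrative assumptions:

```python
# Illustrative escalation policy over the TAG-based covertness score c.
def moderation_action(c, toxic_pair_found):
    if not toxic_pair_found:
        return "allow"            # no toxic association surfaced by the TAG
    if c > 0.8:
        return "escalate_human"   # deeply concealed: human-in-the-loop review
    if c >= 0.2:
        return "auto_review"      # medium covertness: secondary automated pass
    return "block"                # overt toxicity: block outright
```

The score thus doubles as a routing signal: overt cases are handled cheaply and automatically, while scarce human attention is reserved for the high-covertness tail where automated detectors are least reliable.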

Mitigation proposals include:

  • Fine-tuning on diverse, adversarial cross-modal bias cases.
  • Explicit targeting of dual-implicit scenarios in safety alignment (Jin et al., 22 May 2025).
  • Neuron-level, white-box intervention for rapid, hot-pluggable detoxification (Wang et al., 17 Dec 2025).
  • Incorporation of adversarial visual augmentations and contextually tuned reward functions in training pipelines (Xiong et al., 2 Dec 2025, Wang et al., 8 Mar 2025).
  • Automated bias correction during dataset generation and expansion to multi-modalities (video, audio).
  • Advancements in interpretability: mapping and tracing of model reasoning chains and activation clusters responsible for cross-modal toxicity (Jin et al., 22 May 2025, Wang et al., 17 Dec 2025).

Active areas for future research include generalizing covertness metrics beyond toxicity, developing certified multimodal defenses, integrating lightweight jailbreaking detectors for real-time inference, and systematizing blue-team adversarial purification procedures (Jin et al., 22 May 2025, Xiong et al., 2 Dec 2025, Wang et al., 8 Mar 2025).

7. Broader Significance and Open Questions

MTC highlights a foundational failure mode: LMMs judged safe in routine, uni-modal or supervised contexts can still be subverted—deliberately or inadvertently—via hidden, cross-modal interactions. The emerging taxonomy, evaluation frameworks, and intervention methods provide the necessary scaffolding for both scientific understanding and practical defense. However, questions remain about extending coverage to new modalities, safeguarding against evolving attack strategies, and maintaining transparency as model architectures grow in size and complexity. Research continues to expand the theoretical and empirical understanding of toxic covertness, seeking to build LMMs that are not just safe on the surface, but robustly immune to hidden cross-modal harm.

