Modality-Aware Evaluation in Multimodal ML
- Modality-aware evaluation is a framework that quantifies and balances contributions of visual, auditory, text, and other sensory inputs using metrics like MIS, MDI, and V-Ratio.
- It employs controlled benchmarks, systematic noise injection, and missing-modality simulations to assess model fusion reliability and expose modality biases.
- Diagnostic protocols and representation engineering techniques intervene in modality preferences to enhance accuracy and mitigate hallucinations in multimodal systems.
Modality-aware evaluation refers to the rigorous, quantitative assessment of multimodal machine learning systems under varying input conditions and evidence structures, with the aim of determining each modality’s contribution, preference, confidence, and dominance during inference and optimization. This paradigm is pivotal as models increasingly integrate visual, auditory, textual, and other sensory data, and there is a need to ensure both robust fusion and fair benchmarking, avoiding unimodal shortcuts. Recent research has equipped the field with principled metrics—such as Modality Importance Score (MIS) (Park et al., 2024), Modality Dominance Index (MDI) (Liu et al., 2 Jan 2026), vision/text ratio (Zhang et al., 27 May 2025), and modality-wise confidence rankings (Zou et al., 2024)—and structured protocols for dataset curation, model diagnosis, and fusion reliability estimation.
1. Quantifying Modality Contribution: MIS, MDI, and Preference Ratios
Central to modality-aware evaluation are metrics that disentangle and quantify each modality's actual impact on model outputs. The Modality Importance Score (MIS) (Park et al., 2024) is defined for a question $q$ and modality $m$ as:

$$\mathrm{MIS}_m(q) = \bar{A}_{\mathcal{S}\cup\{m\}}(q) - \bar{A}_{\mathcal{S}\setminus\{m\}}(q)$$

where $\bar{A}_{\mathcal{S}}(q)$ is the average accuracy when the model answers using modality subset $\mathcal{S}$; the first term includes $m$, the second excludes it. A positive MIS implies a performance gain from including $m$; a negative MIS implies interference. For optimization analysis, the Modality Dominance Index (MDI) (Liu et al., 2 Jan 2026) utilizes both the feature entropy $H(F_m)$ and the gradient sensitivity $G_m$ of each modality, combined (up to normalization) as:

$$\mathrm{MDI}_m = \frac{H(F_m)\,G_m}{\sum_{m'} H(F_{m'})\,G_{m'}}$$

A higher $\mathrm{MDI}_m$ indicates greater dominance by modality $m$ during training.
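Both metrics reduce to simple arithmetic once per-subset accuracies and training statistics are collected. A minimal sketch, with hypothetical values; the exact way MDI combines entropy and gradient sensitivity is an illustrative assumption, shown here as a normalized product:

```python
def modality_importance_score(acc_with: float, acc_without: float) -> float:
    """MIS for modality m on a question: average accuracy of answer
    subsets that include m minus average accuracy of subsets that exclude it."""
    return acc_with - acc_without

def modality_dominance_index(entropy: dict, grad_sensitivity: dict) -> dict:
    """Normalized product of per-modality feature entropy and gradient
    sensitivity; the largest value flags the dominant modality."""
    raw = {m: entropy[m] * grad_sensitivity[m] for m in entropy}
    total = sum(raw.values())
    return {m: v / total for m, v in raw.items()}

# Hypothetical per-question accuracies and training statistics.
mis_video = modality_importance_score(acc_with=0.82, acc_without=0.55)  # helps
mis_audio = modality_importance_score(acc_with=0.60, acc_without=0.64)  # interferes
mdi = modality_dominance_index({"rgb": 2.1, "ir": 1.4}, {"rgb": 0.9, "ir": 0.3})
```

Logging these quantities per item (MIS) and per training step (MDI) is enough to support the diagnostic protocols described below.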
In large multimodal LLMs (MLLMs), the Vision-Ratio (V-Ratio) (Zhang et al., 27 May 2025) is used to measure systemic bias:

$$\text{V-Ratio} = \frac{N_v}{N_v + N_t}$$

where $N_v$ is the number of vision-consistent predictions and $N_t$ the number of text-consistent predictions. A V-Ratio above 0.5 denotes vision bias; below 0.5, text bias.
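The computation is a one-liner; the counts below are hypothetical results from a conflict benchmark:

```python
def v_ratio(n_vision: int, n_text: int) -> float:
    """Fraction of conflict items where the prediction follows the visual evidence."""
    return n_vision / (n_vision + n_text)

# Hypothetical 200-item conflict set: 130 vision-consistent, 70 text-consistent.
r = v_ratio(n_vision=130, n_text=70)  # 0.65 -> vision-biased (> 0.5)
```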
2. Benchmark Construction and Protocols for Rigorous Modality-Aware Testing
Robust modality-aware evaluation requires datasets that induce and expose bias or preference. The MC² benchmark (Zhang et al., 27 May 2025) utilizes controlled evidence conflict scenarios where images and text support contradictory answers, validated for unimodal solvability and true conflict via automated and manual checks. Evaluation on MC² quantifies whether models favor vision or text, controlling for confounders and task complexity.
For robustness testing, protocols include systematic noise injection and missing-modality simulation (Zou et al., 2024). For example:
- Gaussian noise is added to image inputs at set standard deviations ($\sigma$ levels) to degrade modality fidelity.
- Missing-Modality scenarios are simulated by zeroing out the input tensor of a modality.
These scenarios expose whether models disproportionately rely on a single modality or maintain robust fusion.
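The two perturbation protocols can be sketched as follows; tensor shapes, the seed, and the σ sweep are illustrative:

```python
import numpy as np

def add_gaussian_noise(x: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Degrade one modality's fidelity with zero-mean Gaussian noise at a set sigma."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=x.shape)

def drop_modality(x: np.ndarray) -> np.ndarray:
    """Simulate a missing modality by zeroing out its input tensor."""
    return np.zeros_like(x)

image = np.ones((3, 64, 64), dtype=np.float32)  # hypothetical image tensor
perturbed = {s: add_gaussian_noise(image, s) for s in (0.1, 0.3, 0.5)}  # severity sweep
missing = drop_modality(image)
```

Running the same evaluation at each severity level, and with each modality dropped in turn, yields the degradation curves used to judge fusion robustness.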
3. Diagnosing and Manipulating Modality Preferences in MLLMs
Beyond observational metrics, modality-aware evaluation extends to probing and intervention. In MLLMs, latent modality preference vectors are discovered by contrasting hidden states under explicit text vs. vision cues (Zhang et al., 27 May 2025):

$$v_\ell = \bar{h}_\ell^{\text{text}} - \bar{h}_\ell^{\text{vision}}$$

where $\bar{h}_\ell^{\text{text}}$ and $\bar{h}_\ell^{\text{vision}}$ represent mean layer-$\ell$ activations under text and vision guides. By injecting scaled $\pm v_\ell$ into decoder layers, preference for either modality can be amplified at inference time without model re-training. This method delivers superior control compared to prompt engineering or few-shot learning.
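The contrastive extraction and injection can be sketched numerically. The probe count, hidden size, scaling factor, and sign convention (text-cue mean minus vision-cue mean) are illustrative assumptions:

```python
import numpy as np

def preference_vector(h_text: np.ndarray, h_vision: np.ndarray) -> np.ndarray:
    """Layer-l preference direction: mean activation under text cues minus
    mean under vision cues (rows are probe prompts)."""
    return h_text.mean(axis=0) - h_vision.mean(axis=0)

def steer(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled direction to a decoder-layer activation at inference:
    alpha > 0 amplifies the text preference, alpha < 0 the vision preference."""
    return hidden + alpha * v

rng = np.random.default_rng(0)
h_text = rng.normal(size=(16, 64))    # activations from text-guided probes
h_vision = rng.normal(size=(16, 64))  # activations from vision-guided probes
v = preference_vector(h_text, h_vision)
steered = steer(rng.normal(size=64), v, alpha=-2.0)  # push toward vision
```

In a real MLLM the activations would be captured with forward hooks on the chosen decoder layers rather than sampled randomly.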
Empirical results show that steering via representation engineering achieves higher accuracy in targeted tasks and reduces hallucination by shifting preference toward the appropriate modality.
4. Fusion Confidence, Uncertainty, and Robustness Metrics
Evaluation of multimodal systems increasingly incorporates confidence-aware methods. In multi-modal eye disease screening (Zou et al., 2024), output uncertainties are modeled using Normal–Inverse–Gamma (NIG) priors per modality, yielding both aleatoric and epistemic uncertainty. Fusion employs a mixture of Student's t distributions (MoSₜ), with the fused parameters weighted by the modalities' degrees of freedom and confidence metrics. A ranking regularizer enforces that fused confidence always meets or exceeds unimodal confidence on correct predictions, addressing over-confidence under noise and missing modalities.
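One way to express such a ranking constraint is a hinge-style penalty on correctly predicted samples; the margin and exact loss form here are assumptions, not the paper's verbatim objective:

```python
def confidence_ranking_penalty(fused_conf: float, unimodal_confs,
                               margin: float = 0.0) -> float:
    """Hinge penalty, applied on correct predictions, that is zero only
    when fused confidence meets or exceeds every unimodal confidence."""
    return sum(max(0.0, c + margin - fused_conf) for c in unimodal_confs)

ok = confidence_ranking_penalty(0.9, [0.7, 0.8])   # no violation
bad = confidence_ranking_penalty(0.6, [0.7, 0.8])  # penalized: fusion under-confident
```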
Benchmark evaluation proceeds via accuracy, Cohen's $\kappa$, expected calibration error (ECE), and area under the risk–coverage curve (AURC), revealing both raw and risk-adjusted reliability under degraded and OOD inputs.
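Of these, ECE is the least standardized in practice; a minimal equal-width-binning version, for concreteness:

```python
def expected_calibration_error(confidences, corrects, n_bins: int = 10) -> float:
    """ECE: sample-weighted gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, corrects):
        idx = min(int(c * n_bins), n_bins - 1)  # equal-width confidence bins
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(acc - conf)
    return ece

# Perfectly calibrated toy case: confidence 0.75, empirical accuracy 3/4.
ece = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```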
5. Empirical Insights into Modality Bias, Complementarity, and Dataset Curation
Structured evaluation across datasets reveals that the majority of questions (~50–90%) are either unimodal-biased or modality-agnostic correct (Park et al., 2024). Genuinely complementary items—where two modalities are required for correct inference—are rare (<3%). Detailed breakdowns from TVQA, LifeQA, and AVQA show:
| Dataset | Subtitle-biased (%) | Video-biased (%) | Complementary (%) | Modality-agnostic correct (%) |
|---|---|---|---|---|
| TVQA | 22.0 | 33.9 | 2.1 | 35.1 |
| LifeQA | 19.9 | 33.6 | 2.4 | 36.3 |
| AVQA | 4.9 | 11.7 | 0.6 | 78.5 |
This pattern justifies efforts to curate modality-balanced benchmarks. By scoring potential items via MIS, filtering for high joint-modal importance, and authoring questions with required multimodal integration, new datasets can prioritize genuine cross-modal reasoning.
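A simple MIS-driven filter captures the curation step; the threshold and item schema are hypothetical:

```python
def select_complementary(items, tau: float = 0.05):
    """Keep items where every modality has MIS above tau, i.e. each
    modality's inclusion improves accuracy (genuine complementarity)."""
    return [q for q in items
            if all(score > tau for score in q["mis"].values())]

pool = [
    {"id": "q1", "mis": {"video": 0.20, "subtitle": 0.12}},   # complementary
    {"id": "q2", "mis": {"video": 0.30, "subtitle": -0.08}},  # video-biased
    {"id": "q3", "mis": {"video": 0.01, "subtitle": 0.02}},   # near-agnostic
]
kept = select_complementary(pool)
```

Given the rarity of complementary items (<3% in the datasets above), such filtering typically discards most of a candidate pool, which is why the authors pair it with authoring new questions that require multimodal integration.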
6. Practical Methodologies for Fusion Design and Optimization Control
Modality-aware evaluation informs not only curation and analysis but also direct intervention in model training. Using MDI-driven signals, fusion weights in RGB–IR models are dynamically regulated (Liu et al., 2 Jan 2026):
- If $\mathrm{MDI}_{\mathrm{RGB}} > \mathrm{MDI}_{\mathrm{IR}}$ (RGB dominates optimization), inverse weighting and Hierarchical Cross-modal Guidance (HCG) are deployed to enhance IR's role.
- Adversarial Equilibrium Regularization (AER) further balances optimization dynamics.
- Logging MDI over time allows early detection of persistent bias, prompting data augmentation or loss strengthening for underrepresented modalities.
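An inverse-weighting rule of this kind can be sketched as follows; the normalization scheme is an assumption:

```python
def inverse_fusion_weights(mdi: dict) -> dict:
    """Weight each modality inversely to its dominance index so the
    weaker branch receives a larger share of the fused representation."""
    inv = {m: 1.0 / v for m, v in mdi.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}

# RGB dominating optimization -> IR branch is up-weighted.
w = inverse_fusion_weights({"rgb": 0.7, "ir": 0.3})
```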
Experimental evidence across SOTA benchmarks demonstrates that inverse weighting guided by MDI yields the highest accuracy (mAP50), with gradient bias correlating negatively with performance.
7. Best-Practice Guidelines and Future Directions
Study results advocate for systematic, fine-grained evaluation protocols encompassing:
- Item-wise MIS/MDI computation,
- Controlled conflict and missing-modality testing,
- Multiparametric risk assessment (accuracy, calibration, coverage),
- Confidence-aware fusion and ranking regularization.
A plausible implication is that modality-aware evaluation will be necessary for designing truly multimodal models, guiding both architecture selection and curriculum. Limitations include scaling fusion and ranking methods to more than two modalities, integrating additional reliability metrics, and expanding artifact simulation to real-world conditions.
Ongoing work will likely extend to hierarchical mixture models, active learning under high epistemic uncertainty, and application to regression or segmentation tasks, leveraging modality-aware evaluation as a foundation for robust and balanced multimodal intelligence.