Modality-Balanced Dataset

Updated 1 January 2026
  • Modality-balanced datasets are multimodal corpora designed to ensure that each modality (vision, audio, text) contributes equally, preventing dominance of any single input.
  • Construction methods include uniform distribution of sample-wise discrepancies, dynamic weighting strategies, and adversarial data generation to maintain balance.
  • Such datasets enable robust evaluation and improve model generalization by reducing bias, lowering hallucination rates, and exposing failures in multimodal fusion techniques.

A modality-balanced dataset is a multimodal benchmark, corpus, or distillation surrogate designed such that the informational or statistical contribution of each modality (e.g., vision, audio, text) is quantitatively balanced. This concept is critical for robust evaluation, algorithm development, and foundation model pretraining, as it prevents models from overfitting dominant modalities and ensures generalizable cross-modal reasoning.

1. Formal Definitions and Quantitative Metrics

Modality balance can be mathematically characterized at both the dataset and model levels. Principal metrics include sample-wise modality discrepancy, Shapley-value-based imbalance scoring, and objective-specific modality weights.

  • Sample-wise Modality Discrepancy (Xia et al., 2023):

    $D_i = \bar{u}_i^{(a)} - \bar{u}_i^{(v)}$

    where $\bar{u}_i^{(k)}$ is the average confidence of the $k$th unimodal classifier on sample $i$ ($k = a$ for audio, $v$ for visual).

  • Imbalance Degree (Shapley-based) (Xu et al., 15 Feb 2025):

    • For $M$ modalities, the Shapley contribution $\phi^i$ of modality $i$ is:

    $\phi^i = \frac{1}{M!} \sum_{\pi \in \Pi_M} \left[v(S^i_\pi \cup \{i\}) - v(S^i_\pi)\right]$

    where $v(A)$ is the model accuracy using modalities $A$.

    • The overall imbalance score:

    $\mathcal{I} = \frac{1}{\binom{M}{2}} \sum_{i < j} |\phi^i - \phi^j|$

    For the bimodal case: $\mathcal{I} = |\phi^1 - \phi^2|$.

  • Objective-specific Modality Weights $\omega_m$ (Nie et al., 16 Nov 2025):

    $\omega_m = \frac{1}{|\mathcal{B}_m|} \sum_{b \in \mathcal{B}_m} \sum_{z=1}^{Z} p_{z,m} \tilde{G}_{z,b}$

    with normalization $\sum_m \omega_m = 1$.
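The displayed formula collapses to a per-modality matrix product followed by normalization. A minimal numeric sketch of that arithmetic only, with placeholder arrays standing in for $p_{z,m}$ and $\tilde{G}_{z,b}$ (their construction in MBE2.0 is not detailed here):

```python
import numpy as np

def modality_weights(grads_by_modality, p):
    """Sketch: for each modality m, average over its batch the p-weighted sum
    of normalized gradient magnitudes G_tilde[z, b], then normalize so the
    weights sum to 1. grads_by_modality: m -> array (Z, |B_m|); p: m -> array (Z,)."""
    raw = {}
    for m, G in grads_by_modality.items():
        # p[m] @ G gives sum_z p_{z,m} * G~_{z,b} per sample b; mean = 1/|B_m| sum_b
        raw[m] = float(np.mean(p[m] @ G))
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}

# Placeholder gradient statistics for two objectives (Z=2), two samples each.
w = modality_weights(
    {'img': np.array([[0.2, 0.4], [0.6, 0.8]]),
     'txt': np.array([[0.1, 0.3], [0.2, 0.2]])},
    {'img': np.array([0.5, 0.5]), 'txt': np.array([0.5, 0.5])})
print(w)  # weights sum to 1, image modality dominates in this toy input
```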

A modality-balanced dataset ensures (a) the distribution of $D_i$ is approximately uniform (typically across $[-1, 1]$), or (b) the global imbalance score $\mathcal{I}$ is minimized, typically $\leq 0.1$ for high balance.
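Both criteria can be checked numerically. A sketch under the definitions above, with a toy accuracy table standing in for $v(A)$ and a uniformity measure chosen for illustration:

```python
import itertools
import numpy as np

def discrepancy_uniformity(D, n_bins=10):
    """Criterion (a): distance of the D_i histogram over [-1, 1] from uniform,
    reported as total variation distance (0 = perfectly uniform)."""
    hist, _ = np.histogram(D, bins=n_bins, range=(-1, 1))
    p = hist / hist.sum()
    return 0.5 * np.abs(p - 1.0 / n_bins).sum()

def imbalance_score(modalities, v):
    """Criterion (b): mean pairwise gap between exact Shapley contributions.
    v maps a frozenset of modalities to model accuracy; exact enumeration
    over orderings is feasible for small M."""
    perms = list(itertools.permutations(modalities))
    phi = {m: 0.0 for m in modalities}
    for pi in perms:
        seen = set()
        for m in pi:
            phi[m] += v(frozenset(seen | {m})) - v(frozenset(seen))
            seen.add(m)
    phi = {m: s / len(perms) for m, s in phi.items()}
    pairs = list(itertools.combinations(modalities, 2))
    return sum(abs(phi[a] - phi[b]) for a, b in pairs) / len(pairs)

# Toy checks: an evenly spread D_i sample, and a mildly audio-dominant
# bimodal model (accuracy table is invented for illustration).
tv = discrepancy_uniformity(np.linspace(-0.999, 0.999, 1000))
acc = {frozenset(): 0.5, frozenset({'a'}): 0.8,
       frozenset({'v'}): 0.6, frozenset({'a', 'v'}): 0.9}
score = imbalance_score(['a', 'v'], lambda S: acc[S])
print(tv, score)  # near-zero TV distance; bimodal imbalance |phi_a - phi_v|
```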

2. Principles and Motivations for Modality Balance

Multimodal models often default to the dominant modality in the absence of balanced data, leading to modality collapse or bias (Zhang et al., 16 May 2025, Liu et al., 20 May 2025). Modality imbalance manifests as:

  • Over-concentration of representations within one modality, i.e., large intra-modal similarity and inter-modal distributional gap.
  • Answer or label frequency skew, causing models to learn spurious correlations (e.g., always answering "yes" to a majority binary template) (Liu et al., 2023).
  • Increased hallucination rates in generative models when language priors outweigh grounded perceptual signals (Liu et al., 20 May 2025).

Balanced datasets provide stress tests across degrees of modality agreement, revealing the underperformance of fusion methods on high-discrepancy samples (Xia et al., 2023). Modality balance is thus foundational for the development of generalizable and fair multimodal models.

3. Construction Methodologies

Uniform Distribution of Discrepancy

BalancedAudiovisual (Xia et al., 2023) constructs a dataset where the per-sample discrepancy $D_i$ is uniformly distributed by:

  • Measuring $D_i$ via pre-trained unimodal models.
  • Stratifying clips into high-correspondence, audio-only-correct, and visual-only-correct bins.
  • Performing long-tail correction: oversampling underrepresented bins and removing redundant samples to flatten the $D_i$ histogram.
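The long-tail correction step can be sketched as a bin-then-resample pass over the $D_i$ values; the bin count and per-bin target below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def flatten_discrepancy(D, n_bins=10, per_bin=None, seed=0):
    """Resample indices so the histogram of D over [-1, 1] is roughly uniform:
    oversample (with replacement) underfilled bins, subsample overfilled ones."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    edges = np.linspace(-1, 1, n_bins + 1)
    bins = np.clip(np.digitize(D, edges) - 1, 0, n_bins - 1)
    if per_bin is None:
        per_bin = max(1, len(D) // n_bins)
    keep = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if len(idx) == 0:
            continue  # no observed samples in this discrepancy range
        # replace=True only when the bin is underfilled (oversampling case)
        keep.append(rng.choice(idx, size=per_bin, replace=len(idx) < per_bin))
    return np.concatenate(keep)

# Skewed toy distribution: most mass near D = 0, thin tails.
D = np.concatenate([np.zeros(80), np.linspace(-1, 1, 20)])
sel = flatten_discrepancy(D, n_bins=4, per_bin=10)
print(len(sel))  # 40 resampled indices, 10 per bin
```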

Answer Frequency Balancing

MUSIC-AVQA v2.0 (Liu et al., 2023) achieves answer balance by:

  • Identifying templates where $\max_i p_{t,i} \geq \tau$ ($\tau = 0.6$ for binary, $\tau = 0.5$ for k-way classification).
  • For each skewed template, collecting new QA pairs for deficit answers to enforce $|N_{t,\text{yes}} - N_{t,\text{no}}| \leq 1$, producing $p \approx 0.5$.
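The skew criterion reduces to a counting pass over (template, answer) pairs; the input format and helper name below are assumptions for illustration:

```python
from collections import Counter

def skewed_templates(qa_pairs, tau_binary=0.6, tau_kway=0.5):
    """Flag templates whose most frequent answer meets the skew threshold
    (tau = 0.6 for binary, 0.5 for k-way), and report how many new QA pairs
    per deficit answer would equalize counts with the majority answer."""
    by_template = {}
    for template, answer in qa_pairs:
        by_template.setdefault(template, Counter())[answer] += 1
    report = {}
    for t, counts in by_template.items():
        tau = tau_binary if len(counts) <= 2 else tau_kway
        total = sum(counts.values())
        top_answer, top = counts.most_common(1)[0]
        if top / total >= tau:
            # deficit: pairs to collect so every answer matches the majority count
            report[t] = {a: top - c for a, c in counts.items() if c < top}
    return report

# Toy template skewed 80/20 toward "yes": 6 new "no" pairs would balance it.
pairs = [('is_playing', 'yes')] * 8 + [('is_playing', 'no')] * 2
print(skewed_templates(pairs))  # {'is_playing': {'no': 6}}
```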

Dataset-Level and Training-Level Balancing

  • Explicit Multimodal Unit Counts: VAST-27M (Chen et al., 2023) ensures every video clip contains exactly one vision, audio, and subtitle track, and fixed numbers of captions for each, with uniform sampling and strict filtering for modality completeness.
  • Dynamic Weighting: MBE2.0 (Nie et al., 16 Nov 2025) employs adaptive modality weights ($\omega_m$) and dynamic sample filtering to enforce both global and local balance across image-only, text-only, and multimodal samples.

Adversarial and Policy-Based Data Generation

MBPO (Liu et al., 20 May 2025) constructs balanced preference datasets for LMM alignment via:

  • Hard negative mining: generating "loser" responses by adversarial perturbation of images to elicit language-biased answers, paired with visually grounded "winner" responses.
  • Keeping equal representation of high and low Image Information Gain samples, resulting in a dataset with balanced modality utilization signals.
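The equal-representation step can be sketched as a median split on a per-sample gain score; `gain` is a placeholder and the split rule is an assumption, since the source does not specify how pairs are drawn:

```python
import numpy as np

def balance_by_gain(samples, gain, n_pairs, seed=0):
    """Draw an equal number of preference pairs from high- and low-gain halves,
    splitting at the median of a per-sample Image Information Gain score.
    How IIG itself is computed is not specified here."""
    rng = np.random.default_rng(seed)
    gain = np.asarray(gain, dtype=float)
    median = np.median(gain)
    high = np.flatnonzero(gain >= median)
    low = np.flatnonzero(gain < median)
    take = n_pairs // 2
    picked = np.concatenate([rng.choice(high, take, replace=False),
                             rng.choice(low, take, replace=False)])
    return [samples[i] for i in picked]

# 100 dummy samples with random gain scores; request 20 balanced pairs.
scores = np.random.default_rng(1).random(100)
subset = balance_by_gain(list(range(100)), scores, 20)
print(len(subset))  # 20: ten from each half of the gain distribution
```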

4. Representative Datasets and Benchmarks

  Dataset/Benchmark    Modalities                        Balance Method
  BalancedAudiovisual  Audio + Visual                    Uniform $D_i$ bins
  MUSIC-AVQA v2.0      Audio + Visual + Text (QA)        Balanced answer freq.
  VAST-27M             Vision + Audio + Subtitle + Text  Fixed tracks/captions
  BalancedAV           Audio + Visual                    Minimizing $\mathcal{I}$
  MBE2.0               Text + Image + Multimodal         Dynamic $\omega_m$

BalancedAV (Xu et al., 15 Feb 2025) is constructed to yield the lowest empirical imbalance score ($\mathcal{I} \sim 0.1$) among standard benchmarks, whereas CREMA-D exhibits the highest ($\mathcal{I} \sim 0.3$ to $0.4$), confirming that naive dataset assembly often creates strong modality dominance.

5. Evaluation and Analysis Protocols

Performance of algorithms on modality-balanced datasets is evaluated via overall, modality-preferred, and high-discrepancy subset accuracies (Xia et al., 2023). Key findings include:

  • No multimodal fusion algorithm outperforms the unimodal baseline on hard (high-discrepancy) subsets.
  • Modality-balanced datasets are necessary to expose such failure modes and to evaluate how well fusion algorithms or preference-tuned LMMs actually use underrepresented modalities.
  • Balanced pretraining, as in VAST-27M (Chen et al., 2023), yields foundation models that surpass state-of-the-art performance across all tested modality pairings, confirming that construction protocol and groupwise cycling sustain balance through training.

Additional metrics include per-modality retrieval accuracy, response bias suppression, hallucination rates (e.g., CHAIR metrics in MBPO (Liu et al., 20 May 2025)), and empirical decrease in answer prior exploitation (Liu et al., 2023).

6. Recommendations and Impact

Researchers recommend systematic reporting of performance stratified by modality discrepancy (Xia et al., 2023), adoption of Shapley-value or permutation-invariant metrics for dataset curation (Xu et al., 15 Feb 2025), and provision of evaluation splits spanning a spectrum of imbalance levels. Overemphasis on absolute balance may be suboptimal; relative balance, in which each modality's contribution is commensurate with its unique information, is preferable (Xu et al., 15 Feb 2025). Hierarchical balancing, dynamic reweighting during training, and robust evaluation protocols are advocated in newer benchmarks such as MBE2.0.

The construction and deployment of modality-balanced datasets are essential for understanding, benchmarking, and pretraining next-generation multimodal and foundation models, particularly as multi-track or omni-modality corpora become the norm in both academic and real-world settings.
