Modality-Balanced Dataset
- Modality-balanced datasets are multimodal corpora designed to ensure that each modality (vision, audio, text) contributes equally, preventing dominance of any single input.
- Construction methods include uniform distribution of sample-wise discrepancies, dynamic weighting strategies, and adversarial data generation to maintain balance.
- Such datasets enable robust evaluation and improve model generalization by reducing bias, lowering hallucination rates, and exposing failures in multimodal fusion techniques.
A modality-balanced dataset is a multimodal benchmark, corpus, or distillation surrogate designed such that the informational or statistical contribution of each modality (e.g., vision, audio, text) is quantitatively balanced. This concept is critical for robust evaluation, algorithm development, and foundation model pretraining, as it prevents models from overfitting dominant modalities and ensures generalizable cross-modal reasoning.
1. Formal Definitions and Quantitative Metrics
Modality balance can be mathematically characterized at both the dataset and model levels. Principal metrics include sample-wise modality discrepancy, Shapley-value-based imbalance scoring, and objective-specific modality weights.
- Sample-wise Modality Discrepancy $d_i$ (Xia et al., 2023): $d_i = c_i^a - c_i^v$,
where $c_i^m$ is the average confidence of the $m$-th unimodal classifier on sample $i$ ($m = a$ for audio, $m = v$ for visual).
- Imbalance Degree (Shapley-based) (Xu et al., 15 Feb 2025):
- For $M$ modalities $\mathcal{M}$, the Shapley contribution of modality $m$ is $\phi_m = \sum_{S \subseteq \mathcal{M} \setminus \{m\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ v(S \cup \{m\}) - v(S) \right]$,
where $v(S)$ is the model accuracy using modalities $S$. - The overall imbalance score normalizes the spread of contributions: $\kappa = \frac{\max_m \phi_m - \min_m \phi_m}{\sum_m \phi_m}$.
For the bimodal case: $\kappa = \frac{|\phi_a - \phi_v|}{\phi_a + \phi_v}$.
- Objective-specific Modality Weights $w_m$ (Nie et al., 16 Nov 2025): each training objective assigns a weight $w_m$ to modality $m$,
with normalization $\sum_m w_m = 1$.
A modality-balanced dataset ensures (a) the distribution of $d_i$ is approximately uniform (typically across its full range $[-1, 1]$), or (b) the global imbalance score $\kappa$ is minimized, with $\kappa \approx 0$ indicating high balance.
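The metrics above can be sketched in a few lines of Python. This is an illustrative implementation, not the papers' reference code: `bimodal_shapley` uses the closed-form two-player Shapley values over an accuracy table, and `bimodal_imbalance` assumes the normalized score $|\phi_a - \phi_v| / (\phi_a + \phi_v)$.

```python
import numpy as np

def sample_discrepancy(conf_audio, conf_visual):
    """Per-sample modality discrepancy d_i = c_i^a - c_i^v."""
    return np.asarray(conf_audio, dtype=float) - np.asarray(conf_visual, dtype=float)

def bimodal_shapley(acc):
    """Shapley contributions (phi_a, phi_v) for two modalities.

    `acc` maps frozensets of modality names ('a', 'v') to model accuracy;
    the empty set entry is the no-input (chance) accuracy.
    """
    e, a, v = frozenset(), frozenset("a"), frozenset("v")
    av = a | v
    phi_a = 0.5 * ((acc[a] - acc[e]) + (acc[av] - acc[v]))
    phi_v = 0.5 * ((acc[v] - acc[e]) + (acc[av] - acc[a]))
    return phi_a, phi_v

def bimodal_imbalance(phi_a, phi_v):
    """Normalized imbalance kappa = |phi_a - phi_v| / (phi_a + phi_v)."""
    return abs(phi_a - phi_v) / (phi_a + phi_v)
```

By the efficiency property of Shapley values, $\phi_a + \phi_v$ recovers the full-model accuracy gain over the no-input baseline, which makes the normalized score comparable across datasets.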
2. Principles and Motivations for Modality Balance
Multimodal models often default to the dominant modality in the absence of balanced data, leading to modality collapse or bias (Zhang et al., 16 May 2025, Liu et al., 20 May 2025). Modality imbalance manifests as:
- Over-concentration of representations within one modality, i.e., large intra-modal similarity and inter-modal distributional gap.
- Answer or label frequency skew, causing models to learn spurious correlations (e.g., always answering "yes" to a majority binary template) (Liu et al., 2023).
- Increased hallucination rates in generative models when language priors outweigh grounded perceptual signals (Liu et al., 20 May 2025).
Balanced datasets provide stress tests across degrees of modality agreement, revealing underperformance of fusion methods on high-discrepancy samples (Xia et al., 2023). Modality balance is thus foundational for the development of generalizable and fair multimodal models.
3. Construction Methodologies
Uniform Distribution of Discrepancy
BalancedAudiovisual (Xia et al., 2023) constructs a dataset where the per-sample discrepancy $d_i$ is uniformly distributed by:
- Measuring $d_i$ via pre-trained unimodal models.
- Stratifying clips into high correspondence, audio-only, and visual-only correct.
- Performing long-tail correction: oversampling underrepresented bins and removing redundant samples to flatten the histogram.
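The long-tail correction step can be sketched as follows. The bin count, the oversample-to-largest-bin target, and the function names are illustrative assumptions, not the paper's exact pipeline:

```python
import random
from collections import defaultdict

def flatten_by_discrepancy(samples, discrepancies, n_bins=5, seed=0):
    """Long-tail correction sketch: stratify samples by discrepancy,
    then oversample sparse bins so every bin has equal mass."""
    rng = random.Random(seed)
    lo, hi = min(discrepancies), max(discrepancies)
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    bins = defaultdict(list)
    for s, d in zip(samples, discrepancies):
        idx = min(int((d - lo) / width), n_bins - 1)  # clamp d == hi into last bin
        bins[idx].append(s)
    target = max(len(b) for b in bins.values())
    balanced = []
    for b in bins.values():
        balanced.extend(b)
        balanced.extend(rng.choices(b, k=target - len(b)))  # top up the deficit
    return balanced
```

A production variant would instead pick a median target and also subsample overfull bins (the "removing redundant samples" direction), but the flattening idea is the same.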
Answer Frequency Balancing
MUSIC-AVQA v2.0 (Liu et al., 2023) achieves answer balance by:
- Identifying skewed templates, i.e., those where the most frequent answer's frequency exceeds a threshold ($0.5$ for $k$-way classification, with an analogous threshold for binary templates).
- For each skewed template, collecting new QA pairs for the deficit answers to enforce an approximately uniform answer distribution, yielding the balanced MUSIC-AVQA v2.0 benchmark.
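A minimal sketch of the deficit computation, assuming a simple top-up-to-majority strategy (the threshold default and the strategy are illustrative, not the exact MUSIC-AVQA v2.0 procedure):

```python
from collections import Counter

def deficit_per_answer(answers, threshold=0.5):
    """For one QA template, return how many new QA pairs each minority
    answer needs so the answer distribution becomes uniform.

    Returns {} when no answer's share exceeds `threshold` (already balanced).
    """
    counts = Counter(answers)
    total = sum(counts.values())
    if max(counts.values()) / total <= threshold:
        return {}
    target = max(counts.values())  # top up everything to the majority count
    return {a: target - c for a, c in counts.items() if c < target}
```

After collecting the reported number of new pairs per deficit answer, every answer occurs equally often, so the majority share falls to $1/k$ for a $k$-way template.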
Dataset-Level and Training-Level Balancing
- Explicit Multimodal Unit Counts: VAST-27M (Chen et al., 2023) ensures every video clip contains exactly one vision, audio, and subtitle track, and fixed numbers of captions for each, with uniform sampling and strict filtering for modality completeness.
- Dynamic Weighting: MBE2.0 (Nie et al., 16 Nov 2025) employs adaptive modality weights ($w_m$) and dynamic sample filtering to enforce both global and local balance across image-only, text-only, and multimodal samples.
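One plausible realization of adaptive modality weights is a softmax over negated per-modality scores, so that under-performing modalities receive larger weight while $\sum_m w_m = 1$ holds by construction. This is a hedged sketch; MBE2.0's actual update rule is not reproduced here, and the temperature parameter is an assumption:

```python
import math

def adaptive_weights(per_modality_scores, temperature=1.0):
    """Softmax over negated scores: the weaker a modality's recent
    score, the larger its training weight. Weights sum to 1."""
    logits = [-s / temperature for s in per_modality_scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Lowering the temperature sharpens the correction toward the weakest modality; raising it approaches uniform weighting.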
Adversarial and Policy-Based Data Generation
MBPO (Liu et al., 20 May 2025) constructs balanced preference datasets for LMM alignment via:
- Hard negative mining: generating "loser" responses by adversarial perturbation of images to elicit language-biased answers, paired with visually grounded "winner" responses.
- Keeping equal representation of high and low Image Information Gain samples, resulting in a dataset with balanced modality utilization signals.
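The equal-representation step can be sketched as a median split on Image Information Gain followed by symmetric sampling. This is a simplification of MBPO's selection, and the function name is illustrative:

```python
def balance_by_gain(samples, gains, per_group=None):
    """Keep equal numbers of low- and high-Image-Information-Gain samples
    by splitting the gain-sorted order at its midpoint."""
    order = sorted(range(len(samples)), key=lambda i: gains[i])
    half = len(order) // 2
    low, high = order[:half], order[half:]
    k = per_group or min(len(low), len(high))
    # take the k lowest-gain and k highest-gain samples
    return [samples[i] for i in low[:k]] + [samples[i] for i in high[-k:]]
```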
4. Representative Datasets and Benchmarks
| Dataset/Benchmark | Modalities | Balance Method |
|---|---|---|
| BalancedAudiovisual | Audio + Visual | Uniform bins |
| MUSIC-AVQA v2.0 | Audio + Visual + Text (QA) | Balanced answer freq. |
| VAST-27M | Vision + Audio + Subtitle + Text | Fixed tracks/captions |
| BalancedAV | Audio + Visual | Minimizing imbalance score $\kappa$ |
| MBE2.0 | Text + Image + Multimodal | Dynamic weights $w_m$ |
BalancedAV (Xu et al., 15 Feb 2025) is constructed to yield the lowest empirical imbalance score among standard benchmarks, whereas CREMA-D exhibits the highest, confirming that naive dataset assembly often creates strong modality dominance.
5. Evaluation and Analysis Protocols
Performance of algorithms on modality-balanced datasets is evaluated via overall, modality-preferred, and high-discrepancy subset accuracies (Xia et al., 2023). Key findings include:
- No multimodal fusion algorithm outperforms the unimodal baseline on hard (high-discrepancy) subsets.
- Modality-balanced datasets are necessary to expose such failure modes and to evaluate how well fusion algorithms or preference-tuned LMMs actually use underrepresented modalities.
- Balanced pretraining, as in VAST-27M (Chen et al., 2023), yields foundation models that surpass state-of-the-art performance across all tested modality pairings, confirming that construction protocol and groupwise cycling sustain balance through training.
Additional metrics include per-modality retrieval accuracy, response bias suppression, hallucination rates (e.g., CHAIR metrics in MBPO (Liu et al., 20 May 2025)), and empirical decrease in answer prior exploitation (Liu et al., 2023).
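Stratified reporting along these lines can be implemented directly; the bin edges below are illustrative, with the middle bin standing in for modality-agreeing samples and the outer bins for the hard, high-discrepancy subsets:

```python
import numpy as np

def stratified_accuracy(preds, labels, discrepancies,
                        edges=(-1.0, -0.3, 0.3, 1.0)):
    """Accuracy overall and per discrepancy stratum [lo, hi)."""
    preds, labels, d = map(np.asarray, (preds, labels, discrepancies))
    report = {"overall": float((preds == labels).mean())}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (d >= lo) & (d < hi)
        if mask.any():  # skip empty strata rather than divide by zero
            report[f"[{lo},{hi})"] = float((preds[mask] == labels[mask]).mean())
    return report
```

A fusion method that only matches its unimodal baseline on the outer bins, despite a strong overall score, is exactly the failure mode the balanced benchmarks above are designed to expose.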
6. Recommendations and Impact
Research recommends systematic reporting of stratified performance by modality discrepancy (Xia et al., 2023), adoption of Shapley-value or permutation-invariant metrics for dataset curation (Xu et al., 15 Feb 2025), and the provision of evaluation splits spanning a spectrum of imbalance levels. Overemphasis on absolute balance may be suboptimal; relative balance (modality contributions commensurate with their unique information) is preferable (Xu et al., 15 Feb 2025). Hierarchical balancing, dynamic reweighting during training, and robust evaluation protocols are advocated in new benchmarks such as MBE2.0.
The construction and deployment of modality-balanced datasets are essential for understanding, benchmarking, and pretraining next-generation multimodal and foundation models, particularly as multi-track or omni-modality corpora become the norm in both academic and real-world settings.