Modality-Balanced Dataset
- Modality-balanced datasets are multimodal corpora designed to ensure that each modality (vision, audio, text) contributes equally, preventing dominance of any single input.
- Construction methods include uniform distribution of sample-wise discrepancies, dynamic weighting strategies, and adversarial data generation to maintain balance.
- Such datasets enable robust evaluation and improve model generalization by reducing bias, lowering hallucination rates, and exposing failures in multimodal fusion techniques.
A modality-balanced dataset is a multimodal benchmark, corpus, or distillation surrogate designed such that the informational or statistical contribution of each modality (e.g., vision, audio, text) is quantitatively balanced. This concept is critical for robust evaluation, algorithm development, and foundation model pretraining, as it prevents models from overfitting dominant modalities and ensures generalizable cross-modal reasoning.
1. Formal Definitions and Quantitative Metrics
Modality balance can be mathematically characterized at both the dataset and model levels. Principal metrics include sample-wise modality discrepancy, Shapley-value-based imbalance scoring, and objective-specific modality weights.
- Sample-wise Modality Discrepancy $d_i$ (Xia et al., 2023): $d_i = c_i^a - c_i^v$,
where $c_i^m$ is the average confidence of the $m$-th unimodal classifier on sample $i$ ($m = a$ for audio, $m = v$ for visual).
- Imbalance Degree (Shapley-based) (Xu et al., 15 Feb 2025):
- For $M$ modalities $\mathcal{M}$, the Shapley contribution of modality $m$ is $\phi_m = \sum_{S \subseteq \mathcal{M} \setminus \{m\}} \frac{|S|!\,(M - |S| - 1)!}{M!} \left[ v(S \cup \{m\}) - v(S) \right]$,
where $v(S)$ is the model accuracy using modalities $S$. - The overall imbalance score normalizes the spread of contributions: $\kappa = \frac{\max_m \phi_m - \min_m \phi_m}{\sum_m \phi_m}$.
For the bimodal case: $\kappa = \frac{|\phi_a - \phi_v|}{\phi_a + \phi_v}$.
- Objective-specific Modality Weights $w_m$ (Nie et al., 16 Nov 2025): each training objective assigns a weight $w_m$ to modality $m$,
with normalization $\sum_m w_m = 1$.
A modality-balanced dataset ensures (a) the distribution of $d_i$ is approximately uniform (typically across its full range $[-1, 1]$), or (b) the global imbalance score $\kappa$ is minimized, with $\kappa \approx 0$ indicating high balance.
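The metrics above can be sketched in a few lines of Python. This is an illustrative implementation, not the papers' reference code: `bimodal_shapley` uses the closed-form two-player Shapley values over an accuracy table, and `bimodal_imbalance` assumes the normalized score $|\phi_a - \phi_v| / (\phi_a + \phi_v)$.

```python
import numpy as np

def sample_discrepancy(conf_audio, conf_visual):
    """Per-sample modality discrepancy d_i = c_i^a - c_i^v."""
    return np.asarray(conf_audio, dtype=float) - np.asarray(conf_visual, dtype=float)

def bimodal_shapley(acc):
    """Shapley contributions (phi_a, phi_v) for two modalities.

    `acc` maps frozensets of modality names ('a', 'v') to model accuracy;
    the empty set entry is the no-input (chance) accuracy.
    """
    e, a, v = frozenset(), frozenset("a"), frozenset("v")
    av = a | v
    phi_a = 0.5 * ((acc[a] - acc[e]) + (acc[av] - acc[v]))
    phi_v = 0.5 * ((acc[v] - acc[e]) + (acc[av] - acc[a]))
    return phi_a, phi_v

def bimodal_imbalance(phi_a, phi_v):
    """Normalized imbalance kappa = |phi_a - phi_v| / (phi_a + phi_v)."""
    return abs(phi_a - phi_v) / (phi_a + phi_v)
```

By the efficiency property of Shapley values, $\phi_a + \phi_v$ recovers the full-model accuracy gain over the no-input baseline, which makes the normalized score comparable across datasets.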
2. Principles and Motivations for Modality Balance
Multimodal models often default to the dominant modality in the absence of balanced data, leading to modality collapse or bias (Zhang et al., 16 May 2025, Liu et al., 20 May 2025). Modality imbalance manifests as:
- Over-concentration of representations within one modality, i.e., large intra-modal similarity and inter-modal distributional gap.
- Answer or label frequency skew, causing models to learn spurious correlations (e.g., always answering "yes" to a majority binary template) (Liu et al., 2023).
- Increased hallucination rates in generative models when language priors outweigh grounded perceptual signals (Liu et al., 20 May 2025).
Balanced datasets provide stress tests across degrees of modality agreement, revealing underperformance of fusion methods on high-discrepancy samples (Xia et al., 2023). Modality balance is thus foundational for the development of generalizable and fair multimodal models.
3. Construction Methodologies
Uniform Distribution of Discrepancy
BalancedAudiovisual (Xia et al., 2023) constructs a dataset where the per-sample discrepancy $d_i$ is uniformly distributed by:
- Measuring $d_i$ via pre-trained unimodal models.
- Stratifying clips into high correspondence, audio-only, and visual-only correct.
- Performing long-tail correction: oversampling underrepresented bins and removing redundant samples to flatten the histogram.
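The long-tail correction step can be sketched as follows. The bin count, the oversample-to-largest-bin target, and the function names are illustrative assumptions, not the paper's exact pipeline:

```python
import random
from collections import defaultdict

def flatten_by_discrepancy(samples, discrepancies, n_bins=5, seed=0):
    """Long-tail correction sketch: stratify samples by discrepancy,
    then oversample sparse bins so every bin has equal mass."""
    rng = random.Random(seed)
    lo, hi = min(discrepancies), max(discrepancies)
    width = (hi - lo) / n_bins or 1.0  # guard against a degenerate range
    bins = defaultdict(list)
    for s, d in zip(samples, discrepancies):
        idx = min(int((d - lo) / width), n_bins - 1)  # clamp d == hi into last bin
        bins[idx].append(s)
    target = max(len(b) for b in bins.values())
    balanced = []
    for b in bins.values():
        balanced.extend(b)
        balanced.extend(rng.choices(b, k=target - len(b)))  # top up the deficit
    return balanced
```

A production variant would instead pick a median target and also subsample overfull bins (the "removing redundant samples" direction), but the flattening idea is the same.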
Answer Frequency Balancing
MUSIC-AVQA v2.0 (Liu et al., 2023) achieves answer balance by:
- Identifying skewed templates, i.e., those where the most frequent answer's frequency exceeds a threshold ($0.5$ for $k$-way classification, with an analogous threshold for binary templates).
- For each skewed template, collecting new QA pairs for the deficit answers to enforce an approximately uniform answer distribution, yielding the balanced MUSIC-AVQA v2.0 benchmark.
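A minimal sketch of the deficit computation, assuming a simple top-up-to-majority strategy (the threshold default and the strategy are illustrative, not the exact MUSIC-AVQA v2.0 procedure):

```python
from collections import Counter

def deficit_per_answer(answers, threshold=0.5):
    """For one QA template, return how many new QA pairs each minority
    answer needs so the answer distribution becomes uniform.

    Returns {} when no answer's share exceeds `threshold` (already balanced).
    """
    counts = Counter(answers)
    total = sum(counts.values())
    if max(counts.values()) / total <= threshold:
        return {}
    target = max(counts.values())  # top up everything to the majority count
    return {a: target - c for a, c in counts.items() if c < target}
```

After collecting the reported number of new pairs per deficit answer, every answer occurs equally often, so the majority share falls to $1/k$ for a $k$-way template.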
Dataset-Level and Training-Level Balancing
- Explicit Multimodal Unit Counts: VAST-27M (Chen et al., 2023) ensures every video clip contains exactly one vision, audio, and subtitle track, and fixed numbers of captions for each, with uniform sampling and strict filtering for modality completeness.
- Dynamic Weighting: MBE2.0 (Nie et al., 16 Nov 2025) employs adaptive modality weights ($w_m$) and dynamic sample filtering to enforce both global and local balance across image-only, text-only, and multimodal samples.
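One plausible realization of adaptive modality weights is a softmax over negated per-modality scores, so that under-performing modalities receive larger weight while $\sum_m w_m = 1$ holds by construction. This is a hedged sketch; MBE2.0's actual update rule is not reproduced here, and the temperature parameter is an assumption:

```python
import math

def adaptive_weights(per_modality_scores, temperature=1.0):
    """Softmax over negated scores: the weaker a modality's recent
    score, the larger its training weight. Weights sum to 1."""
    logits = [-s / temperature for s in per_modality_scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Lowering the temperature sharpens the correction toward the weakest modality; raising it approaches uniform weighting.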
Adversarial and Policy-Based Data Generation
MBPO (Liu et al., 20 May 2025) constructs balanced preference datasets for LMM alignment via:
- Hard negative mining: generating "loser" responses by adversarial perturbation of images to elicit language-biased answers, paired with visually grounded "winner" responses.
- Keeping equal representation of high and low Image Information Gain samples, resulting in a dataset with balanced modality utilization signals.
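The equal-representation step can be sketched as a median split on Image Information Gain followed by symmetric sampling. This is a simplification of MBPO's selection, and the function name is illustrative:

```python
def balance_by_gain(samples, gains, per_group=None):
    """Keep equal numbers of low- and high-Image-Information-Gain samples
    by splitting the gain-sorted order at its midpoint."""
    order = sorted(range(len(samples)), key=lambda i: gains[i])
    half = len(order) // 2
    low, high = order[:half], order[half:]
    k = per_group or min(len(low), len(high))
    # take the k lowest-gain and k highest-gain samples
    return [samples[i] for i in low[:k]] + [samples[i] for i in high[-k:]]
```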
4. Representative Datasets and Benchmarks
| Dataset/Benchmark | Modalities | Balance Method |
|---|---|---|
| BalancedAudiovisual | Audio + Visual | Uniform bins |
| MUSIC-AVQA v2.0 | Audio + Visual + Text (QA) | Balanced answer freq. |
| VAST-27M | Vision + Audio + Subtitle + Text | Fixed tracks/captions |
| BalancedAV | Audio + Visual | Minimizing imbalance score $\kappa$ |
| MBE2.0 | Text + Image + Multimodal | Dynamic weights $w_m$ |
BalancedAV (Xu et al., 15 Feb 2025) is constructed to yield the lowest empirical imbalance score among standard benchmarks, whereas CREMA-D exhibits the highest, confirming that naive dataset assembly often creates strong modality dominance.
5. Evaluation and Analysis Protocols
Performance of algorithms on modality-balanced datasets is evaluated via overall, modality-preferred, and high-discrepancy subset accuracies (Xia et al., 2023). Key findings include:
- No multimodal fusion algorithm outperforms the unimodal baseline on hard (high-discrepancy) subsets.
- Modality-balanced datasets are necessary to expose such failure modes and to evaluate how well fusion algorithms or preference-tuned LMMs actually use underrepresented modalities.
- Balanced pretraining, as in VAST-27M (Chen et al., 2023), yields foundation models that surpass state-of-the-art performance across all tested modality pairings, confirming that construction protocol and groupwise cycling sustain balance through training.
Additional metrics include per-modality retrieval accuracy, response bias suppression, hallucination rates (e.g., CHAIR metrics in MBPO (Liu et al., 20 May 2025)), and empirical decrease in answer prior exploitation (Liu et al., 2023).
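Stratified reporting along these lines can be implemented directly; the bin edges below are illustrative, with the middle bin standing in for modality-agreeing samples and the outer bins for the hard, high-discrepancy subsets:

```python
import numpy as np

def stratified_accuracy(preds, labels, discrepancies,
                        edges=(-1.0, -0.3, 0.3, 1.0)):
    """Accuracy overall and per discrepancy stratum [lo, hi)."""
    preds, labels, d = map(np.asarray, (preds, labels, discrepancies))
    report = {"overall": float((preds == labels).mean())}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (d >= lo) & (d < hi)
        if mask.any():  # skip empty strata rather than divide by zero
            report[f"[{lo},{hi})"] = float((preds[mask] == labels[mask]).mean())
    return report
```

A fusion method that only matches its unimodal baseline on the outer bins, despite a strong overall score, is exactly the failure mode the balanced benchmarks above are designed to expose.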
6. Recommendations and Impact
Research recommends systematic reporting of stratified performance by modality discrepancy (Xia et al., 2023), adoption of Shapley-value or permutation-invariant metrics for dataset curation (Xu et al., 15 Feb 2025), and the provision of evaluation splits spanning a spectrum of imbalance levels. Overemphasis on absolute balance may be suboptimal; relative balance (modality contributions commensurate with their unique information) is preferable (Xu et al., 15 Feb 2025). Hierarchical balancing, dynamic reweighting during training, and robust evaluation protocols are advocated in new benchmarks such as MBE2.0.
The construction and deployment of modality-balanced datasets are essential for understanding, benchmarking, and pretraining next-generation multimodal and foundation models, particularly as multi-track or omni-modality corpora become the norm in both academic and real-world settings.