BiAudio Dataset: Multimodal & Binaural Benchmarks
- The name "BiAudio" denotes two independent corpora: a balanced benchmark for modality-imbalance analysis and a paired corpus for video-driven binaural audio synthesis.
- It offers a balanced audiovisual collection with uniform modality discrepancies to evaluate fusion performance in multimodal classifiers.
- The binaural dataset employs FOA processing and dynamic camera simulation for high-fidelity spatial audio generation in diverse scenes.
The term "BiAudio Dataset" refers to two independently developed, large-scale audiovisual datasets designed for distinct research purposes in multimodal machine learning and spatial audio synthesis. Both datasets adopt the "BiAudio" name but differ in scope, construction principles, modality targets, and application domains. The first "BiAudio" dataset (Xia et al., 2023) provides a balanced corpus for investigating modality imbalance in audio-visual classification. The second "BiAudio" dataset (ViSAudio project, 2025) targets end-to-end video-driven binaural audio generation with open-domain spatial diversity. Both have become relevant benchmarks, reflecting advances in dataset design for multimodal learning and spatial audio generation (Xia et al., 2023, Zhang et al., 2 Dec 2025).
1. Dataset Purposes and Conceptual Distinctions
The BiAudio dataset introduced by Xia et al. (“Balanced Audiovisual Dataset for Imbalance Analysis”) explicitly addresses the challenge of modality bias and imbalance in multimodal learning. Its design principle imposes a uniform distribution of per-sample modality discrepancy, defined as $\delta = |c_a - c_v|$, where $c_a$ and $c_v$ denote the unimodal softmax confidences for the audio and visual modalities, respectively (Xia et al., 2023).
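This discrepancy statistic can be computed directly from the two unimodal classifiers' softmax outputs. A minimal NumPy sketch (the function names and toy logits are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def modality_discrepancy(audio_logits, visual_logits):
    """delta = |c_a - c_v|, where c_m is the peak softmax
    confidence of the unimodal classifier for modality m."""
    c_a = softmax(audio_logits).max(axis=-1)
    c_v = softmax(visual_logits).max(axis=-1)
    return np.abs(c_a - c_v)

# Toy example: a confident audio prediction vs. an uncertain visual one.
audio = np.array([[4.0, 0.0, 0.0]])   # peaked -> high confidence
visual = np.array([[0.1, 0.0, 0.0]])  # flat -> low confidence
delta = modality_discrepancy(audio, visual)
```

A high-$\delta$ sample like this one is exactly the kind the dataset deliberately over-represents relative to web-crawled corpora.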
In contrast, the BiAudio dataset used in ViSAudio (“ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation”) targets the generation and evaluation of spatially consistent binaural audio from monocular video sequences in real-world conditions. Here, the primary focus is on open-domain coverage, spatio-temporal diversity, and high-fidelity binaural audio derived from first-order Ambisonics (FOA), supporting research in audiovisual scene synthesis and 3D audio learning (Zhang et al., 2 Dec 2025).
2. Data Composition and Categories
BiAudio (Balanced Audiovisual)
- Total clips: 34,000, each 10 s in duration.
- Sources: Kinetics-400 (10,000 clips), VGG-Sound (8,000), YouTube (16,000).
- Categories: 30 categories, chosen to align across Kinetics-400 and VGG-Sound (e.g., "dog barking," "playing guitar").
- Labeling: Inherited or inferred labels, cross-verified by off-the-shelf classifiers, achieving approximate class-uniformity (±5%) (Xia et al., 2023).
BiAudio (Binaural Video–Audio)
- Total pairs: 97,000 video–binaural audio pairs, each 8 s.
- Scene diversity: Wide (indoor/outdoor, water, wildlife, crowds, traffic, etc.).
- Categories: 2,360 visible and 1,265 invisible sound categories from captions.
- Captions: Generated in two stages using Qwen2.5-Omni and Qwen3-Instruct for separating visible/background sound events (Zhang et al., 2 Dec 2025).
| Dataset | # Clips / Pairs | Duration | Categories |
|---|---|---|---|
| BiAudio (Balanced Audiovisual) | 34,000 | 10 s | 30 event categories |
| BiAudio (Binaural Video–Audio) | 97,000 | 8 s | 2,360 visible / 1,265 invisible |
3. Construction Protocols and Processing Pipelines
BiAudio (Balanced Audiovisual)
- Audio preprocessing: 48 kHz log-Mel spectrograms.
- Visual preprocessing: 1 fps sampled frames into SlowFast ResNet-18 (pretrained on Kinetics-400).
- Uniform modality discrepancy: Pretrained VGGish (audio) and SlowFast (visual) models generate confidence-based partitioning.
- Partitioning: High-correspondence (both modalities strong), audio-correct, and visual-correct (≈33% each).
- Outlier exclusion: Iterative re-estimation and removal of outliers so that the empirical $\delta$ distribution is flattened across $[0, 1]$ (Xia et al., 2023).
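The uniformity constraint amounts to flattening the empirical histogram of discrepancy values. A hypothetical sketch of one flattening pass, capping each bin at the smallest nonempty bin's count (not the authors' exact procedure):

```python
import numpy as np

def flatten_delta_distribution(deltas, n_bins=10, seed=0):
    """Subsample so the empirical distribution of delta is roughly
    uniform over [0, 1]: keep at most min-nonempty-bin-count
    samples per bin (illustrative procedure, not the paper's)."""
    rng = np.random.default_rng(seed)
    bins = np.minimum((deltas * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    cap = counts[counts > 0].min()  # ignore empty bins
    keep = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx.tolist())
    return np.sort(np.array(keep))

# Skewed toy distribution standing in for raw web-crawled discrepancies.
deltas = np.random.default_rng(1).beta(2, 5, size=2000)
kept = flatten_delta_distribution(deltas)
```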
BiAudio (Binaural Video–Audio)
- Source: Sphere360 FOA + 360° video corpus.
- Sound-source localization: FOA decomposition using spherical harmonics; energy map identifies principal auditory directions.
- Dynamic camera simulation: Piecewise-linear rotation of the camera orientation in yaw, pitch, and roll, with randomized drift per segment.
- Video rendering: 90° field-of-view, 512×512 px, extracted per trajectory from 360° equirectangular input.
- Binauralization: FOA channels ([W,X,Y,Z]) are dynamically rotated and then convolved with Omnitone FOA HRIRs to produce left/right binaural outputs.
- Quality filter: Channel-difference metric ensures spatial cue perceptibility (Zhang et al., 2 Dec 2025).
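For a pure yaw rotation, the rotate-then-binauralize step reduces to mixing the X and Y channels before per-channel HRIR convolution. A simplified sketch with random signals standing in for real audio and for the Omnitone HRIRs (which the actual pipeline supplies):

```python
import numpy as np

def rotate_foa_yaw(foa, yaw):
    """Rotate FOA channels [W, X, Y, Z] about the vertical axis.
    W (omni) and Z (vertical) are invariant under yaw."""
    w, x, y, z = foa
    xr = np.cos(yaw) * x - np.sin(yaw) * y
    yr = np.sin(yaw) * x + np.cos(yaw) * y
    return np.stack([w, xr, yr, z])

def binauralize(foa, hrirs_left, hrirs_right):
    """Convolve each FOA channel with its ear HRIR and sum per ear.
    hrirs_*: shape (4, taps); synthetic stand-ins here, not the
    real Omnitone filters."""
    left = sum(np.convolve(ch, h) for ch, h in zip(foa, hrirs_left))
    right = sum(np.convolve(ch, h) for ch, h in zip(foa, hrirs_right))
    return np.stack([left, right])

rng = np.random.default_rng(0)
foa = rng.standard_normal((4, 480))      # 10 ms at 48 kHz
hl = rng.standard_normal((4, 64)) * 0.1  # toy HRIRs (assumption)
hr = rng.standard_normal((4, 64)) * 0.1
rotated = rotate_foa_yaw(foa, yaw=np.pi / 4)
binaural = binauralize(rotated, hl, hr)
```

Rotating the Ambisonic field rather than the video is what lets a single 360° recording yield many distinct camera trajectories with consistent spatial audio.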
4. Data Formats, Splits, and Metadata
| Property | Balanced Audiovisual | Binaural Video–Audio |
|---|---|---|
| Clip format | 10s, log-Mel audio + sampled frames | 8s, 512×512 video + binaural .wav/.flac |
| Train/test split | Not explicitly fixed; overall stats | 94,845 train / 2,695 test (by video ID) |
| Labels/metadata | Category, audio/visual confidence | Captions, camera trajectory, audio params |
| Additional eval sets | – | MUSIC-21, FAIR-Play |
The Binaural BiAudio dataset includes detailed metadata per clip (original video ID, start time, camera trajectory, FOA-to-binaural parameters, generated captions). No separate validation splits are specified in either dataset, though researchers may reserve training subsets for this purpose.
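Since the published test split separates clips by source video ID, any held-out validation subset should be carved out at the video level as well, so clips from one recording never straddle splits. A hypothetical hash-based sketch (the clip-ID format is assumed, not documented):

```python
import hashlib

def split_by_video_id(clip_ids, val_fraction=0.05):
    """Assign clips to train/val by hashing the source video ID,
    so all clips from one video land in the same split.
    Clip IDs are assumed to look like '<video_id>_<start_time>'."""
    train, val = [], []
    n_buckets = 1000
    for clip in clip_ids:
        video_id = clip.rsplit("_", 1)[0]
        h = int(hashlib.sha1(video_id.encode()).hexdigest(), 16)
        bucket = h % n_buckets
        (val if bucket < val_fraction * n_buckets else train).append(clip)
    return train, val

# Toy corpus: 200 videos, three 8 s clips each.
clips = [f"vid{v}_{s}" for v in range(200) for s in (0, 8, 16)]
train, val = split_by_video_id(clips)
```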
5. Key Analytical Metrics and Experimental Results
BiAudio (Balanced Audiovisual)
- Dataset metric: Modality discrepancy $\delta$ per sample, with per-bin uniformity enforced over $[0, 1]$.
- Splits: Modality-preferred (audio- or visual-preferred, according to which modality attains the higher confidence) and modality-dominated (thresholded on $\delta$). Each split isolates samples whose confidence difference exceeds a specified threshold.
- Benchmarks: Audio-only, Visual-only, late-fusion, and state-of-the-art imbalance-aware baselines (OGM-GE, Grad-Blending, Greedy).
- Results: No evaluated multimodal approach outperforms the best unimodal classifier across all discrepancy regimes. Fusion is beneficial in mid-$\delta$ regions but detrimental at high $\delta$, where modality noise dominates. Performance per method by regime is given in the following table:
| Method | Overall | Audio-Preferred | Visual-Preferred |
|---|---|---|---|
| Audio only | 62.96 | 77.76 | 44.79 |
| Visual only | 49.82 | 36.79 | 67.58 |
| Baseline fusion | 70.47 | 78.62 | 59.37 |
| OGM-GE | 72.27 | 78.92 | 63.59 |
| Grad-Blending | 72.04 | 79.49 | 61.89 |
| Greedy | 72.01 | 72.46 | 71.41 |
- Modality-dominated results: As $\delta$ increases, unimodal classifiers more strongly dominate their respective preferred domains; fusion models degrade in high-discrepancy regimes (Xia et al., 2023).
BiAudio (Binaural Video–Audio)
- Statistical properties: Uniform coverage of camera orientation and 3D viewpoint trajectory space. 2,360 visible and 1,265 invisible sound events from automatic captions.
- Spatial cue strength: Clips filtered by a minimum left-right channel difference, ensuring the presence of perceptible spatial cues.
- No explicit classification or fusion results reported; the dataset is intended primarily for spatial audio generation benchmarks and subjective/objective quality assessment (Zhang et al., 2 Dec 2025).
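The spatial-cue filter can be approximated as an energy statistic on the left-right difference signal. A minimal sketch; the normalization and the threshold value are assumptions, not the published criterion:

```python
import numpy as np

def spatial_cue_strength(binaural):
    """Mean absolute difference between left and right channels,
    normalized by overall mean absolute level; 0 for effectively
    mono clips, larger when interaural cues are present."""
    left, right = binaural
    level = np.mean(np.abs(binaural)) + 1e-8
    return float(np.mean(np.abs(left - right)) / level)

def passes_filter(binaural, tau=0.1):
    # tau is an illustrative threshold, not the published value.
    return spatial_cue_strength(binaural) >= tau

t = np.linspace(0, 20, 4000)
mono = np.tile(np.sin(t), (2, 1))            # identical channels
panned = np.stack([mono[0], 0.2 * mono[1]])  # strong L/R imbalance
```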
6. Mathematical Formulations and Core Metrics
Balanced Audiovisual BiAudio:
- Modality confidence: $c_m = \max_k \mathrm{softmax}(f_m(x))_k$, the peak softmax probability of the pretrained unimodal classifier for modality $m \in \{a, v\}$.
- Discrepancy: $\delta = |c_a - c_v|$, enforced to be approximately uniform over $[0, 1]$.
Binaural Video–Audio BiAudio:
- Spherical harmonic spatial energy (first-order beam): $E(\theta, \phi) = \sum_t \big( W(t) + X(t)\cos\theta\cos\phi + Y(t)\cos\theta\sin\phi + Z(t)\sin\theta \big)^2$.
- Principal direction: $(\theta^{*}, \phi^{*}) = \arg\max_{\theta, \phi} E(\theta, \phi)$.
- Camera trajectory: $R(t)$, a piecewise-linear interpolation of yaw, pitch, and roll between randomized per-segment keyframes.
- Channel-difference spatial cue: $\Delta = \frac{1}{T} \sum_t |x_L(t) - x_R(t)|$, with clips retained only when $\Delta$ exceeds a threshold.
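The energy-map localization used for the FOA source analysis can be sketched by scanning a grid of directions with the standard first-order plane-wave beam; the axis convention and grid resolution below are assumptions, and the project's exact beamformer may differ:

```python
import numpy as np

def foa_energy_map(foa, n_az=72, n_el=36):
    """Energy of a first-order beam steered over an az/el grid.
    foa: [W, X, Y, Z] with X forward, Y left, Z up (assumed)."""
    w, x, y, z = foa
    az = np.linspace(-np.pi, np.pi, n_az, endpoint=False)
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el)
    energy = np.zeros((n_el, n_az))
    for i, e in enumerate(el):
        for j, a in enumerate(az):
            beam = (w + x * np.cos(e) * np.cos(a)
                      + y * np.cos(e) * np.sin(a)
                      + z * np.sin(e))
            energy[i, j] = np.sum(beam ** 2)
    return energy, az, el

def principal_direction(foa):
    energy, az, el = foa_energy_map(foa)
    i, j = np.unravel_index(np.argmax(energy), energy.shape)
    return az[j], el[i]

# Synthetic source on the +Y axis (azimuth 90 deg, elevation 0):
# encoding a plane wave from that direction gives W = s, Y = s.
s = np.random.default_rng(0).standard_normal(2048)
foa = np.stack([s, 0 * s, s, 0 * s])
az_hat, el_hat = principal_direction(foa)
```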
7. Licensing, Accessibility, and Practical Considerations
Both datasets draw substantially from web-mined video sources, including YouTube under YouTube TOS and Creative Commons material under CC BY 4.0. The Binaural BiAudio dataset prohibits commercial redistribution, with all derivative usage limited to academic/non-commercial research (“fair use”) (Zhang et al., 2 Dec 2025). Release is planned via the ViSAudio project website, with all associated metadata and code for spherical harmonic analysis and HRIR convolution.
Recommended research usage for the Binaural version includes:
- Loading synchronized video/audio pairs via ffmpeg or PyAV.
- Employing metadata to replicate the spatialization pipeline for controlled experiments.
- Adhering to matched audio sample rates and the channel-difference quality filter.
- For the Balanced Audiovisual version, maintaining control over $\delta$ distributions is recommended for developing and evaluating robust fusion methods.
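Before any heavier PyAV or ffmpeg decoding, a clip's binaural audio can be sanity-checked for stereo channel count and the expected sample rate with the standard-library `wave` module (the file name and the 48 kHz rate here are illustrative assumptions):

```python
import struct
import wave

def check_binaural_wav(path, expected_rate=48000):
    """Return (ok, n_channels, sample_rate); ok is True only for
    a stereo file at the expected sample rate."""
    with wave.open(path, "rb") as f:
        n_channels = f.getnchannels()
        rate = f.getframerate()
    return n_channels == 2 and rate == expected_rate, n_channels, rate

# Self-contained demo: write a tiny stereo file, then validate it.
with wave.open("demo_clip.wav", "wb") as f:
    f.setnchannels(2)
    f.setsampwidth(2)      # 16-bit PCM
    f.setframerate(48000)
    f.writeframes(struct.pack("<4h", 0, 100, 0, -100))  # 2 stereo frames

ok, ch, rate = check_binaural_wav("demo_clip.wav")
```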
8. Research Impact and Recommendations
The BiAudio datasets typify trends in multimodal benchmark construction. The Balanced Audiovisual BiAudio has exposed inherent limitations of popular fusion and imbalance-remedy methods, motivating development of $\delta$-adaptive models and promoting the evaluation of performance as a function of modality discrepancy (Xia et al., 2023). The Binaural BiAudio has provided the first large-scale, paired video–binaural corpus suited to training and evaluation of direct end-to-end spatial audio generative models, supporting experiments on spatio-temporal alignment and perceptual immersion (Zhang et al., 2 Dec 2025). A plausible implication is that both datasets will remain central to benchmarking fusion robustness and spatial generative capacity in their respective research subsectors.