Audio-Visual Foundation Models
- Audio-Visual Foundation Models are large-scale neural architectures that jointly process auditory and visual inputs, enabling comprehensive scene understanding and cross-modal retrieval.
- They employ shared-backbone representations, dual-stream transformers, and prompt-based modular fusion to achieve precise synchronization, alignment, and task-specific performance.
- Training strategies leverage vast paired data and LLM-guided curation, yielding data-efficient multimodal learning and competitive results on metrics such as top-1 classification accuracy and mIoU.
Audio-visual foundation models are large-scale neural architectures that jointly model both auditory and visual modalities to enable comprehensive scene understanding, cross-modal generation, retrieval, and downstream multimodal reasoning. These models integrate advances in self-supervised and generative learning, leveraging vast paired data and, increasingly, exploiting the capabilities of LLMs for data curation, modular fusion, and open-vocabulary reasoning. Research in this domain encompasses the development of unified representation spaces, joint generation pipelines, segmentation/localization, and evaluation protocols tailored for the diverse and often asynchronous nature of audio and video signals.
1. Core Architectural Paradigms for Audio-Visual Foundation Models
The architectural landscape in audio-visual foundation models spans shared-backbone representation learning, dual-stream cross-attentional generative frameworks, prompt-based modular fusion, and lightweight connector schemes.
The Unified Audio-Visual Model (UAVM) exemplifies a general approach: modality-specific frozen encoders (e.g., ConvNeXt-Base on log-mel spectrograms or RGB frames) feed into a shared transformer backbone. Parameter sharing is modulated by depth and embedding dimensionality, facilitating an interpolation between modality-independent and fully unified representations. This structure allows both modality-agnostic classification and, through mean-pooled shared features, enables retrieval via cosine similarity. Notably, the UAVM achieves top-1 event classification accuracy of 65.8% on VGGSound (Gong et al., 2022).
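The shared-backbone idea can be sketched in a few lines. This is a minimal illustration, not UAVM's actual code: the linear projections stand in for the frozen modality-specific encoders, the module and dimension choices are assumptions, and retrieval is shown as cosine similarity over mean-pooled, unit-normalized shared features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal UAVM-style sketch: modality-specific front-ends feed one *shared*
# transformer backbone; mean-pooled shared features support retrieval.
class UnifiedBackbone(nn.Module):
    def __init__(self, dim=256, depth=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=depth)
        self.audio_proj = nn.Linear(128, dim)   # stands in for a frozen audio encoder
        self.video_proj = nn.Linear(512, dim)   # stands in for a frozen video encoder

    def embed(self, tokens, modality):
        proj = self.audio_proj if modality == "audio" else self.video_proj
        x = self.shared(proj(tokens))              # same weights for both modalities
        return F.normalize(x.mean(dim=1), dim=-1)  # mean-pool, unit-normalize

model = UnifiedBackbone()
a = model.embed(torch.randn(4, 10, 128), "audio")  # (batch, time, spectrogram feats)
v = model.embed(torch.randn(4, 10, 512), "video")  # (batch, frames, visual feats)
sim = a @ v.T                                      # cosine-similarity retrieval matrix
```

Varying `depth` of the shared stack relative to the modality-specific front-ends is what interpolates between modality-independent and fully unified representations.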
Joint generative models such as Seedance 1.5 pro and LTX-2 employ parallel audio and video branches—implemented with diffusion transformers—and couple them via cross-modal attention or adaptive layer normalization (AdaLN) at each layer. The video branch typically employs a deeper, wider transformer to accommodate the higher spatial-temporal complexity, while the audio branch models the 1D temporal signal. Each branch receives information from both the multilingual text encoder and the other modality via cross-modal mechanisms, which is crucial for fine-grained synchronization and narrative alignment (Chen et al., 15 Dec 2025, HaCohen et al., 6 Jan 2026). In both systems, no explicit multimodal contrastive loss is used during joint training; alignment is an emergent property of architecture and conditioning.
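The per-layer cross-modal coupling can be illustrated with a toy block. This is a hedged sketch of the general pattern (each branch self-attends, then cross-attends into the other branch's tokens), not the actual Seedance or LTX-2 layers; all names and sizes are illustrative, and the AdaLN/text-conditioning pathways are omitted.

```python
import torch
import torch.nn as nn

# Illustrative dual-stream block: within-modality self-attention followed by
# cross-attention that reads the other modality's token stream.
class CrossModalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, other):
        h = self.norm1(x + self.self_attn(x, x, x)[0])              # mix own tokens
        return self.norm2(h + self.cross_attn(h, other, other)[0])  # read other stream

video_block, audio_block = CrossModalBlock(), CrossModalBlock()
v = torch.randn(2, 16, 64)   # video tokens (batch, patches x frames, dim)
a = torch.randn(2, 50, 64)   # audio tokens (batch, time steps, dim)
v_out, a_out = video_block(v, a), audio_block(a, v)  # symmetric coupling per layer
```

Because this coupling happens at every layer, each branch continuously conditions on the other's evolving state, which is what makes fine-grained synchronization emerge without an explicit contrastive loss.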
Prompt-based modularity arises in segmentation and localization, where sound-derived prompts steer visual foundation models such as SAM or Mask2Former. The encoder-prompt-decoder paradigm uses semantic-aware audio prompts (SAP), injected into the mask token stream, enabling the visual decoder to focus on probable sound sources while only updating lightweight adapters (ColA) and keeping the large detection backbone frozen. This modularity yields better zero-shot and cross-dataset generalization compared to early fusion strategies (Wang et al., 2023).
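The prompt-injection mechanics reduce to a small trainable adapter feeding a frozen decoder. The following is a hedged sketch of that idea only; the module names, token counts, and dimensions are assumptions for illustration, not GAVS's actual interfaces.

```python
import torch
import torch.nn as nn

# Encoder-prompt-decoder sketch: a lightweight adapter maps a frozen audio
# encoder's embedding into prompt tokens that join the visual decoder's
# (frozen) mask/query tokens; only the adapter would be trained.
dim, n_prompts = 256, 4
audio_to_prompt = nn.Linear(512, n_prompts * dim)   # trainable lightweight adapter
mask_tokens = torch.randn(1, 8, dim)                # frozen decoder's query tokens

audio_emb = torch.randn(1, 512)                     # from a frozen audio encoder
prompts = audio_to_prompt(audio_emb).view(1, n_prompts, dim)
queries = torch.cat([prompts, mask_tokens], dim=1)  # semantic-aware audio prompts
# `queries` would now steer the frozen visual decoder toward probable sound sources.
```

Keeping the detection backbone frozen and training only adapters of this size is what makes the scheme cheap to retarget across datasets.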
For cross-modal generation (e.g., vision-to-audio), lightweight mappers translate between the latent spaces of frozen vision and audio FMs (CLIP ↔ CLAP or CAVP/TimeChat fusion to AudioMAE), as in V2A-Mapper and MFM-Mapper. Both regression-based and diffusion-based mapping mechanisms are explored. The latter supports one-to-many mappings, boosting output variability and fidelity (Wang et al., 2023, Chen et al., 5 Sep 2025).
2. Data Curation, Training Strategies, and Efficiency
Audio-visual foundation models rely critically on the quality and quantity of paired multimodal data. AVVA formalizes a data-efficient approach by using an LLM-based Multimodal Reasoning Engine (MRE) for aggressive yet nuanced curation. Each candidate 3s segment is scored on multidimensional alignment aspects (temporal, spatial, contextual, causality, and source visibility) using combinations of LLaMA-based audio/video LLMs and a specialized alignment judge (Mistral7B). Clips above a threshold (e.g., ≥6.2 or 7.6 on a 0–10 scale) are retained, filtering out poorly aligned or off-screen content. This enables a 30× reduction in training data: AVVA matches or exceeds state-of-the-art retrieval benchmarks with only 192 curated hours versus ImageBind’s 5800+ hours (Vosoughi et al., 12 Mar 2025).
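The curation step itself is a thresholded filter over per-aspect LLM scores. The sketch below shows only that filtering logic; the field names and the mean-aggregation rule are assumptions for illustration, not AVVA's exact scoring protocol.

```python
# Each candidate 3s segment carries 0-10 alignment scores from the LLM judges
# across the five aspects named in the text; low-scoring clips are dropped.
ASPECTS = ("temporal", "spatial", "contextual", "causality", "source_visibility")

def keep_clip(scores: dict, threshold: float = 6.2) -> bool:
    """Retain a segment only if its mean alignment score clears the threshold."""
    mean_score = sum(scores[a] for a in ASPECTS) / len(ASPECTS)
    return mean_score >= threshold

clips = [
    {"temporal": 8, "spatial": 7, "contextual": 9, "causality": 8, "source_visibility": 9},
    {"temporal": 3, "spatial": 2, "contextual": 5, "causality": 4, "source_visibility": 1},
]
curated = [c for c in clips if keep_clip(c)]  # second clip (off-screen source) is dropped
```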
Contrastive learning objectives remain central for bimodal representation learning. AVVA adapts the InfoNCE/CLIP objective to audio–video pairs; only projection MLPs (~10M parameters) and the cross-attention block are trained, while encoders are frozen (Vosoughi et al., 12 Mar 2025). BAVS, targeting segmentation, integrates vision-only and audio-only priors in a bootstrapped, late-fusion architecture, explicitly handling label shift and noisy tags via the Silent Object-Aware Objective (SOAO) (Liu et al., 2023).
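The adapted objective is the standard symmetric InfoNCE over matched audio-video pairs. A minimal sketch, assuming CLIP-style in-batch negatives and an illustrative temperature; the projection heads that would precede this are omitted.

```python
import torch
import torch.nn.functional as F

# Symmetric InfoNCE: the i-th audio clip and i-th video clip are positives,
# all other in-batch pairings are negatives.
def av_infonce(audio_emb, video_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0))     # diagonal entries are the matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = av_infonce(torch.randn(8, 256), torch.randn(8, 256))
```

With frozen encoders, gradients from this loss reach only the small projection MLPs and cross-attention block, which is what keeps the trainable footprint near ~10M parameters.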
Generative models combine large-scale pretraining (unsupervised and supervised) with domain-targeted fine-tuning. Seedance 1.5 pro, after multi-task pretraining, uses supervised fine-tuning (SFT) for audio-video alignment and then applies RLHF with multidimensional reward signals (aesthetics, motion, synchronization) for further optimization (Chen et al., 15 Dec 2025). Knowledge distillation from large speech FMs (e.g., WavLM, iFLYTEK-speech) to multimodal students is achieved via representational regression and KLD on cluster labels, often leveraging multi-teacher ensembles, masking, and modality dropout during pretraining (Zhang et al., 9 Feb 2025).
Pseudo label–guided self-training (e.g., OpenAVS-ST) extends the impact of foundation models in low-resource settings. OpenAVS, a purely training-free pipeline, generates open-vocabulary segmentation masks by captioning the audio, translating captions to object names with an LLM, and grounding these names in images via models like Grounded-SAM. The resulting pseudo-masks can drive downstream supervised AVS training for significant gains in mIoU and F-score (+3–8 points) (Chen et al., 30 Apr 2025).
3. Audio-Visual Synchronization, Alignment, and Evaluation
Precise temporal and semantic alignment between modalities is both a requirement and an evaluation target for audio-visual foundation models.
Generative frameworks impose explicit synchronization objectives. Seedance 1.5 pro applies a SyncNet-based loss that minimizes the distance between aligned video and audio embeddings, yielding lip-sync frame errors as low as 0.05 frames (from 0.20 in earlier versions). Phoneme-to-frame forced alignment further enables fine dialectal and prosodic control (Chen et al., 15 Dec 2025). LTX-2 relies on cross-modal AdaLN and shared timestep conditioning across its video and audio transformer stacks, calibrated with modality-aware CFG at inference to maintain synchrony and prompt-relevance even in long (∼20 s) joint generations (HaCohen et al., 6 Jan 2026).
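A SyncNet-style objective is, at its core, a contrastive distance loss: embeddings of temporally aligned audio/video windows are pulled together, offset windows pushed apart. The sketch below shows that shape only; the margin value and the windowing are assumptions, not the papers' exact loss.

```python
import torch
import torch.nn.functional as F

# Contrastive sync loss: aligned pairs should have small embedding distance,
# temporally shifted (negative) pairs should exceed a margin.
def sync_loss(v_emb, a_emb_aligned, a_emb_shifted, margin=1.0):
    pos = F.pairwise_distance(v_emb, a_emb_aligned)   # aligned windows
    neg = F.pairwise_distance(v_emb, a_emb_shifted)   # offset windows
    return (pos.pow(2) + F.relu(margin - neg).pow(2)).mean()

v = torch.randn(4, 128)
loss = sync_loss(v, v + 0.01 * torch.randn(4, 128), torch.randn(4, 128))
```

Minimizing such a loss during generation training penalizes audio that drifts off the visual timeline, which is what drives frame-level lip-sync errors down.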
In segmentation, robust alignment is mediated by text-bridged mechanisms; TAViS uses joint audio–visual–text spaces (ImageBind plus SAM2). Prototype anchoring in text space (pseudo-text) bridges fine-grained object queries to shared semantic concepts, enforced with audio-to-text and image-to-text cross-entropy objectives (Luo et al., 13 Jun 2025).
Dedicated metrics and datasets have been developed for rigorous benchmarking. VGGSounder provides multi-label, modality-aware annotations for more precise dissection of model performance, introducing a modality confusion metric to quantify the proportion of samples solvable unimodally but "forgotten" in fusion. Empirically, non-negligible confusion rates (4–15%) persist across both classifier and foundation models, revealing imperfect fusion and common modality biases (audio or vision dominance depending on training) (Zverev et al., 11 Aug 2025).
4. Specialized Audio-Visual Tasks: Generation, Segmentation, and Multimodal Language Modeling
Audio-visual foundation models serve as the backbone for a spectrum of applications beyond generic classification or retrieval.
Joint generation models (e.g., Seedance 1.5 pro, LTX-2) synthesize temporally and semantically coherent video and audio in a single system. Capabilities include precise lip-syncing, background and foley sound generation, dynamic camera movement control (through learned trajectory embeddings), and narrative consistency across multi-shot stories. For LTX-2, bidirectional cross-attention and adaptive layer-norm ensure that audio tracks, including speech, music, and environmental sound, correspond both globally and locally to the visual stream (Chen et al., 15 Dec 2025, HaCohen et al., 6 Jan 2026).
Segmentation and localization have advanced through encoder-prompt-decoder and text-bridged designs. GAVS prompts large VFMs with semantic-aware audio features for object segmentation, yielding superior zero- and few-shot generalization (e.g., cross-dataset mIoU improvements on VGG-SS, +4–12 points over fusion baselines) (Wang et al., 2023). BAVS incorporates hierarchical audio-visual ontologies and late fusion to robustly separate authentic-sounding from silent or noisy objects, with minimal degradation under audio interference (Liu et al., 2023). OpenAVS and TAViS exploit text intermediates to bridge model spaces and support open-vocabulary or semantic segmentation with minimal supervision (Chen et al., 30 Apr 2025, Luo et al., 13 Jun 2025).
Audio-visual LLMs have begun to address expressive speech generation. AVLM demonstrates that visual representations of facial expression, when fused via prefix-based cross-attention, reduce perplexity and improve downstream emotion recognition and dialog generation F1 by +5 over speech-only models. Fusion strategies leveraging 3D mesh-based features (SMIRK) and Q-Former infill/prefix mechanisms optimize this joint modeling (Tan et al., 22 Aug 2025).
5. Cross-Modal Mapping and Lightweight Connector Mechanisms
Many state-of-the-art approaches exploit off-the-shelf foundation models for each modality and only train lightweight connector modules ("mappers"). V2A-Mapper and MFM-Mapper exemplify this paradigm for vision-to-audio generation: CLIP (vision) and CLAP/AudioMAE (audio) embeddings are aligned via regression or generative (diffusion) mapping. The generative mapping supports one-to-many correspondences, enhancing output fidelity (53% lower FD on VGGSound than Im2Wav) and variability, while regression mapping slightly improves pairwise relevance. MFM-Mapper fuses dual visual FMs and utilizes an autoregressively fine-tuned GPT-2 as the mapper, achieving +15.5% gain on IB-score for semantic alignment and a 12.4% reduction in temporal desynchronization relative to prior mapping methods, despite using only 16% of the training data (Wang et al., 2023, Chen et al., 5 Sep 2025).
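The regression-style variant of such a mapper is just a small network between two frozen latent spaces. A minimal sketch, assuming 512-d embeddings on both sides (as in the public CLIP/CLAP checkpoints) and an illustrative two-layer MLP; the actual mapper architectures differ.

```python
import torch
import torch.nn as nn

# Lightweight mapper: translate frozen CLIP image embeddings into the CLAP
# latent space, which then conditions a frozen audio generator downstream.
mapper = nn.Sequential(
    nn.Linear(512, 1024), nn.GELU(),
    nn.Linear(1024, 512),            # output lives in the audio FM's latent space
)
clip_emb = torch.randn(4, 512)       # stand-in for frozen CLIP image features
clap_like = mapper(clip_emb)         # conditioning vectors for audio generation
```

Only the mapper's parameters are trained, which is the source of the large reduction in trainable parameters relative to end-to-end vision-to-audio baselines; swapping the regression MLP for a conditional diffusion model gives the one-to-many variant.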
Such schemes highlight the value of parameter efficiency (86% fewer trainable parameters than end-to-end V2A baselines), modular upgradeability, and rapid adaptation to new domains and modalities. Mapping strategies are increasingly sophisticated, exploring Transformer-based, diffusion, and hybrid architectures for balancing trade-offs between relevance, distributional fidelity, latency, and domain generalization.
6. Evaluation Methodologies and Benchmarking
Rigorous evaluation of audio-visual foundation models has necessitated the creation of new datasets and metrics capable of probing cross-modal understanding, alignment, and robustness.
VGGSounder sets a new standard by (i) providing multi-label annotations (∼40% of clips have ≥3 labels), (ii) labeling class modality (audible, visible, both), and (iii) including meta-labels for confounds (background music, static images). Its modality confusion metric directly quantifies the negative impact of modality fusion, exposing that adding video or audio can degrade performance on samples that unimodal models solved (Zverev et al., 11 Aug 2025).
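The core of such a confusion measure can be illustrated as counting samples a unimodal pass solves that the fused pass gets wrong. This is a sketch of the idea only; VGGSounder's exact definition and normalization may differ.

```python
# Fraction of samples solved unimodally but "forgotten" after fusion.
def confusion_rate(unimodal_correct, fused_correct):
    assert len(unimodal_correct) == len(fused_correct)
    forgotten = sum(u and not f for u, f in zip(unimodal_correct, fused_correct))
    return forgotten / len(unimodal_correct)

# audio-only solves samples 1, 2, 4; fusion solves 1, 3, 4 -> sample 2 is forgotten
rate = confusion_rate([True, True, False, True], [True, False, True, True])  # 0.25
```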
Other tasks rely on established but specialized metrics: mIoU and F-score for segmentation (e.g., AVS-Benchmarks), SyncNet-based synchronization scores for generation, WER for AVSR, and distributional similarity metrics (Fréchet Distance, Inception Score, KL divergences) for generative evaluation. Open benchmarks and reference codebases (e.g., VGGSounder, AVSBench) are central to reproducibility and head-to-head comparison.
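The Fréchet Distance used for generative evaluation fits two Gaussians to real and generated embedding sets and compares them in closed form. Below is a standard minimal implementation using the symmetrized square-root form for numerical stability; it is generic, not tied to any particular benchmark's feature extractor.

```python
import numpy as np

def psd_sqrt(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def frechet_distance(x, y):
    """FD between Gaussian fits of two embedding sets (rows = samples)."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    s1_half = psd_sqrt(s1)
    covmean = psd_sqrt(s1_half @ s2 @ s1_half)  # equals sqrtm(s1 @ s2) in trace
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
fd_same = frechet_distance(x, x)      # ~0 for identical embedding sets
```

Lower FD indicates the generated audio's feature distribution is closer to the reference distribution; the same formula underlies FID for images and FAD for audio, differing only in the embedding network used.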
7. Limitations, Open Problems, and Future Directions
Current audio-visual foundation models demonstrate impressive cross-modal capabilities but face challenges in bias, scale, efficiency, and generalization.
Unimodal dominance remains a concern: vision-heavy training can "distract" or suppress audio performance (or vice versa), revealed both in quantitative confusion rates and in qualitative case studies (Zverev et al., 11 Aug 2025). Large-scale data curation (via LLMs or annotator pipelines) introduces computational overheads; research into classifier distillation and heuristic filtering is ongoing for scalable deployment (Vosoughi et al., 12 Mar 2025). Cross-modal alignment in highly diverse or noisy domains (multi-speaker, overlapping sources, highly non-natural sounds) remains open; designing prompts and connector modules for such cases is an active area (Wang et al., 2023, Liu et al., 2023).
Emerging research is focused on extending these models to (i) hierarchically longer or more complex narratives (HaCohen et al., 6 Jan 2026), (ii) finer affective and semantic control in language modeling and dialog (Tan et al., 22 Aug 2025), (iii) more robust open-vocabulary, zero/few-shot learning (Chen et al., 30 Apr 2025, Luo et al., 13 Jun 2025), and (iv) deeper integration of world models for reasoning and grounded fact-checking. Modality-aware, multi-task curricula and hybrid connector architectures are being explored to further unify the representation and generation of speech, vision, and environmental sound.