Audio-Visual Semantic Segmentation
- Audio-Visual Semantic Segmentation is a cross-modal task that assigns semantic labels to sound-emitting objects at the pixel level using both audio and visual inputs.
- Recent advances employ encoder-decoder architectures with temporal and bidirectional fusion strategies to improve localization and semantic precision.
- Innovative methods leverage foundation models, unsupervised training, and open-vocabulary approaches to enhance domain generalization and tackle annotation scarcity.
Audio-Visual Semantic Segmentation (AVSS) is a dense prediction problem that requires spatially precise semantic labeling of sounding objects within video data, leveraging both audio and visual modalities. In AVSS, the model assigns a class label to each pixel corresponding to an object making sound, thus jointly addressing spatial localization and semantic understanding with audio-visual correspondence. The field has evolved rapidly since the introduction of pixel-level AVS/AVSS benchmarks and now encompasses diverse methodologies, cross-modal architectures, and applications that significantly extend beyond conventional image or video segmentation.
1. Task Definition and Benchmark Datasets
AVSS extends Audio-Visual Segmentation (AVS) by requiring both pixel-level localization and semantic labeling of all sounding objects in each video frame (Zhou et al., 2023). Given a video sequence of T frames together with its audio track, the objective is to produce a semantic mask M_t ∈ {0, 1, …, C}^{H×W} per frame, where C denotes the number of target sounding object classes and 0 marks background. This delineates AVSS from standard semantic segmentation (visual modality only), and from AVS/AVL (which may yield only localization masks or heatmaps).
Benchmarks:
- AVSBench-semantic (Zhou et al., 2023): The canonical dataset for AVSS, providing ~12k videos, 70 object categories, and pixel-wise masks annotated at 1 fps.
- AVSBench-object: Used for standard AVS, providing both single-source (S4) and multi-source (MS3) settings; semantic AVSS evaluations rely on the "semantic labels" subset.
- AVSBench-OV (Guo et al., 2024): Extends AVSBench-semantic to open-vocabulary evaluation with “base” and “novel” category splits.
Evaluation metrics are typically mean Intersection-over-Union (mIoU) and the F-score (with β² = 0.3), reported per class or averaged across classes.
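As a concrete reference, both metrics can be sketched in a few lines of numpy; benchmark implementations differ in detail (e.g., they typically accumulate intersections and unions over the whole test set rather than per image):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over semantic classes present in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))

def f_score(pred, gt, beta2=0.3):
    """F-measure with beta^2 = 0.3 on a binary sounding-object mask."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
```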
2. Core Methodological Advances
AVSS research has produced several architectural paradigms centered on cross-modal fusion and semantic reasoning.
2.1 Encoder-Decoder Architectures with Cross-Modal Fusion
A standard pipeline is to use deep visual (e.g., ResNet-50, Pyramid Vision Transformer, Swin-Tiny) and audio (e.g., VGGish, BEATs, HTSAT) encoders, followed by joint processing and a segmentation decoder (Zhou et al., 2023, Tian et al., 23 Dec 2025, Li et al., 2023). Notable fusion strategies include:
- Temporal Pixel-Wise Audio-Visual Interaction (TPAVI): Injects temporal pixel-wise audio-visual affinities as attention weights into the visual feature hierarchy (Zhou et al., 2023, Zhou et al., 2022).
- Product Quantization-based Semantic Decomposition (QDFormer): Decomposes multi-source audio into disentangled single-source semantics via product quantization and global-to-local knowledge distillation. This enables robust multi-object reasoning and a substantial gain on AVSS benchmarks (+21.2 mIoU over the prior state of the art in (Li et al., 2023)).
- Bidirectional Generation and Cycle Consistency: Enforces semantic alignment by requiring that segmentation masks enable reconstruction of the input audio features, and conversely that audio predicts visual features (bidirectional generation, (Hao et al., 2023)).
- Bilateral and Bidirectional Attention Mechanisms: E.g., COMBO’s Bilateral-Fusion Module executes bi-directional attention between fused visual/audio features; DDAVS’s Delayed Bidirectional Alignment injects cross-modal attention at late fusion stages to mitigate misalignment (Yang et al., 2023, Tian et al., 23 Dec 2025).
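To make the pixel-wise fusion idea concrete, here is a loose single-head sketch of TPAVI-style audio-visual attention; `w_a` is a hypothetical learned projection, and the actual module applies multi-head attention across the multi-scale visual feature hierarchy:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_visual_attention(visual, audio, w_a):
    """Sketch of pixel-wise audio-visual affinity injection.
    visual: (T, H, W, C) features; audio: (T, D) clip features;
    w_a: (D, C) hypothetical learned projection."""
    T, H, W, C = visual.shape
    a = audio @ w_a                          # (T, C): audio in visual channel space
    v = visual.reshape(T, H * W, C)          # flatten the spatial grid
    # (T, HW): scaled dot-product affinity between every pixel and the audio token
    attn = softmax((v @ a[:, :, None])[..., 0] / np.sqrt(C), axis=1)
    # residual injection of the audio token, weighted by per-pixel affinity
    fused = v + attn[:, :, None] * a[:, None, :]
    return fused.reshape(T, H, W, C)
```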
2.2 Instance-aware and Two-Stage Segmentation
Instance-level architectures decouple the localization of candidate object masks from the identification of the sounding source via audio (Liu et al., 2023, Liu et al., 2023). After potential objects are segmented, audio-visual semantic correlation (AVSC) or audio-visual semantic integration (AVIS) selects which objects are genuinely sounding, typically using audio category distributions inferred by a foundation audio classifier (BEATs, PANNs). This approach enhances performance in multi-source and noisy scenarios.
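The two-stage logic can be illustrated with a minimal sketch: candidate masks come from a visual segmenter, and an audio tagger's class distribution decides which candidates are genuinely sounding (names and thresholding are illustrative, not the published AVSC/AVIS formulation):

```python
import numpy as np

def select_sounding_instances(candidate_masks, mask_classes, audio_probs, tau=0.5):
    """candidate_masks: list of (H, W) boolean masks from a visual segmenter;
    mask_classes: predicted class index per mask;
    audio_probs: class distribution from an audio tagger (e.g., BEATs).
    Returns a semantic map where 0 = background; only candidates whose class
    the audio supports with probability >= tau survive."""
    out = np.zeros(candidate_masks[0].shape, dtype=int)
    for m, c in zip(candidate_masks, mask_classes):
        if audio_probs[c] >= tau:
            out[m] = c + 1  # shift by 1 so background stays 0
    return out
```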
2.3 Spatio-Temporal and Prompt-based AVSS
Modern approaches emphasize spatio-temporal modeling. For example, "Stepping Stones" (Ma et al., 2024) and Stepping Stone Plus (SSP) (Gao et al., 13 Jan 2026) decompose AVSS into explicit subtasks: (1) localization (via motion/optical flow) and (2) semantic labeling (via prompt-informed cross-modal reasoning). Optical flow-based pre-masks restrict the spatial scope for semantic analysis, and textual or open-vocabulary prompts (e.g., CLIP, CAPs) inject prior knowledge necessary for fine-grained class disambiguation and open-set recognition (Gao et al., 13 Jan 2026, Guo et al., 2024).
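A toy sketch of this localization-then-labeling decomposition, with an optical-flow magnitude pre-mask gating where semantic labels may appear (threshold and shapes are illustrative):

```python
import numpy as np

def flow_gated_segmentation(flow, semantic_logits, motion_thresh=1.0):
    """Sketch of a flow-based pre-mask restricting semantic labeling.
    flow: (H, W, 2) optical flow; semantic_logits: (H, W, C).
    Static pixels fall back to background (label 0)."""
    motion = np.linalg.norm(flow, axis=-1) > motion_thresh  # (H, W) pre-mask
    labels = semantic_logits.argmax(axis=-1) + 1            # 1-indexed classes
    return np.where(motion, labels, 0)
```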
2.4 Foundation Models and Annotation-Free Training
Recent methods leverage large-scale foundation models to overcome annotated data scarcity:
- SAM-based Frameworks: SAM-based AVSS (e.g., SAMA-AVS, AP-SAM, AV2T-SAM) use adapters or projection layers to integrate audio prompting. Notably, AV2T-SAM (Lee et al., 22 Feb 2025) maps both audio (via CLAP) and vision (via CLIP) to SAM’s text-prompt embedding space, achieving state-of-the-art alignment without fine-tuning the large backbone.
- Annotation-Free and Unsupervised Learning: Models like SAMA-AVS (Liu et al., 2023) utilize synthetic data generated by merging object segmentation datasets and audio event datasets, while MoCA (Bhosale et al., 2024) achieves unsupervised AVSS via feature matching using ImageBind, DINO, and SAM, employing contrastive learning and clustering.
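The mapping into a prompt-embedding space can be sketched as follows; the element-wise product mirrors the CLIP⊙CLAP fusion used by AV2T-SAM, while `w_proj` is a hypothetical stand-in for the real model's trained projection layers:

```python
import numpy as np

def audio_to_prompt_embedding(clap_embed, clip_embed, w_proj):
    """Sketch: fuse an audio embedding (CLAP) and a visual embedding (CLIP),
    both living in a shared semantic space, and project the result into the
    prompt-token space of a frozen segmentation backbone."""
    fused = clap_embed * clip_embed          # element-wise product fusion
    prompt = fused @ w_proj                  # map to prompt-token dimension
    return prompt / np.linalg.norm(prompt)   # normalized prompt token
```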
2.5 Open-Vocabulary Audio-Visual Semantic Segmentation
Open-vocabulary AVSS (Guo et al., 2024) requires segmenting and labeling sounding objects beyond the fixed set of training classes, leveraging vision-language models (CLIP) for category assignment. OV-AVSS introduces a universal sound source localization module and an open-vocabulary CLIP-based classification head, providing strong zero-shot generalization (e.g., 29.1% mIoU on 30 novel classes).
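The open-vocabulary classification step reduces to cosine similarity against text embeddings of an arbitrary class list, which is why novel classes only require new prompts. A minimal sketch (the real head also handles mask proposals and score calibration):

```python
import numpy as np

def open_vocab_classify(mask_embeds, text_embeds, temperature=0.07):
    """Sketch of a CLIP-style open-vocabulary head: cosine similarity between
    per-mask embeddings (N, D) and text embeddings of K class prompts (K, D),
    softmaxed into per-mask class probabilities (N, K)."""
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = (m @ t.T) / temperature
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```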
3. Key Algorithmic Mechanisms and Training Strategies
| Mechanism | Description | References |
|---|---|---|
| TPAVI | Temporal pixel-wise audio-visual attention for feature-aligned fusion | (Zhou et al., 2023) |
| PQ-based Semantic Decomp. | Product quantization splits entangled audio into atomic tokens for each source | (Li et al., 2023) |
| Bidirectional Generation | Cycle-consistent visual-audio projection to enforce semantic information in the mask | (Hao et al., 2023) |
| Instance-aware AVSC/AVIS | Candidate objects localized visually, then audio selects true sound sources using classifiers | (Liu et al., 2023, Liu et al., 2023) |
| Maskige Priors & SEM | Foundation model-generated priors refine pixel features via Siam-Encoder | (Yang et al., 2023) |
| Prompt-based Labeling | Dual/CLIP-based textual prompts as priors for semantic context and sound category | (Gao et al., 13 Jan 2026, Lee et al., 22 Feb 2025) |
| Stepping Stones Training | Sequential optimization: first localization, then semantic labeling with mask-guided attention | (Ma et al., 2024, Gao et al., 13 Jan 2026) |
| Cross-modal Contrastive | Aligns distributions using InfoNCE, Gaussian, or Wasserstein distances | (Zha et al., 28 Jul 2025) |
| Adapter-based SAM fusion | Small trainable modules inject audio into frozen foundation segmentation models | (Lee et al., 22 Feb 2025, Liu et al., 2023) |
| Unsupervised Matching | KNN-driven contrastive objective aligns audio-visual representations without labels | (Bhosale et al., 2024) |
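The cross-modal contrastive mechanism listed above can be illustrated with a minimal one-directional InfoNCE sketch over a batch of matched audio-visual embedding pairs (published variants also use Gaussian or Wasserstein distances):

```python
import numpy as np

def info_nce(audio_embeds, visual_embeds, temperature=0.1):
    """Cross-modal InfoNCE sketch: matched audio-visual pairs along the
    diagonal are positives; all other pairings in the batch are negatives.
    Inputs are (B, D) embedding batches."""
    a = audio_embeds / np.linalg.norm(audio_embeds, axis=-1, keepdims=True)
    v = visual_embeds / np.linalg.norm(visual_embeds, axis=-1, keepdims=True)
    logits = (a @ v.T) / temperature                  # (B, B) similarity matrix
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))        # NLL of matched pairs
```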
Notable Losses:
- Segmentation Loss: Categorical cross-entropy for semantic masks; binary cross-entropy and Dice for sounding masks.
- Audio-Visual Mapping/Contrastive Losses: KL divergence, contrastive InfoNCE, cross-modal similarity or feature matching (Zhou et al., 2023, Zha et al., 28 Jul 2025).
- Quantization and Commitment Losses: Encourage alignment to a learned codebook for decomposed semantic tokens (Li et al., 2023).
- Adaptive Inter-Frame Consistency: Penalizes temporally inconsistent predictions, weighted by predicted similarity (Yang et al., 2023).
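A compact sketch of a typical combined objective, per-pixel cross-entropy on the semantic mask plus soft Dice on the binary sounding mask; the weighting and exact set of terms vary per paper:

```python
import numpy as np

def dice_loss(pred_prob, gt, eps=1.0):
    """Soft Dice on a binary sounding-object mask; pred_prob in [0, 1]."""
    inter = (pred_prob * gt).sum()
    return 1.0 - (2 * inter + eps) / (pred_prob.sum() + gt.sum() + eps)

def avss_loss(sem_logits, sem_gt, bin_prob, bin_gt, lam=1.0):
    """Sketch of a combined AVSS objective.
    sem_logits: (H, W, C) class logits; sem_gt: (H, W) integer labels;
    bin_prob/bin_gt: (H, W) binary sounding mask; lam balances the terms."""
    logits = sem_logits - sem_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    ce = -np.take_along_axis(log_probs, sem_gt[..., None], axis=-1).mean()
    return ce + lam * dice_loss(bin_prob, bin_gt)
```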
4. Quantitative Benchmarks and Comparative Performance
Empirical advances have been rigorously benchmarked on AVS/AVSS testbeds. Recent state-of-the-art results (PVT/ResNet backbones unless otherwise noted):
| Method | Architecture/fusion | S4 mIoU | MS3 mIoU | AVSS mIoU | Notes |
|---|---|---|---|---|---|
| TPAVI (Zhou et al., 2023) | Pixel-wise temporal attention | 78.7 | 54.0 | 29.8 | Baseline |
| AVSegFormer (Wang et al., 2024) | Efficient transformer/ELF | 79.9 | 57.9 | 31.2 | Real-time, speed/accuracy Pareto |
| QDFormer (Li et al., 2023) | PQ-based semantic decomp. | 81.8 | 61.6 | 46.6 | +21pt over prior AVSS SOTA |
| COMBO (Yang et al., 2023) | Pixel/modal/temporal entanglement | 84.7 | 59.2 | 42.1 | Multi-order bilateral fusion |
| SAMA-AVS (Liu et al., 2023) | SAM+adapters, synthetic data | 81.5 | 63.1 | – | Data-efficient (annotation-free) |
| AV2T-SAM (Lee et al., 22 Feb 2025) | CLIP⊙CLAP fusion in SAM decoder | 86.7 | 69.7 | – | New SOTA with frozen foundation model |
| DDAVS (Tian et al., 23 Dec 2025) | Disentanglement/delayed fusion | 92.4 (JF) | 75.1 (JF) | 52.6 (JF) | Prototype bank, bidir. attention |
| Open-Vocab (Guo et al., 2024) | Bi-attn CLIP/AudioMaskDec | 55.4 | 29.1 | – | Zero-shot on novel classes |
The table shows that innovations in fusion, semantic disentanglement, and cross-modal prompting deliver consistent gains over prior art, with QDFormer and DDAVS setting new high-water marks. Open-vocabulary settings, previously unaddressed, now reach >29% mIoU on unseen classes.
Ablation studies across these works reveal the necessity of sophisticated cross-modal fusion, explicit disentanglement in the presence of multiple sound sources, and semantic prior injection (through textual prompts or foundation models). Data-efficient training, real-time inference, and open-set recognition remain open research thrusts.
5. Architectural Innovations and Open Challenges
Modern AVSS systems integrate multiple technical advances:
- Foundation Model Integration: Leveraging CLIP, SAM, or text-prompted segmentation architectures yields robust semantic priors even with minimal AVSS-specific training data (Lee et al., 22 Feb 2025).
- Semantic Prompting and Open-Vocabulary Segmentation: Use of large vision-language models enables not only better in-distribution segmentation but also strong zero-shot performance (Guo et al., 2024).
- Product Quantization and Semantic Disentanglement: Decomposition of entangled representations is critical for accurate multi-source segmentation (Li et al., 2023, Tian et al., 23 Dec 2025).
- Cycle Consistent and Bidirectional Generation: Visual→audio generation constraints enforce semantic coupling beyond one-way attention (Hao et al., 2023).
- Progressive and Modular Training Regimes: Decomposing localization and semantic understanding (SSP/Stepping Stones) prevents conflicting optimization signals and yields better final segmentation (Gao et al., 13 Jan 2026, Ma et al., 2024).
Challenges:
- Temporal Modeling: Most systems remain frame-centric; fully leveraging temporal dependencies (beyond motion cues) for AVSS remains a relatively unexplored avenue (Li et al., 2023, Yang et al., 2023).
- Domain Generalization: Annotation-free and unsupervised AVSS show promise, but domain shift from synthetic to real is non-trivial (Liu et al., 2023, Bhosale et al., 2024).
- Fine-Grained and Instance-Level Segmentation: Most benchmarks target category-level masks; distinguishing instances of the same class sounding simultaneously is an open problem (Liu et al., 2023).
6. Applications and Future Research Directions
AVSS enables applications spanning multimodal scene understanding, robotics, AR, surveillance, open-domain video retrieval, and video editing. The paradigm's focus on true audio-visual correspondence with semantic specificity uniquely positions it for tasks in active perception and embodied AI.
Emerging research directions include:
- Lifelong and Continual Learning for AVSS: Prototype banks and semantic query adaptation may allow for online or few-shot expansion to new classes (Tian et al., 23 Dec 2025).
- Enhanced Temporal Reasoning: Incorporating video transformers or temporal diffusion models to capture long-range context and event dynamics.
- Audio-Visual Instance Segmentation and Open-Set/Zero-Shot Learning: Enabling fine-grained, instance-level masks and robust generalization to unseen objects/sounds (Guo et al., 2024).
- Scalable Annotation-free Training: Large-scale foundation model pretraining and weakly- or un-supervised domain adaptation (Liu et al., 2023, Bhosale et al., 2024).
- Real-Time and Edge-Deployable AVSS: Efficient architectures (e.g., AVESFormer (Wang et al., 2024)) and lightweight backbones address speed and resource constraints.
In conclusion, Audio-Visual Semantic Segmentation has become a key research challenge at the intersection of computer vision, audio signal processing, and multimodal representation learning. The field continues to advance rapidly as new techniques in foundation model adaptation, semantic disentanglement, and cross-modal learning are integrated and benchmarked at scale.