Audio-Visual Single-Mic Speech Separation
- Audio-visual methods leverage visual cues, such as lip and face movements, to resolve permutation ambiguities in single-microphone audio mixtures.
- It fuses time-domain and spectrogram-based audio features with visual embeddings through concatenation, attention, and recursive processes to achieve SI-SDR improvements up to 19.4 dB.
- Advanced approaches incorporate generative diffusion priors and cross-modal losses, enabling robust unsupervised training and improved downstream speech recognition in challenging acoustic environments.
Audio-visual single-microphone speech separation refers to the task of recovering the constituent speech signals of multiple speakers (and potentially background noise) from a single-microphone audio mixture, using synchronized video streams—typically of the speakers’ faces or lips—to guide the separation process. Visual signals are leveraged to provide noise-immune and speaker-disambiguating cues that complement the ambiguous, permutation-prone single-channel audio, allowing separation models to robustly isolate each speaker’s voice in challenging multi-talker and noisy environments.
1. Problem Formulation and Motivation
Let $y(t)$ denote the observed single-microphone mixture of $K$ speakers:

$$y(t) = \sum_{i=1}^{K} s_i(t) + n(t) + v(t),$$

where $s_i(t)$ is the clean speech waveform of speaker $i$, $n(t)$ is structured background noise, and $v(t)$ is a small white Gaussian noise term. Synchronized visual streams (usually lip or face crops per frame for each on-screen speaker) are assumed available.
The objective is the estimation of $\hat{s}_i(t)$ and, when modeled, $\hat{n}(t)$, such that the output signals are both perceptually intact (high SI-SDR, STOI, PESQ, etc.) and correctly associated with their corresponding on-screen speakers. Visual cues help resolve the permutation ambiguity, distinguish overlapping utterances of similar timbre or content, and improve intelligibility under low-SNR conditions (Ephrat et al., 2018, Wang et al., 2022, Makishima et al., 2021).
This task is foundational for far-field speech recognition, diarization, meeting transcription, assistive listening, and scene analysis.
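The mixture model above can be made concrete with a toy sketch. This is purely illustrative: sine tones stand in for clean speech, and all signal parameters are assumptions, not values from any cited system.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr  # one second of samples

# Two synthetic "speakers" (sine tones stand in for clean speech s_i(t))
s1 = 0.5 * np.sin(2 * np.pi * 220 * t)
s2 = 0.5 * np.sin(2 * np.pi * 330 * t)

# Structured background noise n(t) (e.g., mains hum) and a small
# white Gaussian term v(t)
n = 0.1 * np.sin(2 * np.pi * 50 * t)
v = 0.01 * rng.standard_normal(t.shape)

# Observed single-microphone mixture y(t) = sum_i s_i(t) + n(t) + v(t)
y = s1 + s2 + n + v
print(y.shape)
```

The separation task is to recover estimates of `s1` and `s2` (and optionally `n`) from `y` alone, guided by the per-speaker video streams.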
2. Audio-Visual Fusion Paradigms and Feature Extraction
Audio Feature Pipelines
The dominant approach encodes the mixture in either the time-frequency (TF) domain, via STFT (yielding complex spectrograms, with real+imaginary or power-law compressed magnitude features), or the time domain using learned 1D convolutional front-ends (as in Conv-TasNet backbones) (Ephrat et al., 2018, Pegg et al., 2024, Zhang et al., 2020). TF-domain systems exploit the spectral structure but may suffer from phase reconstruction errors; time-domain systems avoid this but require highly expressive networks.
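A minimal NumPy sketch of such a TF-domain front-end is given below. The framing parameters and the power-law exponent (0.3) are common choices but are assumptions here, not settings from any specific cited system.

```python
import numpy as np

def stft_features(y, n_fft=512, hop=128, power=0.3):
    """Frame a waveform, window it, and return the power-law
    compressed STFT magnitude plus the phase."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[i * hop : i * hop + n_fft] * win for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=-1)   # (n_frames, n_fft//2 + 1), complex
    mag, phase = np.abs(spec), np.angle(spec)
    return mag ** power, phase            # compressed magnitude + phase

y = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats, phase = stft_features(y)
print(feats.shape)
```

A time-domain front-end would replace the fixed FFT basis with a learned 1D convolution over short windows, trading interpretable spectra for learned filters.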
Visual Feature Pipelines
Visual signals are extracted by cropping lip or full-face regions per frame, feeding them to pretrained visual encoders (e.g., ResNet-18 for face/lip, BRAVEn/CTCNet-Lip for lip motion, or face-ID models such as FaceNet) (Ephrat et al., 2018, Kalkhorani et al., 2024, Pegg et al., 2023). Lip embeddings provide robust phonetic-alignment cues, while face embeddings encode identity. Temporal convolutional or recurrent units (e.g., BLSTM, TCN) process these to capture articulation dynamics (Xue et al., 26 Sep 2025, Kalkhorani et al., 2024).
Modality Fusion Mechanisms
Fusion strategies include (1) concatenation of audio and visual embeddings prior to mask prediction (Ephrat et al., 2018, Makishima et al., 2021, Kalkhorani et al., 2024); (2) attention-based fusion, where visual features modulate audio paths via FiLM, cross-attention, or gating operations (Pegg et al., 2023, Lee et al., 2023, Xue et al., 26 Sep 2025); (3) recursive fusion, where intermediate audio estimates are fed back, together with visual streams, to an AVSR or semantic block for further refinement (Xue et al., 26 Sep 2025).
Early fusion (before mask estimation) and late fusion (on output features or masks) both appear, with the former yielding better separation and speaker mapping in recent SOTA architectures (Kalkhorani et al., 2024, Pegg et al., 2024).
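The FiLM-style attention fusion mentioned above can be sketched in a few lines: a visual embedding predicts a per-channel scale and shift applied to the audio features. The dimensions and the random linear projections are illustrative assumptions, not parameters from any cited architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_fuse(audio_feats, visual_emb, W_gamma, W_beta):
    """FiLM fusion: the visual embedding predicts a per-channel
    scale (gamma) and shift (beta) that modulate the audio path."""
    gamma = visual_emb @ W_gamma   # (T, C)
    beta = visual_emb @ W_beta     # (T, C)
    return gamma * audio_feats + beta

T, C, V = 100, 64, 32              # frames, audio channels, visual dim
audio = rng.standard_normal((T, C))
visual = rng.standard_normal((T, V))   # visual embeddings, time-aligned to audio
W_gamma = rng.standard_normal((V, C)) * 0.1
W_beta = rng.standard_normal((V, C)) * 0.1

fused = film_fuse(audio, visual, W_gamma, W_beta)
print(fused.shape)
```

Concatenation fusion would instead stack `audio` and a projected `visual` along the channel axis before mask prediction; cross-attention would let audio frames attend over visual frames.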
3. Separation Network Architectures
A diversity of architectures has been applied, spanning classical mask-based, end-to-end time-domain, and modern diffusion-model based systems:
| Approach | Modality Domain | Fusion Technique | Model Backbone |
|---|---|---|---|
| Mask-based (early) | STFT | Concatenation, BLSTM | CNN/BLSTM, U-Net, TCN |
| Time-domain | Waveform (Conv1D) | Linear/attention | Conv-TasNet/TDANet, GRU |
| TF-recurrent/attention | STFT (complex masking) | CAF, dual-path RNN | RTFS blocks, TF-AR, SRU |
| Diffusion priors | Waveform/STFT | Gated/FiLM/CrossAttn | U-Net/NCSN++M, DDPM, ODE-SDE |
- Mask-based frameworks estimate real, complex, or IBM-like masks for each target and apply them to the mixed spectrogram, followed by iSTFT (Ephrat et al., 2018, Kalkhorani et al., 2024).
- Time-domain frameworks (TDANet, Conv-TasNet) avoid explicit frequency representation, applying 1D convs and recurrent/refinement modules, often enhanced with multi-stage cross-modal fusion (Pegg et al., 2024, Zhang et al., 2020).
- Recurrent and attention-based time-frequency models (e.g., RTFS-Net) decouple temporal and spectral modeling, separately using recurrent units along each axis, with cross-dimensional attention/gating for fusion (Pegg et al., 2023).
- Diffusion-based approaches train generative priors for speech and (in advanced systems) structured noise, using score matching and inverse-sampling (via SDE/ODE) for separation, relying on visual guidance to bias the generative process toward the correct speaker and solve the ill-posedness typical of single-mic separation (Lee et al., 2023, Yemini et al., 17 Sep 2025, Yemini et al., 1 Feb 2026).
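The mask-based pipeline in the first bullet can be illustrated with an oracle ratio mask on a toy two-tone mixture. This is a hedged sketch: real systems predict the mask from audio-visual features rather than computing it from the (unavailable) clean sources, and the tones here merely stand in for speech.

```python
import numpy as np

def frame_fft(x, n_fft=256, hop=64):
    """Simple STFT: windowed frames followed by an rFFT."""
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    return np.fft.rfft(
        np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n)]),
        axis=-1,
    )

sr = 8000
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * 300 * t)    # "speaker 1"
s2 = np.sin(2 * np.pi * 1200 * t)   # "speaker 2"
mix = s1 + s2

S1, S2, Y = frame_fft(s1), frame_fft(s2), frame_fft(mix)

# Oracle ratio mask for speaker 1 (learned models predict this instead)
mask1 = np.abs(S1) / (np.abs(S1) + np.abs(S2) + 1e-8)
est1 = mask1 * Y   # masked mixture spectrogram; iSTFT would give the waveform

# The mask is near 1 in speaker-1 bins and near 0 in speaker-2 bins
print(mask1[:, 10].mean(), mask1[:, 38].mean())
```

Complex masks extend this by also correcting phase, which is where real-valued masks lose fidelity.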
4. Training Objectives, Auxiliary Losses, and Supervision
The principal separation losses are the scale-invariant signal-to-noise ratio (SI-SNR), SDR improvement (BSS-Eval), magnitude L1/L2 distance in the spectral domain, and (for subjective fidelity) PESQ and ESTOI/STOI. In the mask-based and time-domain frameworks, permutation-invariant training (PIT) resolves output-label ambiguities (Ephrat et al., 2018, Pegg et al., 2024, Kalkhorani et al., 2024).
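SI-SNR and its permutation-invariant wrapper can be written compactly. The sketch below follows the standard definitions (zero-mean signals, projection onto the reference); it is illustrative rather than the exact loss code of any cited system.

```python
import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (est @ ref) / (ref @ ref + eps) * ref   # projection onto reference
    noise = est - proj
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps))

def pit_si_snr(ests, refs):
    """Permutation-invariant objective: score the best speaker assignment."""
    K = len(refs)
    return max(
        np.mean([si_snr(ests[p[i]], refs[i]) for i in range(K)])
        for p in itertools.permutations(range(K))
    )

rng = np.random.default_rng(0)
refs = [rng.standard_normal(8000) for _ in range(2)]
ests = [refs[1] + 0.01 * rng.standard_normal(8000),   # outputs come out swapped
        refs[0] + 0.01 * rng.standard_normal(8000)]
print(pit_si_snr(ests, refs))   # high despite the swapped output order
```

With visual conditioning, each output is tied to a specific face track, so the permutation search often becomes unnecessary; PIT remains the fallback for audio-only or weakly conditioned outputs.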
Augmenting signal-level objectives, many systems add cross-modal or correspondence constraints:
- Cross-modal correspondence loss (CMC): aligns separated audio features with synchronized visual features by maximizing frame-level cosine similarity for matching pairs and minimizing for negatives (Makishima et al., 2021, Wang et al., 2022).
- Contrastive and adversarial objectives: maximize alignment in identity/phonetic embedding spaces, using speaker-ID and lip-reading vectors, via triplet losses or adversarially trained discriminators (Wang et al., 2022).
- Wasserstein and denoising score matching: diffusion models use denoising score-matching losses between the true data and denoised samples (Yemini et al., 17 Sep 2025, Yemini et al., 1 Feb 2026, Lee et al., 2023), with some (e.g., UniVoiceLite) explicitly regularizing the audio-visual latent space using Wasserstein metrics (Park et al., 7 Dec 2025).
Self-supervision and fully unsupervised settings are realized in several recent systems, which train only on clean speech and/or noise samples, never seeing paired noisy/clean mixes (Park et al., 7 Dec 2025, Yemini et al., 17 Sep 2025, Yemini et al., 1 Feb 2026).
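The frame-level cross-modal correspondence idea above can be sketched as a hinge loss over cosine similarities. The margin value and embedding dimensions are illustrative assumptions, not the settings of the cited papers.

```python
import numpy as np

def cmc_loss(audio_emb, visual_emb, neg_visual_emb, margin=0.5):
    """Cross-modal correspondence loss: pull matched audio/visual
    frames together, push mismatched pairs at least `margin` apart."""
    def cos(a, b):
        return np.sum(a * b, axis=-1) / (
            np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
        )
    pos = cos(audio_emb, visual_emb)       # matched pairs
    neg = cos(audio_emb, neg_visual_emb)   # mismatched pairs
    return np.mean(np.maximum(0.0, margin - pos + neg))

rng = np.random.default_rng(0)
v = rng.standard_normal((50, 128))               # target speaker's video frames
a = v + 0.1 * rng.standard_normal((50, 128))     # audio aligned with that video
v_neg = rng.standard_normal((50, 128))           # another speaker's video

loss_matched = cmc_loss(a, v, v_neg)
loss_shuffled = cmc_loss(a, v_neg, v)            # wrong pairing costs more
print(loss_matched, loss_shuffled)
```

In training, the loss is added to the signal-level objective so the separator's outputs stay temporally aligned with the correct face track.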
5. Representative Architectures and Benchmark Performance
Recent years have produced a range of SOTA architectures, characterized by their separation backbone, fusion approach, efficiency, and empirical performance:
- AV-CrossNet achieves SOTA SI-SDRi on LRS2/3-2Mix, VoxCeleb2-2Mix, and NTCD-TIMIT+WHAM! with only early fusion, 12 attention blocks, and no recurrence, demonstrating $16.8$–$18.3$ dB SI-SDRi with robust generalization (Kalkhorani et al., 2024).
- TDFNet realizes low-latency, real-time-capable separation (sub-100 ms) by using only MHSA/GRU blocks for refinement, yielding $15.8$ dB SI-SNRi on LRS2-2Mix with $6.5$ M parameters and substantially fewer MACs than CTCNet (Pegg et al., 2024).
- RTFS-Net operates in the TF-domain with a $0.7$ M parameter core and $14.9$ dB SI-SNRi, introducing cross-modal attention fusion (CAF) and demonstrating that parameter-efficient models can outperform large time-domain networks (Pegg et al., 2023).
- CSFNet (Coarse-to-Separate-Fine) executes recursive semantic enhancement, feeding coarse estimates and lips to an AVSR for refined semantic embedding, pushing clean/noisy SI-SDRi up to $19.4$ dB (LRS3-2Mix) and $15.67$ dB under noise, surpassing prior SOTA for multi-speaker and noisy settings (Xue et al., 26 Sep 2025).
- Diffusion Priors: SSNAPS applies decoupled annealing posterior sampling with visual-guided speech and audio-only noise priors, handling 1–3 speakers and ambient noise in a fully unsupervised setting, outperforming leading supervised baselines in WER for all configurations (Yemini et al., 1 Feb 2026). AVDiffuSS employs cross-attention-based U-Nets in a two-stage diffusion framework, yielding $12.0$ dB SI-SDR on VoxCeleb2 and the highest MOS naturalness ratings ($4.44$) among AV and audio-only diffusion methods (Lee et al., 2023).
The following table compares selected state-of-the-art models on standard benchmarks (all values from the cited works):
| Model | LRS2-2Mix | LRS3-2Mix | VoxCeleb2-2Mix | NTCD-TIMIT+WHAM! (noisy) |
|---|---|---|---|---|
| AV-CrossNet | 16.8 dB | 18.3 dB | 14.6 dB | 15.6 dB |
| TDFNet-large | 15.8 dB | — | — | — |
| RTFS-Net-12 | 14.9 dB | — | — | — |
| CSFNet (fine) | 16.8 dB | 19.4 dB | 14.8 dB | 15.7 dB |
| SSNAPS (WER-N, K=2) | — | — | 19.2% | — |
| AVDiffuSS | — | — | 12.0 dB | — |
Performance values are SI-SDRi unless otherwise stated. See individual papers for additional metrics (PESQ, STOI).
6. Advanced Topics: Beyond Masking—Generative and Correlation-based Modelling
Recent advances emphasize generative modeling (diffusion/score-based methods) and explicit cross-modal disentanglement:
- Diffusion Inverse Sampling: Both SSNAPS and DAVSS-NM cast separation as joint posterior sampling with separate diffusion priors for speech (conditioned on visual streams) and noise; this enables clean disaggregation even when no paired noisy-clean audio is available at train time. Posterior sampling proceeds by ODE-based prior prediction and Langevin correction, combining generative priors and audio likelihood (Yemini et al., 17 Sep 2025, Yemini et al., 1 Feb 2026).
- Multi-modal Multi-Correlation Learning: Decomposing the interaction between audio and video into speaker identity (timbre–face) and phonetic content (phoneme–lip motion) correlation, and maximizing these via contrastive or adversarial losses, allows the model to pull apart otherwise inseparable cases (e.g., same-gender mixtures, similar content) (Wang et al., 2022).
- Disentanglement: Adversarially removing identity from visual features, so only speech-related information is passed to the separator, yields better generalization and higher SDRi versus pure audio or naive AV baselines (Zhang et al., 2020).
- Recursive Semantic Enhancement: Feeding back intermediate separations to an AVSR produces richer semantically descriptive visual embeddings, which when re-fused yield systematic SI-SDR/SNR improvements, especially in multi-speaker and occluded-visual cases (Xue et al., 26 Sep 2025).
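The posterior-sampling idea behind the diffusion approaches can be illustrated in one dimension, where everything is analytic. This is a heavily simplified toy, not the SSNAPS/DAVSS-NM samplers: a Gaussian prior stands in for the learned speech prior, a Gaussian likelihood for the mixture consistency term, and plain Langevin dynamics for the annealed ODE/Langevin scheme.

```python
import numpy as np

# Toy 1-D analogue of score-based posterior sampling: prior x ~ N(0, 1),
# observation y = x + w with w ~ N(0, sigma_y^2). The posterior score is
# the sum of the prior score and the likelihood score; Langevin dynamics
# then draws samples from p(x | y).
rng = np.random.default_rng(0)
sigma_y = 0.5
x_true = 1.0
y = x_true + sigma_y * rng.standard_normal()

def posterior_score(x):
    prior = -x                             # d/dx log N(x; 0, 1)
    likelihood = (y - x) / sigma_y ** 2    # d/dx log N(y; x, sigma_y^2)
    return prior + likelihood

eps = 1e-3
x = 0.0
samples = []
for step in range(20000):
    x = x + 0.5 * eps * posterior_score(x) + np.sqrt(eps) * rng.standard_normal()
    if step > 5000:                        # discard burn-in
        samples.append(x)

analytic_mean = y / (1 + sigma_y ** 2)     # closed-form posterior mean
print(np.mean(samples), analytic_mean)
```

In the full systems, the prior score comes from a visually conditioned diffusion network over waveforms or spectrograms (with a separate prior for structured noise), and the likelihood enforces that the sampled sources sum to the observed mixture.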
These strategies extend model robustness, enable unsupervised training, and yield substantial gains on previously unsolved “hard” cases, such as highly similar voice timbres, severe noise, and mismatched scenarios.
7. Open Challenges, Application Domains, and Future Directions
Despite remarkable progress, audio-visual single-microphone speech separation faces several open challenges:
- Occlusion and video dropout: All leading approaches require consistent, high-quality face/lip tracks. Degradation due to face occlusion, off-angle shots, or frame dropout can sharply reduce separation accuracy (Pegg et al., 2023, Xue et al., 26 Sep 2025).
- Unseen speakers and off-screen sources: While explicit visual-speech disentanglement and FiLM/attention fusion help, handling off-screen speakers or dynamic speaker count requires further algorithmic innovation (Yemini et al., 1 Feb 2026).
- Latency and real-time inference: Diffusion-based and multi-stage systems remain slower than mask-based approaches. Lightweight architectures such as TDFNet and UniVoiceLite demonstrate progress toward real-time, on-device deployment (Pegg et al., 2024, Park et al., 7 Dec 2025).
- Generalization to new domains: AV-CrossNet and CSFNet report strong cross-dataset transfer, but face-encoder and visual front-end mismatch remains a concern (Kalkhorani et al., 2024, Xue et al., 26 Sep 2025).
- Noise/general acoustics modeling: Jointly estimating structured ambient noise as a source improves speech separation and enables downstream applications (e.g., acoustic scene detection) (Yemini et al., 1 Feb 2026, Yemini et al., 17 Sep 2025).
- Unified speech enhancement & separation: Models such as UniVoiceLite address both speech enhancement and multi-speaker separation in a single lightweight, visual-guided WAE, showing that joint modeling is practical even in unsupervised settings (Park et al., 7 Dec 2025).
- Integration with AVSR and downstream tasks: Recursive pipelines, disentanglement, and generative AV priors offer promising synergies with speech recognition, diarization, and acoustic event detection, as demonstrated by improvements in word error rate and scene classification (Xue et al., 26 Sep 2025, Yemini et al., 1 Feb 2026).
In summary, audio-visual single-microphone speech separation leverages cross-modal cues and advanced deep learning/fusion architectures to robustly recover constituent signals far beyond what is achievable with audio alone. Ongoing research focuses on parameter efficiency, unsupervised learning, generative modeling, generalized source count, and greater robustness to visually degraded scenarios.