AD-FlowTSE: Deterministic TSE with Flow Matching
- AD-FlowTSE is a deterministic, generative target speaker extraction framework that uses optimal transport to model audio separation as a continuous flow between interference and target.
- It replaces standard mask estimation with a flow matching approach, incorporating mixing ratio-aware inference to streamline the extraction process.
- Empirical results demonstrate state-of-the-art SI-SDR and PESQ improvements with efficient one-step inference, despite challenges in accurate mixing ratio estimation.
Adaptive Deterministic Flow Matching for Target Speaker Extraction (AD-FlowTSE) is a deterministic, generative framework for target speaker extraction (TSE) that takes a flow-matching viewpoint, formulating separation as a continuous, mixing-ratio-aware transformation between the interference (background) and target source representations. AD-FlowTSE replaces traditional mask estimation and fixed diffusion processes with an optimal-transport-inspired, one-dimensional flow, yielding state-of-the-art extraction accuracy together with compelling efficiency, particularly when coupled with mixing-ratio (MR) estimation and one-step inference (Hsieh et al., 19 Oct 2025; Shimizu et al., 21 Dec 2025).
1. Problem Definition and Motivating Principles
AD-FlowTSE is designed for the scenario where an audio mixture combines a target source $s$ and background $b$ as $x = s + b$. The extraction is conditioned on both the mixture and an enrollment (reference) utterance from the desired speaker. The framework operates in the complex short-time Fourier transform (STFT) domain, mapping the mixture, target, and background waveforms to spectral representations $X$, $S$, and $B$, with the enrollment utterance embedded as a conditioning representation $E$.
Classical discriminative TSE methods output a mask or directly estimate $S$ from $X$. In contrast, AD-FlowTSE treats the extraction as a continuous deterministic transport (an ODE) moving $B$ to $S$. This is achieved by learning a neural velocity field $v_\theta(x_t, t, E)$: integrating the resulting ODE recovers the target $S$ from the background $B$, conditioned on the auxiliary enrollment $E$.
Key to the paradigm is the explicit modeling of the mixture as a convex combination rather than a purely additive process: $X = \alpha S + (1 - \alpha) B$, with $\alpha \in [0, 1]$ termed the mixing ratio (MR). This formulation confers scale invariance and ties the transport geometry in spectral space to the physical mixture composition (Shimizu et al., 21 Dec 2025).
2. Mathematical Flow Structure
2.1 Mixture-as-Transport Convex Combination
In the spectral (STFT) domain, the flow's endpoints are $x_0 = B$ and $x_1 = S$. The entire optimal-transport path is parameterized by $t \in [0, 1]$:

$$x_t = t\,S + (1 - t)\,B.$$

The observed mixture itself corresponds to $x_\alpha = X$ along the path, with $\alpha$ the true (oracle) mixing ratio.
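The path geometry can be checked numerically. In the following NumPy sketch (toy arrays standing in for STFT spectra; all names hypothetical), the mixture is placed on the straight path at $t = \alpha$, and the oracle MR is recovered by projecting the chord $X - B$ onto the direction $S - B$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complex "spectra" standing in for STFT frames of target S and background B.
S = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
B = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))

def path(t, S, B):
    """Straight (optimal-transport) interpolation x_t = t*S + (1-t)*B."""
    return t * S + (1 - t) * B

alpha = 0.3                      # oracle mixing ratio
X = path(alpha, S, B)            # observed mixture sits on the path at t = alpha

# Recover alpha by projecting (X - B) onto the chord direction (S - B).
d = S - B
alpha_hat = np.real(np.vdot(d, X - B)) / np.real(np.vdot(d, d))

print(round(alpha_hat, 6))       # -> 0.3
```

Since $X - B = \alpha (S - B)$ exactly on the path, the projection returns $\alpha$ up to floating-point error.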
2.2 Deterministic Flow Matching Objective
The model's velocity field $v_\theta(x_t, t, E)$ is trained to match the instantaneous velocity at every point along the straight trajectory in spectral space:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t, S, B}\,\big\| v_\theta(x_t, t, E) - (S - B) \big\|_2^2,$$

where $S - B$ is the (constant) velocity of the straight path. Compared to standard flow matching, the endpoints here are the background and target rather than a Gaussian prior and data, focusing the model explicitly on the TSE-relevant interpolation.
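Because the trajectory is straight, the instantaneous velocity the network must regress is the constant difference of the endpoints. A quick finite-difference check on toy arrays (hypothetical names) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal(16)   # toy target spectrum
B = rng.standard_normal(16)   # toy background spectrum

def path(t):
    """Straight interpolation between background (t=0) and target (t=1)."""
    return t * S + (1 - t) * B

# Finite-difference velocity at several points along the trajectory.
eps = 1e-6
for t in (0.1, 0.5, 0.9):
    v_fd = (path(t + eps) - path(t - eps)) / (2 * eps)
    assert np.allclose(v_fd, S - B, atol=1e-4)

print("velocity is constant along the path: v_t = S - B")
```

This constancy is what makes the regression target trivial to compute during training: no simulation of the ODE is needed.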
2.3 Mixing Ratio-Aware Inference
At inference, the true mixing ratio $\alpha$ is generally unknown. AD-FlowTSE uses a neural predictor to estimate $\hat{\alpha}$ from the mixture and enrollment, initializing the transport at $t = \hat{\alpha}$, effectively starting the ODE at $x_{\hat{\alpha}} = X$. Only the segment $[\hat{\alpha}, 1]$ is integrated, which adapts inference to the actual content of the mixture and reduces redundant computation (Shimizu et al., 21 Dec 2025).
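MR-aware integration can be sketched with an oracle velocity field (the true constant $S - B$, for which Euler integration is exact; a trained network would replace it, and all names here are hypothetical). The ODE starts at the mixture $X$ at time $\hat{\alpha}$ and only the remaining segment is integrated:

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.standard_normal(16)       # toy target spectrum
B = rng.standard_normal(16)       # toy background spectrum

alpha = 0.4
X = alpha * S + (1 - alpha) * B   # observed mixture lies on the path at t = alpha

def velocity(x, t):
    """Oracle velocity field; a trained v_theta(x, t, E) replaces this."""
    return S - B

def extract(X, alpha_hat, n_steps=4):
    """Euler integration of the flow over the remaining segment [alpha_hat, 1]."""
    x, t = X.copy(), alpha_hat
    dt = (1 - alpha_hat) / n_steps
    for _ in range(n_steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x

S_hat = extract(X, alpha_hat=alpha)
print(np.max(np.abs(S_hat - S)))  # ~0 for the oracle field
```

Note that a larger $\hat{\alpha}$ (a target-dominated mixture) leaves a shorter segment to integrate, which is the source of the efficiency gain described above.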
3. Extensions: Mean-Flow and One-Step Inference
The MeanFlow-TSE extension replaces the instantaneous-velocity regression with a mean-velocity ("MeanFlow") objective, better suited for single-step (NFE = 1) generative inference (Shimizu et al., 21 Dec 2025). The objective learns the average "jump" velocity needed for a one-step transport from $x_{\hat{\alpha}} = X$ to $x_1 = S$:

$$u(x_t, r, t) = \frac{1}{t - r} \int_r^t v(x_\tau, \tau)\, d\tau,$$

which the network $u_\theta(x_t, r, t, E)$ regresses over sampled time pairs $r \le t$.
A hybrid target

$$u_{\mathrm{tgt}} = \lambda\, u_{\mathrm{mf}} + (1 - \lambda)\, v,$$

with $\lambda \in [0, 1]$, interpolates between the pure mean-flow target $u_{\mathrm{mf}}$ and the standard flow-matching velocity $v = S - B$, with adaptive weighting and $\lambda$-curriculum scheduling to stabilize and anneal training. The ultimate adaptive loss is

$$\mathcal{L} = \mathrm{sg}(w)\, \|\Delta\|_2^2, \qquad w = \frac{1}{\left(\|\Delta\|_2^2 + c\right)^p},$$

where $\Delta = u_\theta(x_t, r, t, E) - \mathrm{sg}(u_{\mathrm{tgt}})$ and $\mathrm{sg}(\cdot)$ denotes stop-gradient.
At inference, the estimate of $S$ is computed in a single step:

$$\hat{S} = X + (1 - \hat{\alpha})\, u_\theta(X, \hat{\alpha}, 1, E),$$

where $r = \hat{\alpha}$ and $t = 1$ (Shimizu et al., 21 Dec 2025).
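Under the straight-path geometry the average velocity over $[\hat{\alpha}, 1]$ coincides with the constant $S - B$, so a single jump suffices when that velocity is known. A toy sketch (oracle average velocity standing in for the trained $u_\theta$; names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.standard_normal(16)       # toy target spectrum
B = rng.standard_normal(16)       # toy background spectrum

alpha = 0.25
X = alpha * S + (1 - alpha) * B   # mixture on the path at t = alpha

def u_mean(x, r, t):
    """Oracle average velocity over [r, t]; trained u_theta(x, r, t, E) replaces this."""
    return S - B

alpha_hat = alpha                 # assume a perfect MR predictor for this demo
S_hat = X + (1 - alpha_hat) * u_mean(X, alpha_hat, 1.0)   # single NFE=1 jump

print(np.max(np.abs(S_hat - S)))  # ~0 with the oracle velocity
```

The learned network approximates this average velocity from data, so the practical one-step estimate inherits errors from both $u_\theta$ and $\hat{\alpha}$.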
4. Algorithmic Implementation
Training Pseudocode
Initialization: sample target $S$, background $B$, enrollment $E$, and mixing ratio $\alpha$; compute the oracle mixture $X = \alpha S + (1 - \alpha) B$.
For each training batch:
- Sample $t, r \in [0, 1]$ with $r \le t$ (e.g., from a log-normal proposal).
- Compute $x_t = t\,S + (1 - t)\,B$ and the instantaneous velocity $v = S - B$.
- Evaluate $u_\theta(x_t, r, t, E)$ and its time derivative $\partial_t u_\theta$ (via a Jacobian–vector product), giving the mean-flow target $u_{\mathrm{mf}} = v - (t - r)\,\mathrm{sg}(\partial_t u_\theta)$.
- Form the hybrid target $u_{\mathrm{tgt}} = \lambda\, u_{\mathrm{mf}} + (1 - \lambda)\, v$, and residual $\Delta = u_\theta - \mathrm{sg}(u_{\mathrm{tgt}})$.
- Adaptive weight $w = 1/(\|\Delta\|_2^2 + c)^p$.
- Compute $\mathcal{L} = \mathrm{sg}(w)\,\|\Delta\|_2^2$ and update parameters.
- Update $\lambda$ according to the curriculum.
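The adaptive weighting step above can be sketched directly. In this minimal NumPy version (hypothetical constants `c` and `p`; the residual stands in for $u_\theta - \mathrm{sg}(u_{\mathrm{tgt}})$, and stop-gradient is a no-op without autograd):

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_loss(residual, c=1e-3, p=1.0):
    """Adaptive-weighted squared error: w = 1/(||delta||^2 + c)^p (treated as a
    stop-gradient constant in a real autograd setup), times ||delta||^2."""
    sq = float(np.sum(residual ** 2))
    w = 1.0 / (sq + c) ** p       # would be detached from the graph in practice
    return w * sq

small = adaptive_loss(1e-3 * rng.standard_normal(32))   # well-fit example
large = adaptive_loss(10.0 * rng.standard_normal(32))   # hard example

# For p = 1 the loss is sq/(sq + c): large residuals saturate toward 1
# instead of dominating the batch, which stabilizes training.
print(small, large)
```

The saturation behavior is the point of the weighting: outlier residuals contribute a bounded loss rather than an unbounded squared error.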
One-Step Inference
Input: Mixture $X$, enrollment $E$.
- Predict $\hat{\alpha}$ from $(X, E)$ with the MR estimator.
- Set $r = \hat{\alpha}$, $t = 1$, and $x_r = X$.
- Evaluate $\hat{u} = u_\theta(X, \hat{\alpha}, 1, E)$.
- Compute $\hat{S} = X + (1 - \hat{\alpha})\,\hat{u}$.
- Output: the inverse STFT of $\hat{S}$ as the estimated waveform (Shimizu et al., 21 Dec 2025).
5. Empirical Performance and Practical Implications
MeanFlow-TSE (AD-FlowTSE + mean-flow) achieves:
- State-of-the-art separation quality in SI-SDR and PESQ, e.g., +1.3 dB SI-SDR improvement over AD-FlowTSE on Libri2Mix (Clean).
- Single-step inference (NFE=1) with a low real-time factor (RTF ≈ 0.018 on GPU).
- Marginal increase in model parameters and memory versus AD-FlowTSE.
- Preserved generative naturalness: one-step generation mitigates the accumulation of artifacts sometimes introduced by multi-step samplers.
A key empirically demonstrated advantage is that performance peaks at roughly 1–5 steps and can degrade with more, confirming that MR-aware initialization suffices to position mixtures near the optimal endpoint (Hsieh et al., 19 Oct 2025, Shimizu et al., 21 Dec 2025).
6. Limitations and Open Problems
AD-FlowTSE’s principal limitation is its dependence on accurate mixing ratio estimation: errors in $\hat{\alpha}$ directly bias the single-step jump and can impair extraction quality. The present formulation is specialized to single-channel, non-reverberant mixtures; adaptation to reverberant or multi-channel cases remains open. On non-intrusive “naturalness” metrics (e.g., OVRL, DNSMOS), slight underperformance is observed versus some multi-step approaches, although core separation and perceptual metrics remain superior (Shimizu et al., 21 Dec 2025).
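The MR sensitivity can be made concrete: on the straight path, a one-step jump initialized at $\hat{\alpha}$ instead of $\alpha$ lands at $S + (\alpha - \hat{\alpha})(S - B)$, so the residual interference scales linearly with the MR error. A toy check with the oracle velocity (hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(5)
S = rng.standard_normal(16)       # toy target spectrum
B = rng.standard_normal(16)       # toy background spectrum

alpha = 0.3
X = alpha * S + (1 - alpha) * B

def one_step(alpha_hat):
    """One-step jump using the oracle mean velocity S - B."""
    return X + (1 - alpha_hat) * (S - B)

for err in (0.0, 0.05, 0.1):
    S_hat = one_step(alpha + err)
    bias = S_hat - S
    # Bias equals -err * (S - B): linear in the MR estimation error.
    assert np.allclose(bias, -err * (S - B))

print("one-step bias grows linearly with the MR error")
```

With a learned (imperfect) velocity field the error no longer cancels exactly, but the linear dependence on $\hat{\alpha} - \alpha$ explains why MR estimation quality dominates one-step performance.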
7. Related Work and Context
AD-FlowTSE represents a shift from predictive and fixed-schedule diffusion models, embedding generative transport directly along the background–target axis and conditioning on MR rather than time or noise variance (Hsieh et al., 19 Oct 2025). The mean-flow-based one-step formulation in “MeanFlow-TSE” bridges optimal-transport and flow-matching advances, extending the adaptive ODE approach (Shimizu et al., 21 Dec 2025). For foundational work on generative TSE and flow-matching, see “Adaptive Deterministic Flow Matching for Target Speaker Extraction” (Hsieh et al., 19 Oct 2025) and “MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow” (Shimizu et al., 21 Dec 2025). This paradigm is conceptually distinct from anomaly detection and low-rank decomposition in tensor flows (Schynol et al., 2024).