AD-FlowTSE: Deterministic TSE with Flow Matching

Updated 24 December 2025
  • AD-FlowTSE is a deterministic, generative target speaker extraction framework that uses optimal transport to model audio separation as a continuous flow between interference and target.
  • It replaces standard mask estimation with a flow matching approach, incorporating mixing ratio-aware inference to streamline the extraction process.
  • Empirical results demonstrate state-of-the-art SI-SDR and PESQ improvements with efficient one-step inference, despite challenges in accurate mixing ratio estimation.

Adaptive Deterministic Flow Matching for Target Speaker Extraction (AD-FlowTSE) is a deterministic, generative framework for target speaker extraction (TSE) that takes a flow-matching viewpoint, formulating separation as a continuous, mixing-ratio-aware transformation between the interference (background) and target source representations. AD-FlowTSE replaces traditional mask estimation and fixed diffusion processes with an optimal-transport-inspired, one-dimensional flow, yielding both state-of-the-art extraction accuracy and compelling efficiency, particularly when coupled with mixing-ratio (MR) estimation and one-step inference (Hsieh et al., 19 Oct 2025, Shimizu et al., 21 Dec 2025).

1. Problem Definition and Motivating Principles

AD-FlowTSE is designed for the scenario where an audio mixture $y \in \mathbb{R}^L$ combines a target source $s$ and background $b$ as $y = s + b$. The extraction is conditioned on both the mixture and an enrollment or reference utterance $e$ from the desired speaker. The framework operates in the complex short-time Fourier transform (STFT) domain, mapping the waveforms to spectral representations $Y = \mathrm{STFT}(y)$, $S = \mathrm{STFT}(s)$, $B = \mathrm{STFT}(b)$.

Classical discriminative TSE methods output a mask or directly estimate $S$ from $Y$. In contrast, AD-FlowTSE treats the extraction as a continuous deterministic transport (an ODE) moving $B$ to $S$. This is achieved by learning a neural velocity field $v_\theta(z, t, e)$: integrating $dz/dt = v_\theta(z, t, e)$ recovers $S$ from the background $B$, conditioned on the auxiliary enrollment $e$.

Key to the paradigm is the explicit modeling of the mixture as a convex combination rather than a purely additive process: $y = \lambda s + (1-\lambda) b$, with $\lambda \in [0, 1]$ termed the mixing ratio (MR). This formulation confers scale invariance and ties the transport geometry in spectral space to the physical mixture composition (Shimizu et al., 21 Dec 2025).
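The convex mixture model can be checked with a minimal numpy sketch; the signals, lengths, and the gain value below are arbitrary stand-ins for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)   # target source (hypothetical 1 s at 16 kHz)
b = rng.standard_normal(16000)   # background / interference
lam = 0.7                        # mixing ratio (MR), lambda in [0, 1]

# Convex-combination mixture model: y = lambda * s + (1 - lambda) * b
y = lam * s + (1 - lam) * b

# Scale invariance: a global gain g on all signals leaves the MR unchanged,
# since g*y = lambda*(g*s) + (1 - lambda)*(g*b).
g = 3.0
assert np.allclose(g * y, lam * (g * s) + (1 - lam) * (g * b))
```

Under a purely additive model $y = s + b$, rescaling the sources changes the apparent balance; the convex form keeps $\lambda$ meaningful under global gain changes.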

2. Mathematical Flow Structure

2.1 Mixture-as-Transport Convex Combination

In the spectral (STFT) domain, the flow's endpoints are $z_0 = B$ and $z_1 = S$. The entire optimal-transport path is parameterized by $t \in [0, 1]$:

$z_t = (1-t) B + t S$

The observed mixture $Y$ itself corresponds to $z_\lambda$ along the path, with $\lambda$ the true (oracle) mixing ratio.

2.2 Deterministic Flow Matching Objective

The model's velocity field $v_\theta(z_t, t, e)$ is trained to match the instantaneous velocity $(S - B)$ at every point along the straight trajectory in spectral space:

$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}[0,1],\, B, S, e} \left[ \| v_\theta(z_t, t, e) - (S - B) \|_2^2 \right]$

Compared to standard flow matching, here the endpoints are the background and target rather than a Gaussian and data, focusing explicitly on the relevant TSE interpolation.
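As an illustration of the objective, the sketch below evaluates the flow-matching loss at one sampled $t$. The spectrogram dimensions are arbitrary, real-valued arrays stand in for complex STFTs, and `v_theta` is a hypothetical oracle stand-in for the learned network (it returns the true velocity, so the loss is exactly zero):

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 257, 100                      # hypothetical STFT size (bins x frames)
B = rng.standard_normal((F, T))     # background spectrogram (real stand-in)
S = rng.standard_normal((F, T))     # target spectrogram

def v_theta(z, t, e=None):
    """Stand-in for the learned velocity field; an oracle returns S - B."""
    return S - B

t = rng.uniform()                    # t ~ Unif[0, 1]
z_t = (1 - t) * B + t * S            # point on the straight OT path

# Flow-matching loss: squared error against the true velocity (S - B)
loss = np.mean((v_theta(z_t, t) - (S - B)) ** 2)
```

In training, `v_theta` would be a neural network and the expectation would be approximated over batches of $(B, S, e, t)$.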

2.3 Mixing Ratio-Aware Inference

At inference, the true mixing ratio $\lambda$ is generally unknown. AD-FlowTSE uses a neural predictor $g_\phi(y, e)$ to estimate $\hat{\lambda}$, initializing the transport at $t = \hat{\lambda}$, effectively starting the ODE at $z_{\hat{\lambda}} = Y$. Only the segment $t \in [\hat{\lambda}, 1]$ is integrated, which adapts inference to the actual content of the mixture and reduces redundant computation (Shimizu et al., 21 Dec 2025).
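MR-aware integration can be sketched with forward Euler over the remaining segment $[\hat{\lambda}, 1]$. Here `v_theta` is an oracle stand-in returning the true $S - B$, and the oracle $\lambda$ is used as $\hat{\lambda}$, so the straight path is recovered exactly; with a learned field and estimated MR the result would only be approximate:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((257, 100))  # background spectrogram (stand-in)
S = rng.standard_normal((257, 100))  # target spectrogram
lam = 0.6                            # true mixing ratio
Y = (1 - lam) * B + lam * S          # observed mixture = z_lambda on the path

def v_theta(z, t):
    return S - B                     # oracle velocity field for illustration

# Start the ODE at t = lambda_hat and integrate dz/dt = v_theta to t = 1.
lam_hat, n_steps = lam, 4
z, t = Y.copy(), lam_hat
dt = (1.0 - lam_hat) / n_steps
for _ in range(n_steps):             # forward Euler
    z = z + dt * v_theta(z, t)
    t += dt

assert np.allclose(z, S)             # straight path: Euler recovers S exactly
```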

3. Extensions: Mean-Flow and One-Step Inference

The MeanFlow-TSE extension replaces the instantaneous-velocity regression with a mean (or "$\alpha$-Flow") objective, better suited for single-step (NFE = 1) generative inference (Shimizu et al., 21 Dec 2025). The objective learns the average "jump" velocity needed for a one-step transport from $t$ to $r$:

$v_{\mathrm{avg}} = (z_r - z_t)/(r - t) = S - B$

which holds identically along the affine path, since $z_r - z_t = (r - t)(S - B)$.

A hybrid target

$v_{t,r}^{\alpha} = \alpha (S - B) + (1-\alpha)\, v_\theta(z_\tau, \tau, r, e)$

with $\tau = \alpha r + (1-\alpha) t$, interpolates between pure mean-flow and standard flow matching, with adaptive weighting and $\alpha$-curriculum scheduling to stabilize and anneal training. The resulting adaptive loss is

$\mathcal{L}_{\mathrm{adapt}}(\theta) = \mathbb{E}\left[ \mathrm{sg}(w)\, \| \Delta \|_2^2 \right]$

where $w = \alpha / (\|\Delta\|_2^2 + c)$, $\Delta$ is the residual between the predicted and hybrid-target velocities, $\mathrm{sg}(\cdot)$ denotes stop-gradient, and $c$ is a small stabilizing constant.

At inference, the estimate of $S$ is computed in a single step:

$\hat{S} = z_{\mathrm{start}} + (1 - \hat{\lambda})\, v_\theta(z_{\mathrm{start}}, t_{\mathrm{start}}, 1, e)$

where $z_{\mathrm{start}} = Y$ and $t_{\mathrm{start}} = \hat{\lambda}$ (Shimizu et al., 21 Dec 2025).

4. Algorithmic Implementation

Training Pseudocode

Initialization: for each training batch, sample $S$, $B$, $e$, and compute the oracle $\lambda$.

For each training batch:

  • Sample $t, r \in [0,1]$ with $t < r$ (e.g., from a log-normal proposal).
  • Compute $z_t = (1-t)B + tS$ and $u = S - B$.
  • Set $\tau = \alpha r + (1-\alpha)t$ and $v_{\text{hybrid}} = \alpha u + (1-\alpha)\, v_\theta((1-\tau)B + \tau S, \tau, r, e)$.
  • Compute $v_{\text{pred}} = v_\theta(z_t, t, r, e)$ and the residual $\Delta = v_{\text{pred}} - v_{\text{hybrid}}$.
  • Compute the adaptive weight $w = \alpha/(\|\Delta\|_2^2 + c)$.
  • Compute $\mathcal{L}_{\text{adapt}}$ and update the parameters.
  • Update $\alpha$ according to the curriculum.
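The training step above can be sketched in numpy for a single sample. `v_theta` is again a hypothetical oracle stand-in, the spectrogram sizes and the values of $\alpha$ and $c$ are arbitrary, and since no autograd is involved the stop-gradient $\mathrm{sg}(w)$ reduces to treating $w$ as a constant:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((257, 100))      # background spectrogram (stand-in)
S = rng.standard_normal((257, 100))      # target spectrogram
alpha, c = 0.5, 1e-3                     # curriculum weight and stabilizer

def v_theta(z, t, r, e=None):
    return S - B                         # oracle network output for illustration

t, r = sorted(rng.uniform(size=2))       # sample t < r in [0, 1]
u = S - B                                # instantaneous velocity along the path
z_t = (1 - t) * B + t * S

# Hybrid target: mix of the true velocity and the model's own prediction at tau
tau = alpha * r + (1 - alpha) * t
z_tau = (1 - tau) * B + tau * S
v_hybrid = alpha * u + (1 - alpha) * v_theta(z_tau, tau, r)

v_pred = v_theta(z_t, t, r)
delta = v_pred - v_hybrid                # residual Delta
w = alpha / (np.sum(delta ** 2) + c)     # adaptive weight; sg(w): held constant
loss = w * np.sum(delta ** 2)            # L_adapt contribution of this sample
```

With the oracle field the residual vanishes and the loss is zero; in practice `loss` would be backpropagated through `v_pred` only, with `w` and `v_hybrid` detached.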

One-Step Inference

Input: mixture $Y$, enrollment $e$.

  • Predict $\hat{\lambda} \leftarrow g_\phi(Y, e)$.
  • Set $z_{\text{start}} \leftarrow Y$, $t_{\text{start}} \leftarrow \hat{\lambda}$, $r \leftarrow 1$.
  • Compute $v \leftarrow v_\theta(z_{\text{start}}, t_{\text{start}}, 1, e)$.
  • Compute $\hat{S} \leftarrow z_{\text{start}} + (1 - \hat{\lambda})\, v$.
  • Output: the inverse STFT of $\hat{S}$ as the estimated waveform (Shimizu et al., 21 Dec 2025).
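The one-step procedure can be sketched as follows, with `g_phi` and `v_theta` as hypothetical oracle stand-ins for the MR predictor and the learned mean-velocity field, and real arrays standing in for complex spectrograms (the final inverse STFT is omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((257, 100))  # background spectrogram (stand-in)
S = rng.standard_normal((257, 100))  # target spectrogram
lam = 0.35                           # true mixing ratio
Y = (1 - lam) * B + lam * S          # mixture spectrogram, i.e. z_lambda

def g_phi(Y, e=None):
    return lam                       # stand-in MR predictor (oracle here)

def v_theta(z, t, r, e=None):
    return S - B                     # stand-in mean-velocity field (oracle)

# One-step (NFE = 1) inference: a single jump from z_{lambda_hat} to z_1
lam_hat = g_phi(Y)
v = v_theta(Y, lam_hat, 1.0)
S_hat = Y + (1 - lam_hat) * v

assert np.allclose(S_hat, S)         # exact under oracle MR and velocity
# In practice S_hat would be mapped back to a waveform via inverse STFT.
```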

5. Empirical Performance and Practical Implications

MeanFlow-TSE (AD-FlowTSE + mean-flow) achieves:

  • State-of-the-art separation quality in SI-SDR and PESQ, e.g., +1.3 dB SI-SDR improvement over AD-FlowTSE on Libri2Mix (Clean).
  • Single-step inference (NFE = 1), with a real-time factor (RTF) of approximately 0.018 on GPU.
  • Only a marginal increase in model parameters and memory versus AD-FlowTSE.
  • Preserved generative naturalness: one-step generation mitigates the accumulation of artifacts sometimes introduced by multi-step samplers.

A key empirically demonstrated advantage is that performance peaks at $J = 1$–$5$ integration steps and can degrade with more, confirming that MR-aware initialization suffices to position mixtures near the optimal endpoint (Hsieh et al., 19 Oct 2025, Shimizu et al., 21 Dec 2025).

6. Limitations and Open Problems

AD-FlowTSE's principal limitation is its dependence on accurate mixing-ratio estimation: errors in $\hat{\lambda}$ directly bias the single-step jump and can impair extraction quality. The present formulation is specialized to single-channel, non-reverberant mixtures; adaptation to reverberant or multi-channel cases remains open. On non-intrusive "naturalness" metrics (e.g., OVRL, DNSMOS), slight underperformance is observed versus some multi-step approaches, although core separation and perceptual metrics remain superior (Shimizu et al., 21 Dec 2025).

AD-FlowTSE represents a shift from predictive and fixed-schedule diffusion models, embedding generative transport directly along the background–target axis and conditioning on MR rather than time or noise variance (Hsieh et al., 19 Oct 2025). The mean-flow-based one-step formulation in “MeanFlow-TSE” bridges optimal-transport and flow-matching advances, extending the adaptive ODE approach (Shimizu et al., 21 Dec 2025). For foundational work on generative TSE and flow-matching, see “Adaptive Deterministic Flow Matching for Target Speaker Extraction” (Hsieh et al., 19 Oct 2025) and “MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow” (Shimizu et al., 21 Dec 2025). This paradigm is conceptually distinct from anomaly detection and low-rank decomposition in tensor flows (Schynol et al., 2024).
