AD-FlowTSE: Deterministic TSE with Flow Matching

Updated 24 December 2025
  • AD-FlowTSE is a deterministic, generative target speaker extraction framework that uses optimal transport to model audio separation as a continuous flow between interference and target.
  • It replaces standard mask estimation with a flow matching approach, incorporating mixing ratio-aware inference to streamline the extraction process.
  • Empirical results demonstrate state-of-the-art SI-SDR and PESQ improvements with efficient one-step inference, despite challenges in accurate mixing ratio estimation.

Adaptive Deterministic Flow Matching for Target Speaker Extraction (AD-FlowTSE) is a deterministic, generative framework for target speaker extraction (TSE) that takes a flow-matching viewpoint, formulating separation as a continuous, mixing-ratio-aware transformation between the interference (background) and target source representations. AD-FlowTSE replaces traditional mask estimation and fixed diffusion processes with an optimal-transport-inspired, one-dimensional flow, yielding both state-of-the-art extraction accuracy and compelling efficiency, particularly when coupled with mixing-ratio (MR) estimation and one-step inference (Hsieh et al., 19 Oct 2025, Shimizu et al., 21 Dec 2025).

1. Problem Definition and Motivating Principles

AD-FlowTSE is designed for the scenario where an audio mixture $y \in \mathbb{R}^L$ combines a target source $s$ and background $b$ as $y = s + b$. The extraction is conditioned on both the mixture and an enrollment or reference utterance $e$ from the desired speaker. The framework operates in the complex short-time Fourier transform (STFT) domain, mapping the waveforms to spectral representations $Y = \mathrm{STFT}(y)$, $S = \mathrm{STFT}(s)$, $B = \mathrm{STFT}(b)$.

Classical discriminative TSE methods output a mask or directly estimate $S$ from $Y$. In contrast, AD-FlowTSE treats the extraction as a continuous deterministic transport (an ODE) moving $B$ to $S$. This is achieved by learning a neural velocity field $v_\theta(z, t, e)$: integrating $dz/dt = v_\theta(z, t, e)$ recovers $S$ from the background $B$, conditioned on the auxiliary enrollment $e$.

Key to the paradigm is the explicit modeling of the mixture as a convex combination rather than a purely additive process: $y = \lambda s + (1-\lambda) b$, with $\lambda \in [0, 1]$ termed the mixing ratio (MR). This formulation confers scale invariance and ties the transport geometry in spectral space to the physical mixture composition (Shimizu et al., 21 Dec 2025).
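The convex mixture model can be checked with a minimal numpy sketch; the signals, lengths, and the gain value below are arbitrary stand-ins for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)   # target source (hypothetical 1 s at 16 kHz)
b = rng.standard_normal(16000)   # background / interference
lam = 0.7                        # mixing ratio (MR), lambda in [0, 1]

# Convex-combination mixture model: y = lambda * s + (1 - lambda) * b
y = lam * s + (1 - lam) * b

# Scale invariance: a global gain g on all signals leaves the MR unchanged,
# since g*y = lambda*(g*s) + (1 - lambda)*(g*b).
g = 3.0
assert np.allclose(g * y, lam * (g * s) + (1 - lam) * (g * b))
```

Under a purely additive model $y = s + b$, rescaling the sources changes the apparent balance; the convex form keeps $\lambda$ meaningful under global gain changes.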

2. Mathematical Flow Structure

2.1 Mixture-as-Transport Convex Combination

In the spectral (STFT) domain, the flow's endpoints are $z_0 = B$ and $z_1 = S$. The entire optimal-transport path is parameterized by $t \in [0, 1]$:

$z_t = (1-t) B + t S$

The observed mixture $Y$ itself corresponds to $z_\lambda$ along the path, with $\lambda$ the true (oracle) mixing ratio.

2.2 Deterministic Flow Matching Objective

The model's velocity field $v_\theta(z_t, t, e)$ is trained to match the instantaneous velocity $(S - B)$ at every point along the straight trajectory in spectral space:

$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif}[0,1],\, B, S, e} \left[ \| v_\theta(z_t, t, e) - (S - B) \|_2^2 \right]$

Compared to standard flow matching, here the endpoints are the background and target rather than a Gaussian and data, focusing explicitly on the relevant TSE interpolation.
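As an illustration of the objective, the sketch below evaluates the flow-matching loss at one sampled $t$. The spectrogram dimensions are arbitrary, real-valued arrays stand in for complex STFTs, and `v_theta` is a hypothetical oracle stand-in for the learned network (it returns the true velocity, so the loss is exactly zero):

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 257, 100                      # hypothetical STFT size (bins x frames)
B = rng.standard_normal((F, T))     # background spectrogram (real stand-in)
S = rng.standard_normal((F, T))     # target spectrogram

def v_theta(z, t, e=None):
    """Stand-in for the learned velocity field; an oracle returns S - B."""
    return S - B

t = rng.uniform()                    # t ~ Unif[0, 1]
z_t = (1 - t) * B + t * S            # point on the straight OT path

# Flow-matching loss: squared error against the true velocity (S - B)
loss = np.mean((v_theta(z_t, t) - (S - B)) ** 2)
```

In training, `v_theta` would be a neural network and the expectation would be approximated over batches of $(B, S, e, t)$.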

2.3 Mixing Ratio-Aware Inference

At inference, the true mixing ratio $\lambda$ is generally unknown. AD-FlowTSE uses a neural predictor $g_\phi(y, e)$ to estimate $\hat{\lambda}$, initializing the transport at $t = \hat{\lambda}$, effectively starting the ODE at $z_{\hat{\lambda}} = Y$. Only the segment $t \in [\hat{\lambda}, 1]$ is integrated, which adapts inference to the actual content of the mixture and reduces redundant computation (Shimizu et al., 21 Dec 2025).
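MR-aware integration can be sketched with forward Euler over the remaining segment $[\hat{\lambda}, 1]$. Here `v_theta` is an oracle stand-in returning the true $S - B$, and the oracle $\lambda$ is used as $\hat{\lambda}$, so the straight path is recovered exactly; with a learned field and estimated MR the result would only be approximate:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((257, 100))  # background spectrogram (stand-in)
S = rng.standard_normal((257, 100))  # target spectrogram
lam = 0.6                            # true mixing ratio
Y = (1 - lam) * B + lam * S          # observed mixture = z_lambda on the path

def v_theta(z, t):
    return S - B                     # oracle velocity field for illustration

# Start the ODE at t = lambda_hat and integrate dz/dt = v_theta to t = 1.
lam_hat, n_steps = lam, 4
z, t = Y.copy(), lam_hat
dt = (1.0 - lam_hat) / n_steps
for _ in range(n_steps):             # forward Euler
    z = z + dt * v_theta(z, t)
    t += dt

assert np.allclose(z, S)             # straight path: Euler recovers S exactly
```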

3. Extensions: Mean-Flow and One-Step Inference

The MeanFlow-TSE extension replaces the instantaneous-velocity regression with a mean (or "$\alpha$-Flow") objective, better suited for single-step (NFE = 1) generative inference (Shimizu et al., 21 Dec 2025). The objective learns the average "jump" velocity needed for a one-step transport from $t$ to $r$:

$v_{\mathrm{avg}} = (z_r - z_t)/(r - t) = S - B$

which holds identically along the affine path, since $z_r - z_t = (r - t)(S - B)$.

A hybrid target

$v_{t,r}^{\alpha} = \alpha (S - B) + (1-\alpha)\, v_\theta(z_\tau, \tau, r, e)$

with $\tau = \alpha r + (1-\alpha) t$, interpolates between pure mean-flow and standard flow matching, with adaptive weighting and $\alpha$-curriculum scheduling to stabilize and anneal training. The resulting adaptive loss is

$\mathcal{L}_{\mathrm{adapt}}(\theta) = \mathbb{E}\left[ \mathrm{sg}(w)\, \| \Delta \|_2^2 \right]$

where $w = \alpha / (\|\Delta\|_2^2 + c)$, $\Delta$ is the residual between the predicted and hybrid-target velocities, $\mathrm{sg}(\cdot)$ denotes stop-gradient, and $c$ is a small stabilizing constant.

At inference, the estimate of $S$ is computed in a single step:

$\hat{S} = z_{\mathrm{start}} + (1 - \hat{\lambda})\, v_\theta(z_{\mathrm{start}}, t_{\mathrm{start}}, 1, e)$

where $z_{\mathrm{start}} = Y$ and $t_{\mathrm{start}} = \hat{\lambda}$ (Shimizu et al., 21 Dec 2025).

4. Algorithmic Implementation

Training Pseudocode

Initialization: for each training batch, sample $S$, $B$, $e$, and compute the oracle $\lambda$.

For each training batch:

  • Sample $t, r \in [0,1]$ with $t < r$ (e.g., from a log-normal proposal).
  • Compute $z_t = (1-t)B + tS$ and $u = S - B$.
  • Set $\tau = \alpha r + (1-\alpha)t$ and $v_{\text{hybrid}} = \alpha u + (1-\alpha)\, v_\theta((1-\tau)B + \tau S, \tau, r, e)$.
  • Compute $v_{\text{pred}} = v_\theta(z_t, t, r, e)$ and the residual $\Delta = v_{\text{pred}} - v_{\text{hybrid}}$.
  • Compute the adaptive weight $w = \alpha/(\|\Delta\|_2^2 + c)$.
  • Compute $\mathcal{L}_{\text{adapt}}$ and update the parameters.
  • Update $\alpha$ according to the curriculum.
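The training step above can be sketched in numpy for a single sample. `v_theta` is again a hypothetical oracle stand-in, the spectrogram sizes and the values of $\alpha$ and $c$ are arbitrary, and since no autograd is involved the stop-gradient $\mathrm{sg}(w)$ reduces to treating $w$ as a constant:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((257, 100))      # background spectrogram (stand-in)
S = rng.standard_normal((257, 100))      # target spectrogram
alpha, c = 0.5, 1e-3                     # curriculum weight and stabilizer

def v_theta(z, t, r, e=None):
    return S - B                         # oracle network output for illustration

t, r = sorted(rng.uniform(size=2))       # sample t < r in [0, 1]
u = S - B                                # instantaneous velocity along the path
z_t = (1 - t) * B + t * S

# Hybrid target: mix of the true velocity and the model's own prediction at tau
tau = alpha * r + (1 - alpha) * t
z_tau = (1 - tau) * B + tau * S
v_hybrid = alpha * u + (1 - alpha) * v_theta(z_tau, tau, r)

v_pred = v_theta(z_t, t, r)
delta = v_pred - v_hybrid                # residual Delta
w = alpha / (np.sum(delta ** 2) + c)     # adaptive weight; sg(w): held constant
loss = w * np.sum(delta ** 2)            # L_adapt contribution of this sample
```

With the oracle field the residual vanishes and the loss is zero; in practice `loss` would be backpropagated through `v_pred` only, with `w` and `v_hybrid` detached.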

One-Step Inference

Input: mixture $Y$, enrollment $e$.

  • Predict $\hat{\lambda} \leftarrow g_\phi(Y, e)$.
  • Set $z_{\text{start}} \leftarrow Y$, $t_{\text{start}} \leftarrow \hat{\lambda}$, $r \leftarrow 1$.
  • Compute $v \leftarrow v_\theta(z_{\text{start}}, t_{\text{start}}, 1, e)$.
  • Compute $\hat{S} \leftarrow z_{\text{start}} + (1 - \hat{\lambda})\, v$.
  • Output: the inverse STFT of $\hat{S}$ as the estimated waveform (Shimizu et al., 21 Dec 2025).
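The one-step procedure can be sketched as follows, with `g_phi` and `v_theta` as hypothetical oracle stand-ins for the MR predictor and the learned mean-velocity field, and real arrays standing in for complex spectrograms (the final inverse STFT is omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((257, 100))  # background spectrogram (stand-in)
S = rng.standard_normal((257, 100))  # target spectrogram
lam = 0.35                           # true mixing ratio
Y = (1 - lam) * B + lam * S          # mixture spectrogram, i.e. z_lambda

def g_phi(Y, e=None):
    return lam                       # stand-in MR predictor (oracle here)

def v_theta(z, t, r, e=None):
    return S - B                     # stand-in mean-velocity field (oracle)

# One-step (NFE = 1) inference: a single jump from z_{lambda_hat} to z_1
lam_hat = g_phi(Y)
v = v_theta(Y, lam_hat, 1.0)
S_hat = Y + (1 - lam_hat) * v

assert np.allclose(S_hat, S)         # exact under oracle MR and velocity
# In practice S_hat would be mapped back to a waveform via inverse STFT.
```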

5. Empirical Performance and Practical Implications

MeanFlow-TSE (AD-FlowTSE + mean-flow) achieves:

  • State-of-the-art separation quality in SI-SDR and PESQ, e.g., +1.3 dB SI-SDR improvement over AD-FlowTSE on Libri2Mix (Clean).
  • Single-step inference (NFE = 1), with a real-time factor (RTF) of approximately 0.018 on GPU.
  • Only a marginal increase in model parameters and memory versus AD-FlowTSE.
  • Preserved generative naturalness: one-step generation mitigates the accumulation of artifacts sometimes introduced by multi-step samplers.

A key empirically demonstrated advantage is that performance peaks at $J = 1$–$5$ integration steps and can degrade with more, confirming that MR-aware initialization suffices to position mixtures near the optimal endpoint (Hsieh et al., 19 Oct 2025, Shimizu et al., 21 Dec 2025).

6. Limitations and Open Problems

AD-FlowTSE's principal limitation is its dependence on accurate mixing-ratio estimation: errors in $\hat{\lambda}$ directly bias the single-step jump and can impair extraction quality. The present formulation is specialized to single-channel, non-reverberant mixtures; adaptation to reverberant or multi-channel cases remains open. On non-intrusive "naturalness" metrics (e.g., OVRL, DNSMOS), slight underperformance is observed versus some multi-step approaches, although core separation and perceptual metrics remain superior (Shimizu et al., 21 Dec 2025).

AD-FlowTSE represents a shift from predictive and fixed-schedule diffusion models, embedding generative transport directly along the background–target axis and conditioning on MR rather than time or noise variance (Hsieh et al., 19 Oct 2025). The mean-flow-based one-step formulation in “MeanFlow-TSE” bridges optimal-transport and flow-matching advances, extending the adaptive ODE approach (Shimizu et al., 21 Dec 2025). For foundational work on generative TSE and flow-matching, see “Adaptive Deterministic Flow Matching for Target Speaker Extraction” (Hsieh et al., 19 Oct 2025) and “MeanFlow-TSE: One-Step Generative Target Speaker Extraction with Mean Flow” (Shimizu et al., 21 Dec 2025). This paradigm is conceptually distinct from anomaly detection and low-rank decomposition in tensor flows (Schynol et al., 2024).
