
MeanFlow-TSE: One-Step Target Speaker Extraction

Updated 24 December 2025
  • The paper introduces MeanFlow-TSE, a one-step generative TSE framework that efficiently extracts target speech using a single network evaluation and achieves state-of-the-art perceptual metrics.
  • It employs a mixing-ratio-driven linear trajectory in the spectral domain and a neural mean-flow map to directly map mixtures to clean speech, yielding a +1.31 dB SI-SDR gain on Libri2Mix.
  • MeanFlow-TSE’s curriculum-based training and efficient architecture enable practical real-time deployment in streaming and edge applications with minimal computational overhead.

MeanFlow-TSE is a one-step generative target speaker extraction (TSE) framework based on mean-flow objectives. It is designed to efficiently and accurately extract a desired speaker’s voice from a multi-speaker mixture, while circumventing the computational burdens typical of diffusion and flow-matching approaches that require many iterative function evaluations. MeanFlow-TSE leverages a mixing-ratio-driven linear trajectory between background and target sources in the spectral domain and learns a neural mean-flow map, enabling direct, single-pass extraction of high-quality target speech. The system achieves real-time performance with state-of-the-art perceptual and fidelity metrics, as demonstrated on the Libri2Mix benchmark (Shimizu et al., 21 Dec 2025).

1. Problem Formulation and Background

In the TSE task, the observed signal is a mixture

y(t) = s(t) + b(t)

where s(t) is the target speaker and b(t) includes background and interfering speakers. Traditional discriminative approaches (e.g., Conv-TasNet with speaker embedding, SepFormer) estimate a mask or mapping f_\theta(y, e) \to \hat{s} that minimizes waveform losses (e.g., SI-SNR or \| \hat{s} - s \|^2). While fast, these can introduce artifacts and generalize poorly. Generative paradigms, including diffusion and flow-matching models, learn a conditional density p(s|y,e), but require multi-step sampling (typically ≥10 network function evaluations, NFEs), limiting deployment in low-latency or real-time scenarios.

MeanFlow-TSE extends the "AD-FlowTSE" paradigm, which is anchored in modeling flows between background and target in the STFT (spectral) domain. It introduces a mixing ratio \lambda \in [0,1] representing the balance between target and background:

Y = \lambda S + (1-\lambda) B

where Y = \mathrm{STFT}(y), S = \mathrm{STFT}(s), and B = \mathrm{STFT}(b).

2. One-Step Mean-Flow Objective

The method parameterizes the extraction path as a convex linear interpolation in the spectral space:

z_t = t S + (1-t) B, \quad t \in [0, 1]

where t governs the transition from background to the target. The instantaneous velocity is u = S - B, constant with respect to t, reflecting the straight-line nature of the mixing path.
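The straight-line property can be checked numerically. The following sketch uses toy random spectrograms (shapes and names are illustrative, not the paper's code) and verifies that the finite-difference velocity along the path equals the constant u = S - B:

```python
# Sketch of the mixing-ratio-driven linear trajectory on toy complex
# spectrograms S (target) and B (background); shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
F, T = 4, 5  # toy frequency bins x frames
S = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
B = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

def z(t, S, B):
    """Point on the straight path z_t = t*S + (1-t)*B."""
    return t * S + (1 - t) * B

# The instantaneous velocity dz/dt = S - B is constant in t:
u = S - B
t1, t2 = 0.3, 0.7
finite_diff = (z(t2, S, B) - z(t1, S, B)) / (t2 - t1)
print(np.allclose(finite_diff, u))  # True: straight-line path
```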

Instead of numerically integrating instantaneous velocities, MeanFlow-TSE learns the average velocity over the interval [t, r]:

z_r = z_t + (r-t)\, v_\text{avg}(z_t, t, r, e)

For the TSE problem at inference, t = \lambda (the estimated mixing ratio), r = 1, and e is the enrollment (reference speaker) embedding. The predicted target spectrogram is

\hat{S} = Y + (1 - \hat{\lambda})\, v_\theta(Y, \hat{\lambda}, 1, e)

with \hat{\lambda} output by a learned mixing-ratio predictor. This single-step update realizes direct source extraction without iterative refinement.
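The update has a simple sanity check: with the oracle average velocity v = S - B and the true mixing ratio, the single step recovers the target exactly. A minimal sketch (real-valued toy spectrograms; names are illustrative, not the paper's code):

```python
# Numerical check of the one-step update S_hat = Y + (1 - lambda) * v,
# using the oracle average velocity v = S - B of the straight path.
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 5))  # toy target spectrogram
B = rng.standard_normal((4, 5))  # toy background spectrogram
lam = 0.6                        # true mixing ratio

Y = lam * S + (1 - lam) * B      # observed mixture, Y = z_lambda
v_oracle = S - B                 # average velocity of the straight path
S_hat = Y + (1 - lam) * v_oracle # single-step jump from t = lambda to t = 1

print(np.allclose(S_hat, S))     # True: exact recovery with oracle velocity
```

The learned network v_θ approximates this oracle velocity, conditioned on the enrollment embedding to pick out the correct speaker.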

3. Training Protocol and Model Architecture

The framework employs the "α-Flow" training regime, which interpolates between rectified flow matching (α=1) and mean-flow self-consistency (α→0), introducing a curriculum for stabilized learning. The hybrid target velocity is

v_{t, r}^\alpha = \alpha u + (1-\alpha)\, v_\theta(z_\tau, \tau, r, e)

with \tau = \alpha r + (1-\alpha) t. The per-sample adaptive-weighted loss is

\mathcal{L}_\text{adaptive}(\theta) = \mathbb{E}_{t, r, S, B, e}\left[ w \cdot \| v_\theta(z_t, t, r, e) - v_{t, r}^\alpha \|^2 \right]

where w = \alpha / (\| \Delta \|^2 + c), \Delta = v_\theta(z_t, t, r, e) - v_{t, r}^\alpha, and c = 10^{-3}.
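A minimal single-sample sketch of this loss, with a crude stand-in for the network v_θ (all names and the noise model are illustrative, not the paper's code; in practice the target branch is treated as a stopped gradient):

```python
# Sketch of the alpha-Flow adaptive-weighted loss for one toy sample.
import numpy as np

rng = np.random.default_rng(2)
S = rng.standard_normal((4, 5))
B = rng.standard_normal((4, 5))
u = S - B                                  # instantaneous velocity target

def v_theta(z, t, r):
    # Hypothetical stand-in "network": a noisy guess at the velocity.
    return u + 0.1 * rng.standard_normal(z.shape)

t, r, alpha, c = 0.2, 1.0, 0.5, 1e-3
tau = alpha * r + (1 - alpha) * t          # interpolated time for the target
z_t = t * S + (1 - t) * B
z_tau = tau * S + (1 - tau) * B

v_target = alpha * u + (1 - alpha) * v_theta(z_tau, tau, r)  # hybrid target
delta = v_theta(z_t, t, r) - v_target
w = alpha / (np.sum(delta**2) + c)         # per-sample adaptive weight
loss = w * np.sum(delta**2)
print(loss < alpha)                        # weight bounds the loss by alpha
```

Note that the adaptive weight caps each sample's contribution at α, so large-residual samples cannot dominate early training.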

Architecturally, the backbone is a U-Net-style Diffusion Transformer (UDiT) with 16 transformer layers (hidden dim 768) and frequency-by-time inputs (512 × 500). Speaker conditioning is incorporated via cross-attention to an enrollment utterance embedding (ECAPA-TDNN), fused throughout the UDiT layers. The mixing-ratio predictor g_\phi is a small MLP acting on concatenated ECAPA embeddings of the mixture and enrollment. Optimization uses AdamW with cosine annealing and mixed precision; gradient clipping ensures numerical stability.
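A toy sketch of the mixing-ratio predictor g_φ: a small MLP over the concatenated mixture and enrollment embeddings with a sigmoid head, so the output lands in [0, 1]. Dimensions, weights, and layer count here are illustrative assumptions, not the paper's architecture:

```python
# Hypothetical mixing-ratio predictor g_phi: MLP on concatenated
# speaker embeddings, sigmoid output constrained to [0, 1].
import numpy as np

rng = np.random.default_rng(4)
emb_dim, hidden = 192, 64                     # embedding size is illustrative
W1 = rng.standard_normal((2 * emb_dim, hidden)) * 0.05
W2 = rng.standard_normal((hidden, 1)) * 0.05

def g_phi(e_mix, e_enroll):
    h = np.concatenate([e_mix, e_enroll])     # fuse the two embeddings
    h = np.tanh(h @ W1)                       # single hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)[0])) # sigmoid -> ratio in [0, 1]

lam_hat = g_phi(rng.standard_normal(emb_dim), rng.standard_normal(emb_dim))
print(0.0 < lam_hat < 1.0)                    # True by construction
```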

4. Inference, Efficiency, and Complexity

Inference proceeds as follows:

  1. The short-time Fourier transform (STFT) is applied to y, and the enrollment embedding is computed.
  2. The mixing-ratio predictor outputs \hat{\lambda}.
  3. The one-step update computes \hat{S} = Y + (1-\hat{\lambda})\, v_\theta(Y, \hat{\lambda}, 1, e).
  4. The inverse STFT reconstructs the waveform.
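The four steps above can be sketched end to end. This toy version substitutes a naive non-overlapping rectangular-window STFT, an oracle mixing ratio, and an oracle velocity for the trained components (all stand-ins, not the paper's code):

```python
# End-to-end sketch of the four inference steps on a toy waveform.
import numpy as np

def stft(x, n_fft=64):
    frames = x.reshape(-1, n_fft)          # non-overlapping frames
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=64):
    return np.fft.irfft(X, n=n_fft, axis=1).reshape(-1)

rng = np.random.default_rng(3)
n = 64 * 8
s = rng.standard_normal(n)                 # toy target waveform
b = rng.standard_normal(n)                 # toy background waveform
lam = 0.7
y = lam * s + (1 - lam) * b                # observed mixture

# 1. STFT of the mixture (enrollment embedding omitted in this toy).
Y = stft(y)
# 2. Mixing-ratio "predictor": oracle value stands in for g_phi.
lam_hat = lam
# 3. One-step update, with the oracle velocity v = S - B standing in
#    for the trained network v_theta.
v = stft(s) - stft(b)
S_hat = Y + (1 - lam_hat) * v
# 4. Inverse STFT reconstructs the waveform.
s_hat = istft(S_hat)
print(np.allclose(s_hat, s))               # True in this oracle setting
```

Because the STFT is linear, the spectral mixture satisfies Y = λS + (1-λ)B exactly, and the single update lands on the target.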

The framework requires only NFE=1 (one network evaluation per utterance), yielding real-time factor (RTF) ≈0.018 for 3 s audio on an NVIDIA L40 GPU. The model size is ≈359M parameters, with peak GPU memory ≈1.5 GB. Compared to diffusion and flow-matching baselines (e.g., NFE≥50, RTF≈0.75), computational overhead is negligible at similar or higher quality levels.

5. Empirical Performance and Ablation Studies

Evaluation on Libri2Mix employs intrusive metrics (SI-SDR, PESQ, ESTOI), non-intrusive measures (DNSMOS, OVRL), and speaker similarity (cosine-SIM).

Performance Comparison

Model          NFE   SI-SDR (dB)   PESQ   ESTOI
AD-FlowTSE      1      17.49       2.89   0.90
MeanFlow-TSE    1      18.80       3.26   0.93

MeanFlow-TSE achieves a +1.31 dB SI-SDR gain and similarly leads on perceptual metrics, both in clean and noisy settings.

Ablation Results

  • SI-SDR and PESQ peak at NFE=1; extra steps add only discretization error.
  • Removing the α curriculum (fixing α=1) reduces SI-SDR by ~0.7 dB—curriculum is necessary for stability.
  • The predicted mixing ratio \hat{\lambda} approaches oracle performance, with a <0.2 dB deficit.

6. Relationship to Other MeanFlow and Flow-Matching Methods

MeanFlow-TSE applies the central mean-flow principle: learning the average (not instantaneous) velocity of flow trajectories. This principle aligns with recent advances in one-step generative modeling for both image and audio domains. Comparable frameworks in speech enhancement (MeanFlowSE, MeanSE) show analogous efficiency–quality tradeoffs, requiring only a single network evaluation and yielding strong performance versus ODE/diffusion-based models (Li et al., 18 Sep 2025, Wang et al., 25 Sep 2025). In MeanFlow-TSE, the mixing-ratio-driven trajectory and curriculum-based training are specifically tailored to the TSE setting, directly mapping mixtures to clean target speech in a single pass.

7. Real-Time Applicability and Future Directions

MeanFlow-TSE is state-of-the-art in test-set SI-SDR, PESQ, ESTOI, and real-time factor among generative TSE frameworks. Its design enables deployment scenarios including streaming, hearing aids, and edge devices, due to minimal forward-pass latency and memory requirements. Future research aims to:

  • Extend the method to multi-channel and reverberant conditions (e.g., by conditioning flows on beamforming features).
  • Integrate metric-based fine-tuning, such as direct SI-SDR optimization.
  • Develop lighter-weight model variants for cost-constrained environments.

MeanFlow-TSE thus represents a substantial advance in efficient, high-fidelity, and practical generative target speaker extraction (Shimizu et al., 21 Dec 2025).
