
MeanFlow-TSE: One-Step Target Speaker Extraction

Updated 24 December 2025
  • The paper introduces MeanFlow-TSE, a one-step generative TSE framework that efficiently extracts target speech using a single network evaluation and achieves state-of-the-art perceptual metrics.
  • It employs a mixing-ratio-driven linear trajectory in the spectral domain and a neural mean-flow map to directly map mixtures to clean speech, yielding a +1.31 dB SI-SDR gain on Libri2Mix.
  • MeanFlow-TSE’s curriculum-based training and efficient architecture enable practical real-time deployment in streaming and edge applications with minimal computational overhead.

MeanFlow-TSE is a one-step generative target speaker extraction (TSE) framework based on mean-flow objectives. It is designed to efficiently and accurately extract a desired speaker’s voice from a multi-speaker mixture, while circumventing the computational burdens typical of diffusion and flow-matching approaches that require many iterative function evaluations. MeanFlow-TSE leverages a mixing-ratio-driven linear trajectory between background and target sources in the spectral domain and learns a neural mean-flow map, enabling direct, single-pass extraction of high-quality target speech. The system achieves real-time performance with competitive, state-of-the-art perceptual and fidelity metrics, as demonstrated on the Libri2Mix benchmark (Shimizu et al., 21 Dec 2025).

1. Problem Formulation and Background

In the TSE task, the observed signal is a mixture

$$y(t) = s(t) + b(t)$$

where $s(t)$ is the target speaker and $b(t)$ includes background and interfering speakers. Traditional discriminative approaches (e.g., Conv-TasNet with speaker embedding, SepFormer) estimate a mask or mapping $f_\theta(y, e) \to \hat{s}$, conditioned on an enrollment embedding $e$, by minimizing a waveform loss (e.g., negative SI-SNR or $\| \hat{s} - s \|^2$). While fast, these can introduce artifacts and generalize poorly. Generative paradigms, including diffusion and flow-matching models, learn a conditional density $p(s \mid y, e)$ but require multi-step sampling, typically ≥10 network function evaluations (NFE), which limits deployment in low-latency or real-time scenarios.

MeanFlow-TSE extends the "AD-FlowTSE" paradigm, which is anchored in modeling flows between background and target in the STFT (spectral) domain. It introduces a mixing ratio $\lambda \in [0,1]$ representing the balance between target and background:

$$Y = \lambda S + (1-\lambda) B,$$

where $Y = \mathrm{STFT}(y)$, $S = \mathrm{STFT}(s)$, and $B = \mathrm{STFT}(b)$.

2. One-Step Mean-Flow Objective

The method parameterizes the extraction path as a convex linear interpolation in the spectral space:

$$X_t = (1-t)\, B + t\, S,$$

where $t \in [0,1]$ governs the transition from background to the target, so that the mixture sits at $t = \lambda$ ($X_\lambda = Y$). The instantaneous velocity is $v = \frac{dX_t}{dt} = S - B$, constant with respect to $t$, reflecting the straight-line nature of the mixing path.
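As a worked check, assume the path is parameterized as $X_t = (1-t)B + tS$. This form is chosen here because it agrees with $Y = \lambda S + (1-\lambda)B$ at $t = \lambda$; the symbol $X_t$ is an assumption of this sketch. The constant velocity and the displacement from mixture to target then follow directly:

```latex
\begin{align*}
X_t &= (1-t)\,B + t\,S, \qquad X_\lambda = Y, \quad X_1 = S, \\
\frac{dX_t}{dt} &= S - B, \\
S - Y &= \int_\lambda^1 (S - B)\, d\tau = (1-\lambda)\,(S - B).
\end{align*}
```

A single step of length $1-\lambda$ along the constant direction $S - B$ therefore carries the mixture exactly onto the target, which is why no iterative integration is needed on this path.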

Instead of numerically integrating instantaneous velocities, MeanFlow-TSE learns the average velocity over the interval $[r, t]$:

$$u(X_r, r, t) = \frac{1}{t - r} \int_r^t v(X_\tau, \tau)\, d\tau.$$

For the TSE problem, at inference, $r = \hat{\lambda}$ (estimated mixing ratio), $t = 1$, and $e$ is the enrollment (reference speaker) embedding. The predicted target spectrogram is

$$\hat{S} = Y + (1 - \hat{\lambda})\, u_\theta(Y, \hat{\lambda}, 1, e),$$

with $\hat{\lambda}$ output by a learned mixing ratio predictor. This single-step update realizes direct source extraction without iterative refinement.
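The single-step update can be illustrated with a minimal NumPy sketch. It assumes the update takes the form "mixture plus (1 − λ̂) times the predicted mean velocity", which is what the mixing model $Y = \lambda S + (1-\lambda)B$ implies on a straight path. The function `one_step_extract` and the `oracle` stand-in are hypothetical names for illustration; the oracle returns the true average velocity $S - B$ in place of a trained network.

```python
import numpy as np


def one_step_extract(Y, lam_hat, mean_flow):
    """Jump from the mixture Y (located at t = lam_hat) to t = 1 in one evaluation."""
    return Y + (1.0 - lam_hat) * mean_flow(Y, lam_hat, 1.0)


rng = np.random.default_rng(0)
shape = (4, 6)  # toy (frequency, frame) grid, not the real model resolution

# Toy complex "spectrograms" for target S and background B
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
B = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

lam = 0.3
Y = lam * S + (1.0 - lam) * B  # mixture on the straight path between B and S


def oracle(X, r, t):
    # Assumption: stand-in for the trained network. On a straight path the
    # average velocity over any interval equals the constant velocity S - B.
    return S - B


S_hat = one_step_extract(Y, lam, oracle)
max_err = float(np.max(np.abs(S_hat - S)))
print(f"max reconstruction error: {max_err:.2e}")
```

With an exact mean velocity and an exact mixing ratio the target is recovered up to floating-point error; in practice the quality depends on how well the network and the ratio predictor approximate these two quantities.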

3. Training Protocol and Model Architecture

The framework employs the "α-Flow" training regime, which interpolates between rectified flow matching (α = 1) and mean-flow self-consistency (α → 0), introducing a curriculum for stabilized learning. The hybrid target velocity is

[equation: hybrid target velocity; not recoverable from the source]

The per-sample adaptive-weighted loss is

[equation: per-sample adaptive-weighted loss and its symbol definitions; not recoverable from the source]

Architecturally, the backbone is a U-Net-style Diffusion Transformer (UDiT) with 16 transformer layers (hidden dim 768) operating on frequency-by-length inputs (512 × 500). Speaker conditioning is incorporated via cross-attention to an enrollment utterance embedding (ECAPA-TDNN), fused throughout the UDiT. The mixing ratio predictor is a small MLP acting on concatenated ECAPA embeddings of the mixture and enrollment. Optimization uses AdamW with cosine annealing and mixed precision; gradient clipping ensures numerical stability.
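To make the mixing ratio predictor concrete, here is a minimal NumPy sketch of a small MLP over concatenated embeddings. Everything beyond "small MLP on concatenated mixture and enrollment embeddings" is an assumption: the embedding size (192), hidden width (256), ReLU activation, sigmoid output, and the names `init_mlp` and `predict_mixing_ratio` are illustrative, and the weights are random rather than trained.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim):
    """Randomly initialised two-layer MLP parameters (untrained, illustration only)."""
    return {
        "W1": rng.standard_normal((in_dim, hidden_dim)) / np.sqrt(in_dim),
        "b1": np.zeros(hidden_dim),
        "W2": rng.standard_normal((hidden_dim, 1)) / np.sqrt(hidden_dim),
        "b2": np.zeros(1),
    }


def predict_mixing_ratio(params, mix_emb, enroll_emb):
    """Map concatenated (mixture, enrollment) embeddings to a ratio in [0, 1]."""
    x = np.concatenate([mix_emb, enroll_emb], axis=-1)
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU hidden layer
    logit = (h @ params["W2"] + params["b2"])[..., 0]
    # Assumption: a sigmoid keeps the output in the valid range of lambda.
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
emb_dim = 192  # assumed embedding size, not stated in the text
params = init_mlp(rng, in_dim=2 * emb_dim, hidden_dim=256)

# Stand-ins for speaker embeddings of a batch of 8 mixture/enrollment pairs
mix_emb = rng.standard_normal((8, emb_dim))
enroll_emb = rng.standard_normal((8, emb_dim))

lam_hat = predict_mixing_ratio(params, mix_emb, enroll_emb)
print(lam_hat.shape, float(lam_hat.min()), float(lam_hat.max()))
```

Bounding the output with a sigmoid is one simple way to guarantee a valid $\lambda \in [0,1]$; the original system may constrain or parameterize the ratio differently.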

4. Inference, Efficiency, and Complexity

Inference proceeds as follows:

  1. The short-time Fourier transform (STFT) is applied to the mixture $y$, and the enrollment embedding $e$ is computed.
  2. The mixing ratio predictor outputs $\hat{\lambda}$.
  3. The one-step update computes the target spectrogram estimate $\hat{S}$ from $Y$ and $\hat{\lambda}$ in a single network evaluation.
  4. Inverse STFT reconstructs the waveform.
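The four steps above can be sketched end to end in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the STFT settings (256-point sqrt-Hann window, 50% overlap) are arbitrary, the update is assumed to be $\hat{S} = Y + (1-\hat{\lambda})\,u$, and `ratio_predictor` and `mean_flow` are hypothetical callables that are replaced here by oracles so the result can be checked.

```python
import numpy as np

N_FFT, HOP = 256, 128  # assumed sizes; 50% overlap
# Periodic sqrt-Hann window: analysis * synthesis overlap-adds to 1 at 50% overlap.
WIN = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N_FFT) / N_FFT))


def stft(x):
    """Framewise real FFT; pads so every original sample is covered by two frames."""
    x = np.pad(x, (HOP, HOP + (-len(x)) % HOP))
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WIN for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # shape: (frames, bins)


def istft(X, length):
    """Windowed overlap-add inverse of stft(), trimmed to the original length."""
    frames = np.fft.irfft(X, n=N_FFT, axis=-1) * WIN
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
    return out[HOP:HOP + length]


def extract_target(y, enroll_emb, ratio_predictor, mean_flow):
    Y = stft(y)                                   # step 1 (embedding passed in)
    lam_hat = ratio_predictor(y, enroll_emb)      # step 2
    u = mean_flow(Y, lam_hat, 1.0, enroll_emb)    # the single network evaluation
    S_hat = Y + (1.0 - lam_hat) * u               # step 3
    return istft(S_hat, len(y))                   # step 4


rng = np.random.default_rng(0)
s = rng.standard_normal(2000)  # toy "target" waveform
b = rng.standard_normal(2000)  # toy "background" waveform
lam = 0.4
y = lam * s + (1.0 - lam) * b  # STFT is linear, so Y = lam*S + (1-lam)*B

S, B = stft(s), stft(b)
oracle_ratio = lambda wav, e: lam          # stand-in for the ratio predictor
oracle_flow = lambda X, r, t, e: S - B     # stand-in for the trained network

s_hat = extract_target(y, None, oracle_ratio, oracle_flow)
err = float(np.max(np.abs(s_hat - s)))
print(f"max waveform error with oracles: {err:.2e}")
```

The oracles make the sketch verifiable: with the exact ratio and mean velocity the target waveform is recovered to floating-point precision, so any residual error in a real system comes from the learned components rather than from the pipeline itself.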

The framework requires only NFE = 1 (one network evaluation per utterance), yielding a real-time factor (RTF) of ≈0.018 for 3 s audio on an NVIDIA L40 GPU. The model has ≈359M parameters, with peak GPU memory of ≈1.5 GB. Compared to diffusion and flow-matching baselines (e.g., NFE ≥ 50, RTF ≈ 0.75), this is roughly a 40-fold reduction in RTF at similar or higher quality levels.
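For intuition, the quoted real-time factors translate into wall-clock time as follows. The snippet uses the standard definition RTF = processing time / audio duration and assumes the baseline RTF was measured under comparable conditions, which the text does not state explicitly.

```python
def processing_time(rtf, audio_seconds):
    """RTF = processing time / audio duration, so time = RTF * duration."""
    return rtf * audio_seconds


meanflow_rtf = 0.018  # quoted for MeanFlow-TSE (NFE = 1)
baseline_rtf = 0.75   # quoted for multi-step baselines (NFE >= 50)
clip_seconds = 3.0

meanflow_ms = 1000.0 * processing_time(meanflow_rtf, clip_seconds)
baseline_ms = 1000.0 * processing_time(baseline_rtf, clip_seconds)
speedup = baseline_rtf / meanflow_rtf

print(f"MeanFlow-TSE: {meanflow_ms:.0f} ms per 3 s clip")
print(f"Baseline:     {baseline_ms:.0f} ms per 3 s clip")
print(f"Speed-up:     {speedup:.1f}x")
```

That is about 54 ms versus 2250 ms per 3 s clip, a speed-up of roughly 42×.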

5. Empirical Performance and Ablation Studies

Evaluation on Libri2Mix employs intrusive metrics (SI-SDR, PESQ, ESTOI), non-intrusive measures (DNSMOS, OVRL), and speaker similarity (cosine-SIM).
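Since SI-SDR is the headline metric, a minimal NumPy sketch of its standard definition may help: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name `si_sdr`, the `eps` stabilizer, and the toy signals are illustrative choices; the evaluation code used in the paper may differ in details.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)               # toy reference signal
est = ref + 0.1 * rng.standard_normal(16000)   # reference plus 10% amplitude noise

score = si_sdr(est, ref)
score_scaled = si_sdr(3.0 * est, ref)  # rescaling the estimate should not matter
print(f"SI-SDR: {score:.2f} dB, after rescaling: {score_scaled:.2f} dB")
```

Noise at 10% of the signal amplitude corresponds to about 20 dB, and the score is unchanged when the estimate is rescaled, which is the "scale-invariant" property.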

Performance Comparison

Model          NFE   SI-SDR (dB)   PESQ   ESTOI
AD-FlowTSE     1     17.49         2.89   0.90
MeanFlow-TSE   1     18.80         3.26   0.93

MeanFlow-TSE achieves a +1.31 dB SI-SDR gain and similarly leads on perceptual metrics, both in clean and noisy settings.

Ablation Results

  • SI-SDR and PESQ peak at NFE=1; extra steps add only discretization error.
  • Removing the α curriculum (fixing α = 1) reduces SI-SDR by ~0.7 dB, indicating that the curriculum is necessary for stability.
  • The predicted mixing ratio $\hat{\lambda}$ approaches oracle performance, with a deficit of <0.2 dB.

6. Relationship to Other MeanFlow and Flow-Matching Methods

MeanFlow-TSE applies the central mean-flow principle: learning the average (not instantaneous) velocity of flow trajectories. This principle aligns with recent advances in one-step generative modeling for both image and audio domains. Comparable frameworks in speech enhancement (MeanFlowSE, MeanSE) show analogous efficiency–quality tradeoffs, requiring only a single network evaluation and yielding strong performance versus ODE/diffusion-based models (Li et al., 18 Sep 2025, Wang et al., 25 Sep 2025). In MeanFlow-TSE, the mixing-ratio-driven trajectory and curriculum-based training are specifically tailored to the TSE setting, directly mapping mixtures to clean target speech in a single pass.

7. Real-Time Applicability and Future Directions

MeanFlow-TSE is state-of-the-art in test-set SI-SDR, PESQ, ESTOI, and real-time factor among generative TSE frameworks. Its design enables deployment scenarios including streaming, hearing aids, and edge devices, due to minimal forward-pass latency and memory requirements. Future research aims to:

  • Extend the method to multi-channel and reverberant conditions (e.g. by conditioning flows on beamforming features).
  • Integrate metric-based fine-tuning, such as direct SI-SDR optimization.
  • Develop lighter-weight model variants for cost-constrained environments.

MeanFlow-TSE thus represents a substantial advance in efficient, high-fidelity, and practical generative target speaker extraction (Shimizu et al., 21 Dec 2025).
