
Flow-Matching Speech Enhancement

Updated 20 February 2026
  • Flow-matching speech enhancement is a generative method that employs ODEs and learned velocity fields to transport degraded speech to clean audio, often in a single computational step.
  • It combines conditional normalizing flows with transport-theoretic ideas to optimize latent-space trajectories, significantly reducing inference time while maintaining high perceptual quality.
  • Applications include audio-only, audio-visual, semantic-guided, and real-time enhancement, with empirical gains over traditional diffusion-based methods.

Flow-matching speech enhancement refers to a family of generative methods leveraging ordinary differential equations (ODEs) and learned velocity fields to transport degraded speech (audio or features) to clean speech via conditional paths in latent space. Building on conditional normalizing flows and transport-theoretic ideas, flow matching (FM) is distinguished by its ability to produce high-fidelity enhanced speech in very few computational steps—often just one network evaluation—markedly reducing the computational cost typical of diffusion-based or score-based generative approaches. The rapid evolution of the field in 2024–2026 has produced both general-purpose and highly specialized models for acoustic, audio-visual, semantic-guided, and ASR-robust enhancement, achieving broad empirical superiority over prior baselines in both speed and perceptual quality.

1. Mathematical Foundations of Flow-Matching Speech Enhancement

Flow-matching speech enhancement is grounded in the formulation of continuous-time normalizing flows (CNFs) and, more specifically, conditional flow matching (CFM) between paired endpoints in data space. The key technical principle is to define a time-parametrized distributional path $p_t(x)$ connecting a tractable source (e.g., noise or a model prediction) to the target clean speech distribution $q(x)$, and to learn a velocity field $v_\theta(x, t, c)$ (with possible conditioning $c$, e.g. noisy audio, visual context, etc.) such that the ordinary differential equation

$$\frac{dx_t}{dt} = v_\theta(x_t, t, c)$$

transports samples along the path from degraded to clean speech (Lee et al., 9 Aug 2025, Wang et al., 26 May 2025, Jung et al., 2024). For speech restoration, the distributional path is typically chosen as a (conditional) Gaussian bridge, with mean and variance functions that may be linear or non-linear in t, but straight-line, constant-variance paths have been empirically established as providing optimal trade-offs between training tractability, generalization, and inference efficiency (Cross et al., 28 Aug 2025).

Conditional flow matching exploits the closed-form analytical velocity field for simple Gaussian conditional paths. In the most common (optimal-transport) setting, for a linear mean path $\mu_t(x_0, x_1) = (1-t)x_0 + t x_1$ and constant variance, the optimal drift is $u_t(x \mid x_0, x_1) = x_1 - x_0$, reducing the learning problem to direct regression onto a known target velocity field (Cross et al., 28 Aug 2025). For multi-modal or more complex settings, average or composed velocity fields have been introduced (Yang et al., 19 Sep 2025, Wang et al., 25 Sep 2025) to enable robust one-step inference.
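The regression objective above can be made concrete with a short sketch. This is an illustrative, minimal version of the CFM training step for the linear, constant-variance path; `model` stands in for any velocity network $v_\theta(x_t, t, c)$ and is a placeholder, not a specific published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_interpolant(x0, x1, t):
    """Straight-line (optimal-transport) conditional path: mu_t = (1-t)*x0 + t*x1."""
    return (1.0 - t) * x0 + t * x1

def cfm_target_velocity(x0, x1):
    """Closed-form drift for the linear, constant-variance path: u_t = x1 - x0."""
    return x1 - x0

def cfm_training_loss(model, x0, x1, cond):
    """One conditional-flow-matching regression step (illustrative)."""
    t = rng.uniform(size=(x0.shape[0], 1))   # sample time uniformly in [0, 1]
    x_t = linear_interpolant(x0, x1, t)      # point on the conditional path
    u_t = cfm_target_velocity(x0, x1)        # known target velocity
    v_pred = model(x_t, t, cond)
    return np.mean((v_pred - u_t) ** 2)      # simple MSE regression onto the drift

# Toy check: an "oracle" model that already outputs the true drift has zero loss.
x0 = rng.normal(size=(4, 8))   # e.g. degraded-speech features (source endpoint)
x1 = rng.normal(size=(4, 8))   # clean-speech features (target endpoint)
oracle = lambda x_t, t, cond: x1 - x0
loss = cfm_training_loss(oracle, x0, x1, cond=x0)
```

Because the target drift is independent of $t$ for straight paths, the loss reduces to plain regression, which is what makes training tractable.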

2. Model Architectures, Conditioning, and Network Backbones

A broad range of architectures have been deployed for FM-based speech enhancement, tailored to the conditioning modalities, computational requirements, and target deployment scenarios:

  • Audio-only enhancement: Models such as FlowSE and COSE are typically instantiated as lightweight or full-scale U-Net (NCSN++ or NCSN++M) backbones operating on magnitude or complex STFT features. Conditioning on the noisy input is implemented via concatenation, FiLM, or cross-attention (Lee et al., 9 Aug 2025, Wang et al., 26 May 2025, Yang et al., 19 Sep 2025).
  • Audio-visual enhancement: Systems like FlowAVSE use a two-stage architecture comprising a dedicated visual encoder (e.g., 3D-CNNs + ResNet-18) to extract visual embeddings from face videos, and two parallel light U-Nets (predictor and flow-matching refiner) fused by cross-attention at multiple scales (Jung et al., 2024).
  • Text/semantic-aware and prompt-guided models: SenSE introduces a semantic-aware speech LLM (SASLM) to extract semantic tokens from degraded audio, combining these with masked acoustic features and optional speaker prompts via ConvNeXt-V2 and U-shaped Transformer blocks (DiT). Semantic tokens are injected through channel-wise concatenation and FiLM/cross-attention conditioning throughout the flow model (Li et al., 29 Sep 2025).
  • Transformer-based architectures: Large-scale models such as VoiceRestore and SpeechFlow employ deep Transformers for direct ODE velocity prediction, often pre-trained on massive unpaired datasets with synthetic degradations (Kirdey, 1 Jan 2025, Liu et al., 2023).
  • Latent-level enhancement: FM-Refiner operates in the latent space of a frozen ASR encoder (CTC-based), using a U-Net with time-injected residual blocks to refine noisy latents prior to decoding (Yang et al., 8 Jan 2026).
  • Causal/real-time deployment: Flow-matching has also been adapted for real-time, low-latency streaming by enforcing causal architectures (e.g., causal NCSN++ U-Nets, ConvGLU-UNet with stride-1 processing in time), achieving sub-20 ms algorithmic latency (Hsieh et al., 19 Oct 2025).
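Several of the architectures above inject the conditioning signal through FiLM (feature-wise linear modulation). The following is a minimal numpy sketch of the mechanism, not any specific model's implementation: the conditioning embedding is projected to a per-channel scale and shift applied to hidden features. The projection weights here are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

class FiLM:
    """Feature-wise Linear Modulation: a conditioning vector c is mapped to a
    per-channel scale (gamma) and shift (beta) applied to hidden features h.
    Weights are random placeholders standing in for a learned projection."""
    def __init__(self, cond_dim, num_channels):
        self.w_gamma = rng.normal(scale=0.1, size=(cond_dim, num_channels))
        self.w_beta = rng.normal(scale=0.1, size=(cond_dim, num_channels))

    def __call__(self, h, c):
        gamma = 1.0 + c @ self.w_gamma   # scale, initialized near identity
        beta = c @ self.w_beta           # shift
        # broadcast (batch, channels) modulation over the time axis
        return gamma[:, :, None] * h + beta[:, :, None]

# h: (batch, channels, frames) hidden features; c: (batch, cond_dim) embedding,
# e.g. derived from the noisy input or a visual/semantic encoder
h = rng.normal(size=(2, 16, 50))
c = rng.normal(size=(2, 32))
film = FiLM(cond_dim=32, num_channels=16)
out = film(h, c)
```

A zero conditioning vector leaves the features unchanged, which is why FiLM layers are typically initialized near the identity.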

3. Inference Efficiency: Single-Step, Few-Step, and Average-Velocity Flows

A defining operational advantage of flow-matching speech enhancement is the capability for high-quality single- or few-step inference:

  • Single-step (Direct Data Prediction, DDP): For strictly linear paths with constant drift and variance (straight paths), the ODE integral can be solved in one step, and enhancement reduces to $\hat{x} = y + F_\theta(y, y, 1)$, obviating multi-step trajectory simulation (Cross et al., 28 Aug 2025, Jung et al., 2024).
  • Average-velocity flows and self-consistency: MeanSE and COSE train networks to predict the average velocity over an interval, making the one-Euler-step update $\hat{x}_0 = x_1 - u_\theta(x_1, 0, 1, y)$ theoretically sound and empirically robust, while velocity composition identities (i.e., the affine interpolation semigroup) guarantee self-consistent one-shot flows (Yang et al., 19 Sep 2025, Wang et al., 25 Sep 2025).
  • Few-step trade-offs and stability: For low-SNR or strongly mismatched conditions, multi-step ODE solvers (e.g., 2–8 Euler steps) allow recovery of samples closer to the clean manifold while maintaining a tight computational bound (cf. NFE=5 in FlowSE and CTFSE) (Lee et al., 9 Aug 2025).
  • Empirical performance: On VoiceBank-DEMAND and related benchmarks, single-step DDP and one-step COSE/MeanSE match or exceed 60-step diffusion methods in PESQ (up to 3.02), ESTOI (up to 0.88), and SI-SDR (up to 20.4 dB), while inference time is reduced by more than an order of magnitude (Cross et al., 28 Aug 2025, Yang et al., 19 Sep 2025, Jung et al., 2024).
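The single- and few-step regimes above share the same solver skeleton. A minimal sketch of Euler integration for the enhancement ODE follows; `v_field` is a stand-in for any trained velocity network, and the oracle straight-path field used for the check is purely illustrative.

```python
import numpy as np

def euler_enhance(v_field, y, num_steps):
    """Integrate dx/dt = v(x, t, y) from t=0 to t=1 with `num_steps` Euler
    steps, starting from the degraded input y (generic solver sketch)."""
    x = y.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * v_field(x, t, y)
    return x

# For a straight, time-independent field v = x_clean - y (constant drift),
# a single Euler step already lands exactly on the target.
rng = np.random.default_rng(0)
y = rng.normal(size=(1, 257))           # e.g. one noisy spectrogram frame
x_clean = rng.normal(size=(1, 257))
v = lambda x, t, cond: x_clean - cond   # oracle straight-path velocity
one_step = euler_enhance(v, y, num_steps=1)
five_step = euler_enhance(v, y, num_steps=5)
```

With a learned, imperfect field the drift is no longer exactly constant, which is where the 2–8-step regime recovers accuracy at modest extra cost.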

4. Conditioning Paradigms: Multimodal, Semantic, Text, and Latent

Conditioning is central to the performance and generalization properties of flow-matching speech enhancement. The following modalities are prominent:

  • Audio-visual: Visual embeddings enhance robustness in high-noise or challenging scenarios, with FlowAVSE demonstrating effective use of cross-modal fusion and joint optimization for predictor and flow stages (Jung et al., 2024).
  • Textual/semantic guidance: SpeechFlow and SenSE incorporate text or semantic representations. SpeechFlow uses masked conditioning, allowing for classifier-free or text-guided enhancement, crucial for missing or ambiguous speech (Liu et al., 2023, Wang et al., 26 May 2025, Li et al., 29 Sep 2025).
  • Task-agnostic/Front-end refinement: SpeechRefiner is designed as a domain-agnostic post-processor, using CFM to refine outputs of arbitrary front-end enhancement, dereverberation, or separation pipelines by conditioning on the preprocessed spectral features (Li et al., 16 Jun 2025).
  • Latent-level operations: FM-Refiner bypasses input waveform enhancement and operates directly on the output of the ASR encoder, yielding reductions in word error rate even when upstream SE is already applied (Yang et al., 8 Jan 2026).

5. Empirical Results: Benchmarks, Metrics, and Comparative Analyses

Empirical evaluation across a spectrum of standard datasets and metrics consistently demonstrates the efficiency and effectiveness of FM-based SE:

| Method | Steps (NFE) | PESQ | ESTOI | SI-SDR (dB) | RTF | Notable Attributes |
|---|---|---|---|---|---|---|
| FlowAVSE | 1 | 2.10 | 0.796 | 13.37 | 0.085 | AV, 22× faster than diffusion |
| COSE | 1 | 3.02 | 0.87 | 19.3 | — | One-step, JVP-free average-velocity field |
| MeanSE | 1 | 2.09 | 0.80 | — | — | Stable OOD generalization |
| FlowSE | 5 | 3.03 | 0.93 | 18.71 | — | Matches/exceeds diffusion at 60 steps |
| FM-Refiner | 3 | — | — | — | — | Latent-level, reduces ASR WER by ~10% |
| CTFSE | 6 | 3.12 | 0.94 | 19.37 | — | Cascaded two-flow, shared weights |
| DDP (ICFM) | 1 | 3.00 | 0.88 | 20.4 | — | Empirically best single-step, straight path |

Enhanced models, including SenSE and SpeechFlow, further improve upon these figures when semantic/textual information can be incorporated (Li et al., 29 Sep 2025, Liu et al., 2023).

6. Theoretical Considerations and Design Insights

Research across multiple studies demonstrates that straight, time-independent probability paths provide superior empirical performance and efficiency compared to curved or time-dependent paths used in, e.g., Schrödinger bridge SE. Ablation shows that constant variance schedules contribute more to quality gains than linearization of the drift; the theoretical basis for single-step inference lies in the linearity and deterministic nature of the optimal-transport path and its associated ODE (Cross et al., 28 Aug 2025). Extension to average-velocity or composition-consistent networks enables robust one-step generative flows (Yang et al., 19 Sep 2025, Wang et al., 25 Sep 2025).
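The single-step claim for straight paths admits a one-line derivation, which makes the design insight precise:

```latex
x_t = (1-t)\,x_0 + t\,x_1
\quad\Rightarrow\quad
\frac{dx_t}{dt} = x_1 - x_0 \quad (\text{constant in } t),
```

so a single Euler step of size $1$ from $x_0$ gives $x_0 + (x_1 - x_0) = x_1$ exactly; the discretization error of the solver vanishes because the true trajectory is affine in $t$. Curved or time-dependent paths (e.g., variance-exploding schedules) have non-constant drift, so low step counts incur integration error in addition to network approximation error.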

Stable training is facilitated by preconditioning (EDM-style scaling), curriculum on interval length, and flow-field mix-up. Conditioning paradigms benefit from classifier-free guidance (for text/audio-free and guided operation), semantic token integration, and prompt-based speaker guidance in complex scenarios (Wang et al., 26 May 2025, Li et al., 29 Sep 2025).
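Classifier-free guidance for flow models, mentioned above, amounts to blending conditional and unconditional velocity predictions at inference time. The sketch below shows the standard blending formula in isolation; the guidance weight semantics ($w=1$ recovering the conditional field) follow the usual convention and are illustrative rather than specific to any cited system.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance for a flow model: interpolate/extrapolate
    between unconditional and conditional velocity predictions.
    w=0 gives the unconditional field, w=1 the conditional field,
    w>1 extrapolates toward stronger conditioning."""
    return v_uncond + w * (v_cond - v_uncond)

# Toy velocity predictions for one state x_t
v_c = np.array([1.0, 2.0])   # conditioned on the noisy input / prompt
v_u = np.array([0.0, 0.0])   # conditioning dropped (as during CFG training)
guided = cfg_velocity(v_c, v_u, w=2.0)
```

Training with randomly dropped conditioning is what makes both predictions available from a single network at inference time.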

7. Limitations, Extensions, and Future Directions

While flow-matching SE demonstrates state-of-the-art efficiency and quality across modalities, several limitations remain, and extensions addressing them are the subject of ongoing research.

Collectively, flow-matching speech enhancement provides a principled, computationally efficient, and empirically validated generative approach to restoring high-quality speech, and underpins a growing body of research on end-to-end speech restoration and robust speech modeling (Wang et al., 26 May 2025, Cross et al., 28 Aug 2025, Jung et al., 2024, Kirdey, 1 Jan 2025, Yang et al., 19 Sep 2025, Liu et al., 2023, Yang et al., 11 Dec 2025, Li et al., 29 Sep 2025, Yang et al., 8 Jan 2026, Li et al., 16 Jun 2025, Hsieh et al., 19 Oct 2025, Wang et al., 23 Jan 2026).
