Generative Speech Enhancement
- Generative Speech Enhancement is a class of data-driven methods that leverage advanced generative models to restore high-fidelity speech from degraded signals.
- It utilizes architectures such as language models, normalizing flows, and diffusion models to re-synthesize missing or corrupted speech components.
- Ongoing research focuses on mitigating hallucinations, ensuring speaker consistency, and improving computational efficiency for real-world applications.
Generative Speech Enhancement (GSE) encompasses a class of data-driven methodologies that recover high-fidelity clean speech from degraded, noisy, or otherwise corrupted signals by learning the underlying distribution of clean speech and modeling the conditional mapping from noisy observations back to that distribution. Unlike classical discriminative or regression-based enhancement methods, GSE leverages expressive generative models, such as large language models (LLMs), normalizing flows, diffusion processes, and GANs, which parameterize rich priors over the speech manifold and can synthesize plausible outputs even when large portions of the input are masked or missing. As of 2026, GSE has demonstrated state-of-the-art performance on a variety of real-world signal distortions, including additive noise, reverberation, clipping, frequency loss, and content dropouts. The domain is evolving rapidly, with ongoing research tackling hallucinations, speaker consistency, generalization, and computational efficiency.
1. Core Architectures and Generative Modeling Paradigms
GSE systems are constructed on a range of advanced generative modeling frameworks:
- Discrete Token-based LLMs: Models such as Genhancer and GenSE treat enhancement as a conditional sequence generation problem, where the noisy waveform is mapped via a feature extractor (typically a neural audio codec, e.g., DAC) and a tokenizer to RVQ codebooks. The generative model is typically an autoregressive Transformer or LM that learns over discrete token sequences, often with hierarchical codebooks to balance perceptual fidelity and timbre consistency (Yamauchi et al., 18 Jan 2026, Yao et al., 5 Feb 2025).
- Normalizing Flows and GAN Hybrids: SEFGAN employs invertible flow-based mappings trained under maximum likelihood and adversarial objectives. Conditioning networks (e.g., condNet) and multi-scale discriminators enhance realism and promote consistency, allowing tractable density estimation and efficient inference. Flow-based approaches such as MeanSE and MeanFlowSE propose direct prediction of interval-averaged velocity fields, enabling one-step refinement and substantial reductions in latency (Strauss et al., 2023, Wang et al., 25 Sep 2025, Zhu et al., 27 Sep 2025).
- Diffusion and Schrödinger Bridge Models: Latent diffusion transformers (DiTSE) and score-based models implement multi-step stochastic or ODE-based denoising over low-dimensional latent spaces or STFT domains. Schrödinger Bridge methods formalize enhancement as optimal distribution transport, circumventing prior mismatch and enabling few-step or even one-step inference through adversarial GAN integration (SB-UFOGen) (Guimarães et al., 13 Apr 2025, Jukić et al., 2024, Han et al., 2 Jun 2025).
- Embedding-based and Codec-Driven Models: Efficient GSE pipelines leverage pre-trained generative audio encoders (e.g., Dasheng, WavLM) to extract robust representations, followed by lightweight denoising encoders and differentiable vocoders trained via reconstruction and adversarial losses (Sun et al., 13 Jun 2025).
- GAN Architectures with Domain-Specific Priors: Time-domain and TF-domain GANs integrate specialized generator architectures (complex convolutional recurrent networks, two-stage Conformers) and metric-driven discriminators (e.g., regression on perceptual scores such as PESQ) to optimize enhancement for both naturalness and signal fidelity, yielding edge-optimal models like WSR-MGAN and CMGAN (Pal et al., 2024, Cao et al., 2022, Huang et al., 2020, Pascual et al., 2019).
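As a concrete illustration of the discrete-token paradigm, the sketch below mimics the conditional-generation loop: the noisy audio is assumed to be already tokenized by a codec, and a stand-in for a trained autoregressive LM greedily emits enhanced codec tokens. All names (`toy_lm_logits`, `enhance`) and sizes here are hypothetical placeholders, not the Genhancer/GenSE implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 64        # toy codebook size (real RVQ codebooks are far larger)
COND_LEN = 20     # tokens obtained from the noisy input
OUT_LEN = 20      # enhanced tokens to generate

def toy_lm_logits(cond_tokens, prefix):
    """Stand-in for a trained autoregressive Transformer: returns next-token
    logits given the noisy-input tokens and the enhanced tokens so far."""
    # Deterministic pseudo-logits keyed on the context, so the sketch is runnable.
    seed = (sum(cond_tokens) + sum(prefix) + len(prefix)) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def enhance(cond_tokens):
    """Greedy conditional generation: noisy codec tokens in, enhanced tokens out."""
    out = []
    for _ in range(OUT_LEN):
        logits = toy_lm_logits(cond_tokens, out)
        out.append(int(np.argmax(logits)))   # greedy; real systems often sample
    return out

noisy_tokens = rng.integers(0, VOCAB, size=COND_LEN).tolist()
enhanced_tokens = enhance(noisy_tokens)
print(len(enhanced_tokens))  # 20 codec tokens, ready for the codec decoder
```

In a real system, `toy_lm_logits` would be a Transformer forward pass and the output tokens would be fed to the neural codec's decoder to reconstruct the waveform.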
2. Hallucination Phenomena and Error Metrics
Generative models display characteristic hallucination errors:
- Linguistic Hallucinations: Phoneme omissions, insertions, and semantic drift can yield syntactically plausible but incorrect utterances.
- Acoustic Hallucinations: Speaker inconsistencies, non-natural timbre shifts, and prosodic deviations degrade perceptual identity.
- These effects are not reliably detected by traditional non-intrusive metrics (DNSMOS, UTMOS, ASR confidence, CTC score), which correlate poorly with content and identity corruption.
To address this, recent research advocates explicit hallucination-aware metrics:
- Confidence-based Filtering: Token log-probabilities from the generative decoder are averaged into an utterance-level confidence score, providing a non-intrusive proxy for fidelity and correctness, with strikingly higher SRCC against intrusive metrics (ESTOI, SI-SDR, PESQ, SpeechBERTScore, Levenshtein phoneme similarity, WAcc, speaker similarity) than legacy methods (Yamauchi et al., 18 Jan 2026).
- Reference-aware and semantic metrics: Word Error Rate (WER), normalized phoneme edit distance (LPS), and speaker similarity (cosine embedding over RawNet3/ECAPA-TDNN systems) quantify content and acoustic preservation under severe distortions (Rong et al., 17 Nov 2025, Guimarães et al., 13 Apr 2025).
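A minimal sketch of the normalized phoneme edit distance underlying LPS-style metrics, assuming phonemes are given as ARPAbet-like symbol lists (the `phoneme_similarity` helper is a hypothetical name, not the published metric's implementation):

```python
def levenshtein(ref, hyp):
    """Classic edit distance over symbol sequences (phonemes here)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n]

def phoneme_similarity(ref_phones, hyp_phones):
    """Normalized similarity in [0, 1]; 1.0 means the phoneme strings match."""
    dist = levenshtein(ref_phones, hyp_phones)
    return 1.0 - dist / max(len(ref_phones), len(hyp_phones), 1)

ref = ["HH", "AH", "L", "OW"]           # "hello"
hyp = ["HH", "AH", "L", "AX", "OW"]     # one phoneme hallucinated by the model
print(phoneme_similarity(ref, hyp))     # 0.8
```

The same edit-distance core, applied at the word level, yields WER; applying it over phonemes makes the metric robust to spelling variants while still exposing content hallucinations.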
3. Techniques for Hallucination Mitigation and Data Curation
Effective suppression and detection of hallucination errors are achieved via several architectural and algorithmic innovations:
- Phonological Priors via Representation Distillation: PASE fine-tunes a student SSL model (WavLM) against clean teacher outputs to anchor enhanced features in a robust phonological manifold, minimizing linguistic hallucinations without learning from contaminated noisy tokens (Rong et al., 17 Nov 2025).
- Dual-Stream Vocoders: Separate conditioning streams for phonetic (high-level) and acoustic (low-level, speaker/prosody cues) information explicitly preserve both content and speaker identity; summation or projection mechanisms enable robust fusion for waveform synthesis (Rong et al., 17 Nov 2025).
- Confidence-based Filtering in Corpus Curation: By thresholding confidence scores or retaining only the top-scoring utterances, GSE systems filter out hallucinated outputs, improving downstream TTS model performance (UTMOS↑, DNSMOS↑, WER↓), as empirically demonstrated on large in-the-wild datasets (TITW-hard) (Yamauchi et al., 18 Jan 2026).
- Hierarchical LMs and Token-Chain Prompting: GenSE and OmniGSE utilize multi-stage LMs and prompt streams (semantic + acoustic tokens) to enhance stability and timbre consistency, outperforming SOTA SE systems on DNSMOS, SECS, VQScore, and WER (Yao et al., 5 Feb 2025, Mu et al., 25 Jul 2025).
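The confidence-based curation step above can be sketched as follows; the corpus layout and the 80% retention value are illustrative assumptions, not the published recipe:

```python
def utterance_confidence(token_logprobs):
    """Mean token log-probability: the non-intrusive fidelity proxy."""
    return sum(token_logprobs) / len(token_logprobs)

def filter_corpus(corpus, retain=0.8):
    """Keep the top `retain` fraction of utterances by confidence score."""
    scored = sorted(corpus,
                    key=lambda u: utterance_confidence(u["logprobs"]),
                    reverse=True)
    keep = max(1, int(len(scored) * retain))
    return scored[:keep]

# Toy corpus: low mean log-probability flags likely hallucinations.
corpus = [
    {"id": "utt1", "logprobs": [-0.1, -0.2, -0.1]},
    {"id": "utt2", "logprobs": [-2.5, -3.0, -1.8]},
    {"id": "utt3", "logprobs": [-0.4, -0.3, -0.5]},
    {"id": "utt4", "logprobs": [-1.2, -0.9, -1.1]},
    {"id": "utt5", "logprobs": [-4.0, -3.5, -2.9]},
]
kept = filter_corpus(corpus, retain=0.8)
print([u["id"] for u in kept])  # the 4 most confident utterances survive
```

Only the surviving utterances are passed on as TTS training data; the discarded tail is where hallucinated content concentrates.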
4. Experimental Validation and Performance Benchmarks
GSE frameworks are evaluated across diverse corpora (LibriTTS, DNS-Challenge, WHAMR, EARS-WHAM, VCTK, HiFi-TTS, VoiceBank-DEMAND) and distorted domains (packet loss, clipping, network artifacts, reverberation):
- Correlation and Quality Analysis: Confidence-based filtering achieves SRCC up to 0.88 with ESTOI, 0.883 with PESQ, and 0.89 with SpeechBERTScore (EARS-WHAM). Filtering at 70–80% retention yields ΔUTMOS ≈ +0.16 to +0.20 and ΔWER ≈ −1.6 to −3.2 percentage points (Yamauchi et al., 18 Jan 2026).
- Comparative Experiments: PASE demonstrates top performance across OVRL, SIG, BAK, SBS, LPS, and SpkSim, reducing WER by >50% and doubling speaker similarity over generative competitors. Ablations confirm the necessity of the phonological prior and dual acoustic conditioning (Rong et al., 17 Nov 2025).
- Generalization: MeanSE and MeanFlowSE show superior out-of-domain robustness (WHAMR!), maintaining high PESQ/ESTOI/BAK/OVRL at 1-NFE (one-step): MeanSE achieves DNSMOS 2.148, UTMOS 1.924, NISQA 2.523 (Wang et al., 25 Sep 2025, Zhu et al., 27 Sep 2025).
- Efficiency: One-step and embedding-based models (MeanFlowSE, Dasheng+ViT₃, WSR-MGAN-lite) achieve real-time factors as low as 0.02 and parameter budgets on the order of 2M, suitable for low-power/edge deployment, while matching the quality of multi-step and large-scale architectures (Zhu et al., 27 Sep 2025, Sun et al., 13 Jun 2025, Pal et al., 2024).
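SRCC figures like those above come from a plain Spearman rank correlation, i.e., the Pearson correlation of the two metrics' rank vectors. The standalone sketch below (with made-up per-utterance scores) shows how a non-intrusive confidence score would be correlated against an intrusive metric:

```python
def rankdata(values):
    """Average 1-based ranks, handling ties as Spearman's rho expects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank across the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srcc(x, y):
    """Spearman rank correlation coefficient of two score lists."""
    rx, ry = rankdata(x), rankdata(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-utterance scores: confidence vs. an intrusive metric (ESTOI).
confidence = [-0.1, -0.9, -0.4, -2.1, -1.3]
estoi      = [0.92, 0.61, 0.80, 0.35, 0.52]
print(round(srcc(confidence, estoi), 3))  # 1.0 — the rankings agree perfectly here
```

In practice, `scipy.stats.spearmanr` computes the same quantity; the point is that SRCC compares rankings, so a non-intrusive score can be validated against intrusive metrics without sharing their scale.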
5. Applications to Downstream Tasks and Practical Implications
GSE approaches are foundational in:
- Text-to-Speech (TTS) dataset curation: Enhanced, confidence-filtered corpora yield TTS models with improved MOS and intelligibility (UTMOS, DNSMOS, ASR WER), outperforming models trained on noisy or unfiltered data (Yamauchi et al., 18 Jan 2026).
- General Restoration: OmniGSE unifies denoising, dereverberation, super-resolution, and packet-loss concealment under hierarchical LM modeling, exhibiting SOTA DNSMOS, NISQA, PLCMOS in subjective and objective comparisons.
- Speech separation, echo cancellation, and bandwidth extension: LLaSE-G1, CMGAN, and SEFGAN demonstrate capacity for unifying single-input and dual-input SE subtasks with consistent quality and generalization (Kang et al., 1 Mar 2025, Cao et al., 2022, Strauss et al., 2023).
- Real-time and low-resource scenarios: WSR-MGAN and embedding-denoiser frameworks enable high-quality enhancement on resource-constrained hardware, making GSE practical for embedded and edge devices (Pal et al., 2024, Sun et al., 13 Jun 2025).
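Real-time suitability is usually summarized by the real-time factor (RTF), the ratio of wall-clock processing time to audio duration. A minimal sketch of how to measure it, using a trivial placeholder in place of an actual enhancement model:

```python
import time

def real_time_factor(process_fn, audio_seconds, *args):
    """RTF = wall-clock processing time / audio duration. RTF < 1 means
    faster than real time; edge-oriented GSE models report RTFs near 0.02."""
    start = time.perf_counter()
    process_fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Hypothetical stand-in for an enhancement model's forward pass.
def dummy_enhance(samples):
    return [0.5 * s for s in samples]  # trivial placeholder DSP

sr = 16_000
audio = [0.0] * (sr * 2)  # 2 seconds of silence at 16 kHz
rtf = real_time_factor(dummy_enhance, 2.0, audio)
print(rtf < 1.0)  # True on any modern machine for this trivial workload
```

For streaming deployment, RTF must be measured per chunk (including feature extraction and vocoding) rather than per file, since peak chunk latency is what determines whether the model keeps up with the input.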
6. Current Limitations and Future Directions
Contemporary challenges and open lines of research include:
- Residual errors and adaptation: One-step refinement models may struggle in extremely noisy/unseen domains, motivating exploration of adaptive fine-tuning and hybrid iterative-generative pipelines (Zhu et al., 27 Sep 2025).
- Hallucination risk in continuous-latent models: Current confidence-based filtering is effective for discrete token LMs; extending non-intrusive error detection to continuous-latent GSE requires new approaches, e.g., likelihood-based confidence in feature space (Yamauchi et al., 18 Jan 2026).
- Model footprint and decoding latency: Hierarchical LMs and deep Transformers (OmniGSE, DiTSE) maintain significant parameter and latency costs; advances in non-autoregressive, distillation, and parallel decoding are expected to yield further improvements (Mu et al., 25 Jul 2025, Guimarães et al., 13 Apr 2025).
- Unpaired and universal enhancement: Training and inference on unpaired or completely out-of-domain data remains an open problem, especially for GAN-adversarial SB and flow models (Han et al., 2 Jun 2025).
- Integration with multimodal and adaptive systems: Incorporation of visual context, automatic domain switching, and universal codec support are promising directions for next-generation GSE frameworks (Mu et al., 25 Jul 2025).
Generative Speech Enhancement is rapidly displacing classical and purely discriminative approaches, providing a principled foundation for high-quality speech reconstruction, hallucination mitigation, and robust dataset curation across a spectrum of real-world application domains. The field continues to advance toward unified, efficient, and fidelity-preserving enhancement under extreme and compound distortions.