Assessing GSE performance for in-the-wild TTS dataset curation

Determine how well generative speech enhancement (GSE) models perform when curating in-the-wild text-to-speech (TTS) datasets, where recording conditions are more challenging than in curated corpora such as LibriTTS.

Background

Generative speech enhancement (GSE) has been used to clean speech datasets for TTS, notably with Miipher to produce LibriTTS-R from LibriTTS. However, LibriTTS is not an in-the-wild dataset, leaving uncertainty about how well such methods generalize to noisier, more variable real-world data.

Understanding GSE performance on truly in-the-wild datasets is critical for reliable dataset curation pipelines, where enhancement errors such as hallucinations can degrade downstream TTS performance. The paper by Yamauchi et al. highlights this gap and investigates the effectiveness of GSE in these settings.

References

While this work shows the potential of GSE for TTS dataset curation, LibriTTS is not an in-the-wild dataset. Consequently, it remains unclear how well GSE performs in more challenging in-the-wild scenarios.

Confidence-based Filtering for Speech Dataset Curation with Generative Speech Enhancement Using Discrete Tokens (2601.12254 - Yamauchi et al., 18 Jan 2026) in Section 3, Related works (TTS dataset curation with SE)