Exploring Speech Enhancement for Low-resource Speech Synthesis
Abstract: High-quality, intelligible speech is essential for training text-to-speech (TTS) models; however, obtaining such data for low-resource languages is challenging and expensive. Applying speech enhancement to Automatic Speech Recognition (ASR) corpora mitigates the issue by augmenting the training data, but how the nonlinear speech distortion introduced by speech enhancement models affects TTS training has yet to be investigated. In this paper, we train a TF-GridNet speech enhancement model, apply it to low-resource datasets originally collected for ASR, and then train a discrete-unit-based TTS model on the enhanced speech. Using Arabic datasets as a case study, we show that the proposed pipeline significantly improves a low-resource TTS system over baseline methods in terms of the ASR word error rate (WER) metric. We also present an empirical analysis of the correlation between speech enhancement and TTS performance.