Exploring Speech Enhancement for Low-resource Speech Synthesis

Published 19 Sep 2023 in eess.AS (arXiv:2309.10795v1)

Abstract: High-quality, intelligible speech is essential for training text-to-speech (TTS) models, but obtaining such data for low-resource languages is challenging and expensive. Applying speech enhancement to Automatic Speech Recognition (ASR) corpora mitigates the issue by augmenting the training data; however, how the nonlinear speech distortion introduced by enhancement models affects TTS training remains to be investigated. In this paper, we train a TF-GridNet speech enhancement model, apply it to low-resource datasets originally collected for ASR, and then train a discrete-unit-based TTS model on the enhanced speech. Using Arabic datasets as an example, we show that the proposed pipeline significantly improves the low-resource TTS system over baseline methods in terms of ASR word error rate (WER). We also present an empirical analysis of the correlation between speech enhancement and TTS performance.
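The pipeline described in the abstract can be sketched in three stages: enhance the noisy ASR-corpus audio, extract discrete units from the enhanced speech, and pair those units with transcripts as TTS training data. The following is a minimal illustrative sketch, not the authors' implementation; `enhance` and `discretize` are hypothetical stand-ins for the trained TF-GridNet model and the self-supervised discrete-unit extractor.

```python
# Illustrative sketch of the enhancement-then-TTS data pipeline.
# All function bodies are toy placeholders, NOT the paper's models.

def enhance(waveform):
    # Stand-in for a trained TF-GridNet enhancement model:
    # here we simply clip samples to [-1, 1] to mimic denoising.
    return [max(-1.0, min(1.0, x)) for x in waveform]

def discretize(waveform, n_units=8):
    # Stand-in for discrete unit extraction (e.g. k-means over
    # self-supervised features): quantize amplitude into unit IDs.
    return [int((x + 1.0) / 2.0 * (n_units - 1)) for x in waveform]

def build_tts_corpus(recordings):
    # Pair each transcript with discrete units taken from the
    # *enhanced* audio; these pairs would then train the TTS model.
    return [(text, discretize(enhance(wav))) for text, wav in recordings]

corpus = build_tts_corpus([("marhaban", [0.2, -1.5, 0.9])])
```

The key design point the sketch captures is that the TTS model never sees raw noisy audio: it is trained only on discrete units derived from enhanced speech, which is why distortion introduced by the enhancement model matters for downstream TTS quality.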
