A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

Published 5 Mar 2023 in eess.AS, cs.HC, cs.LG, and cs.SD | (2303.02719v2)

Abstract: Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of the conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study addresses these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while keeping the TTS model architecture and training settings constant. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms the other tested SSLs and the mel-spectrogram, in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems, and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
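The two-stage setup the abstract describes can be sketched as follows: an SSL encoder turns speech into frame-level features, a chosen intermediate layer replaces the mel-spectrogram as the acoustic model's prediction target, and a vocoder is trained to invert those features back into a waveform. The toy numpy sketch below illustrates only the layer-selection idea; the encoder class, dimensions, and function names are illustrative assumptions, not the paper's code. In practice the features would come from a pretrained (ASR-finetuned) wav2vec 2.0 model.

```python
import numpy as np

class ToySSLEncoder:
    """Toy stand-in for a 12-layer SSL speech encoder (e.g. wav2vec 2.0).

    Real features would come from a pretrained model; this class only
    illustrates how per-layer hidden states are exposed so that one
    intermediate layer can be picked as the TTS representation.
    """
    def __init__(self, dim=768, num_layers=12, seed=0):
        rng = np.random.default_rng(seed)
        # One small random projection per "transformer layer" (toy only).
        self.weights = [rng.standard_normal((dim, dim)) * 0.01
                        for _ in range(num_layers)]

    def forward(self, frames):
        # Collect every layer's output: hidden[0] is the input,
        # hidden[k] is the output of layer k.
        hidden = [frames]
        x = frames
        for w in self.weights:
            x = np.tanh(x @ w)
            hidden.append(x)
        return hidden

def extract_layer_features(encoder, frames, layer=9):
    # Per the paper's listening tests, layer 9 of the 12-layer
    # ASR-finetuned wav2vec 2.0 worked best as the acoustic-model
    # target, for both read and spontaneous TTS.
    return encoder.forward(frames)[layer]

encoder = ToySSLEncoder()
frames = np.zeros((200, 768))   # dummy frame sequence: (time, feature)
feats = extract_layer_features(encoder, frames, layer=9)
print(feats.shape)              # (200, 768)
```

The acoustic model is then trained to predict `feats` from text (instead of a mel-spectrogram), and a vocoder such as HiFi-GAN is trained to map `feats` to audio; only the intermediate representation changes between the compared systems.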

