ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Published 22 Dec 2023 in cs.SD and eess.AS (arXiv:2312.14398v2)

Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis, but multilingual TTS systems remain limited to resource-rich languages because large corpora of paired text and studio-quality audio are scarce. TTS systems are also typically built on a single speaker's voice, yet there is growing interest in systems that can synthesize new voices from only a few seconds of a speaker's speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework that conditions synthesis on quantized latent speech representations from a large-scale, pre-trained, self-supervised model, combining text-based and speech-based self-supervised learning models for multilingual speech synthesis. The proposed model generalizes zero-shot not only to unseen speakers but also to unseen languages. Comprehensive subjective and objective evaluations across a series of experiments show that the model is effective in terms of speech naturalness and speaker similarity for both seen and unseen speakers in six high-resource languages. We also tested the method on two hypothetically low-resource languages; the results are promising, indicating that the proposed approach can synthesize audio that is intelligible and highly similar to the target speaker's voice, even without any training data for the new, unseen language.
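The abstract's key idea is conditioning TTS on discrete codes obtained by quantizing continuous frame-level features from a pre-trained self-supervised model. The following is a minimal, hypothetical NumPy sketch of that quantization step only: random vectors stand in for SSL frame features, and the codebook, dimensions, and codebook size are illustrative placeholders, not the paper's actual configuration (in practice the codebook would be learned, e.g. via k-means over features from a model such as wav2vec 2.0 or HuBERT).

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frame-level SSL features: 50 frames of 768-dim vectors.
features = rng.normal(size=(50, 768))

# Illustrative codebook of K centroids (learned offline in a real system).
K = 16
codebook = rng.normal(size=(K, 768))

def quantize(feats, codebook):
    """Assign each frame to its nearest centroid (squared Euclidean distance)."""
    # Pairwise squared distances between every frame and every centroid.
    d2 = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One discrete unit ID per frame.
    return d2.argmin(axis=1)

codes = quantize(features, codebook)
print(codes.shape)  # (50,) — one discrete code per input frame
```

The resulting code sequence, rather than raw audio or mel spectrograms, is what a framework of this kind predicts from text and then decodes back to a waveform.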
