
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

Published 29 Oct 2024 in cs.CL, cs.LG, cs.SD, and eess.AS | arXiv:2410.22179v2

Abstract: Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements to AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backpropagation and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it still benefits from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
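To make the core idea concrete, the following is a minimal sketch of cross-attention biased by a per-step alignment position. This is an illustration under assumptions, not the paper's actual implementation: the function name, the Gaussian window over relative encoder position, and the `width` parameter are all hypothetical; in the paper the alignment position is a latent quantity learned via backpropagation rather than supplied externally.

```python
import numpy as np

def location_relative_cross_attention(queries, keys, values, positions, width=3.0):
    """Cross-attention whose scores are biased toward a per-decoder-step
    alignment position (hypothetical sketch, not the paper's method).

    queries:   (T_dec, d)  decoder queries
    keys:      (T_enc, d)  encoder keys
    values:    (T_enc, d)  encoder values
    positions: (T_dec,)    alignment position for each decoder step
    width:     softness of the location window (assumed Gaussian)
    """
    T_enc = keys.shape[0]
    j = np.arange(T_enc)  # encoder position grid

    # Standard scaled dot-product content scores.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])

    # Relative-location bias: penalize encoder positions far from the
    # current alignment position. The Gaussian form is an assumption;
    # it encodes only the relative offset (j - position), so the same
    # bias pattern applies at any absolute position, which is the
    # property that supports length generalization.
    bias = -((j[None, :] - positions[:, None]) ** 2) / (2.0 * width**2)
    scores = scores + bias

    # Softmax over encoder positions.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights
```

Because the bias depends only on the offset between each encoder position and the current alignment position, the attention pattern is translation-invariant along the input, which is what lets such a mechanism track a monotonic TTS alignment well past training-length utterances.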
