SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Published 14 Oct 2021 in eess.AS, cs.CL, cs.LG, and cs.SD (arXiv:2110.07205v3)

Abstract: Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.

Citations (178)

Summary

  • The paper introduces SpeechT5, a unified encoder-decoder framework that integrates speech and text processing tasks.
  • It utilizes cross-modal vector quantization to effectively align representations, resulting in improved ASR, TTS, and ST performance.
  • Comprehensive evaluations show that SpeechT5 outperforms state-of-the-art baselines, indicating strong potential for future multimodal advancements.

Overview of "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing"

The paper proposes a novel framework named SpeechT5 that unifies modalities in spoken language processing tasks through an encoder-decoder pre-training approach. Inspired by the success of the T5 model in natural language processing, SpeechT5 leverages large-scale unlabeled speech and text data to achieve joint representation learning. The model design targets tasks such as automatic speech recognition (ASR), text-to-speech (TTS), speech translation (ST), and more.
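The overall data flow described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the random linear maps stand in for learned pre/post-nets, the `tanh` stands in for the shared Transformer encoder-decoder, and all dimensions (80-dim log-Mel frames, a 32-symbol vocabulary, hidden size 16) are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # shared hidden size (illustrative)

def linear(in_dim, out_dim):
    """A fixed random projection standing in for a learned layer."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: x @ W

# Modal-specific pre/post-nets (the speech-decoder pre-net of the
# full model is omitted for brevity in this sketch).
speech_pre  = linear(80, HIDDEN)   # log-Mel frames -> shared hidden space
text_pre    = linear(32, HIDDEN)   # token embeddings -> shared hidden space
speech_post = linear(HIDDEN, 80)   # shared hidden space -> spectrogram frames
text_post   = linear(HIDDEN, 32)   # shared hidden space -> token logits

def shared_encoder_decoder(h):
    """Placeholder for the shared Transformer encoder-decoder."""
    return np.tanh(h)

def speecht5(x, src, tgt):
    """Route any speech/text input to any speech/text output
    through the single shared backbone."""
    pre  = speech_pre if src == "speech" else text_pre
    post = speech_post if tgt == "speech" else text_post
    return post(shared_encoder_decoder(pre(x)))

# ASR-style pass: 50 frames of speech features in, token logits out.
mel = rng.standard_normal((50, 80))
logits = speecht5(mel, src="speech", tgt="text")
print(logits.shape)  # (50, 32)
```

The point of the sketch is the routing: every task pairing (ASR, TTS, voice conversion, etc.) reuses the same backbone and differs only in which pre-net and post-net are attached.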

Key Contributions

  1. Unified-Modal Framework: SpeechT5 is centered around a shared encoder-decoder network while employing six modal-specific pre/post-nets for speech and text. This structure allows the model to handle diverse tasks in a speech/text to speech/text format cohesively.
  2. Cross-Modal Vector Quantization: The proposed method aligns speech and text representations into a unified semantic space. It incorporates vector quantization as an interface between the encoder and decoder, thereby effectively mixing speech and text information for better cross-modal learning.
  3. Comprehensive Evaluation: Extensive experiments demonstrate SpeechT5's superiority across a spectrum of tasks. Notably, it outperforms existing models in ASR with both clean and noisy data, and achieves competitive results in TTS, ST, and other spoken language processing challenges.
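The cross-modal vector quantization of contribution 2 can be illustrated as follows: encoder states are snapped to their nearest entries in a shared codebook of latent units, and a random fraction of positions is replaced by those quantized units before reaching the decoder. This is a minimal sketch under assumed shapes and a Euclidean nearest-neighbour lookup; codebook size, hidden size, and the mixing ratio `p` are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, CODEBOOK = 16, 8

# Shared latent units acting as the speech/text interface.
codebook = rng.standard_normal((CODEBOOK, HIDDEN))

def quantize(states):
    """Map each state to its nearest codebook entry (Euclidean)."""
    # states: (T, HIDDEN); squared distances to each code: (T, CODEBOOK)
    d = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(-1)]

def mix_with_latents(states, p=0.5):
    """Randomly replace a fraction p of encoder states with their
    quantized latent units before they are fed to the decoder."""
    q = quantize(states)
    swap = rng.random(len(states)) < p       # which positions to swap
    return np.where(swap[:, None], q, states)

enc_states = rng.standard_normal((10, HIDDEN))
mixed = mix_with_latents(enc_states, p=0.5)
print(mixed.shape)  # (10, 16)
```

Because both speech-derived and text-derived states are quantized against the same codebook, the latent units give the two modalities a common vocabulary, which is what aligns their representations in the shared semantic space.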

Strong Numerical Results and Bold Claims

  • In ASR tasks, SpeechT5 surpasses wav2vec 2.0 and HuBERT baselines, achieving lower word error rates even without external language model fusion.
  • The model shows a significant edge over state-of-the-art baselines in TTS quality, voice conversion, and speaker identification (SID) accuracy, demonstrating the effectiveness of the pre-training approach.

Implications and Future Directions

SpeechT5 bridges the gap between speech and text modalities, showcasing promising capabilities for tasks requiring modality transformation. This alignment can lead to enhancements in cross-modal understanding and generation, and suggests potential improvements in areas such as multilingual processing and speech-to-speech translation.

Future developments could involve scaling the model with more data or extending it to handle additional languages, thus broadening its applicability. As the field continues to explore multimodal learning, innovations such as SpeechT5 could redefine methodologies in spoken language processing tasks.

Conclusion

The SpeechT5 framework presents a significant step forward in the integration of speech and text processing tasks under a unified model. Through novel pre-training strategies and comprehensive evaluations, the work lays a foundation for future exploration and application of cross-modal learning techniques in AI.
