MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

Published 29 Nov 2022 in cs.MM, cs.CL, cs.LG, cs.SD, and eess.AS | (2212.00500v1)

Abstract: In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-codes pairs and phoneme-text pairs are a supplement to the supervised speech-text pairs. To train the encoder to learn better speech representation, we introduce self-supervised masked speech prediction (MSP) and supervised phoneme prediction (PP) tasks to learn to map speech into phonemes. Besides, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve the pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.


Summary

  • The paper presents a multi-modal, multi-task pre-training framework integrating five tasks to enhance Mandarin ASR.
  • It leverages phoneme data to bridge speech and text, yielding over 40% relative improvement on the AISHELL-1 dataset.
  • The ablation study confirms the crucial role of the phoneme-to-text task, showing it cannot be replaced by an external language model.

An Expert Analysis of MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

The paper "MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition" presents a sophisticated approach to enhancing Mandarin Automatic Speech Recognition (ASR) using a multi-modal and multi-task learning framework. This research addresses the challenges inherent in Mandarin's ideographic writing system by integrating phoneme modalities to bridge the gap between speech and text, a novel strategy not typically required for alphabetic languages such as English.

Methodology and Framework

MMSpeech employs an encoder-decoder architecture trained with five tasks: self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T), alongside masked speech prediction (MSP), phoneme prediction (PP), and supervised speech-to-text (S2T). The framework uses both unlabeled speech and unlabeled text, and it introduces phoneme data to capture modality-invariant information, which is particularly valuable for Mandarin, where characters and sounds are loosely mapped and homophones are common. A sketch of how the five task losses might be combined is given below.
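The snippet below is a minimal sketch, not the authors' code, of a weighted multi-task objective over the five losses; the task names, equal default weights, and PyTorch usage are illustrative assumptions.

```python
# Hypothetical combination of the five pre-training losses into one
# multi-task objective. Weighting scheme is an assumption, not the paper's.
import torch

def multitask_loss(losses, weights=None):
    """Weighted sum of per-task losses, e.g. {"s2c": ..., "p2t": ...}."""
    if weights is None:
        # Equal weighting by default; how the tasks are balanced is a tunable choice.
        weights = {name: 1.0 for name in losses}
    return sum(weights[name] * loss for name, loss in losses.items())

# Example usage with dummy scalar losses standing in for the five tasks.
dummy = {name: torch.tensor(1.0, requires_grad=True)
         for name in ("s2c", "p2t", "msp", "pp", "s2t")}
total = multitask_loss(dummy)
total.backward()
```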

The encoder-decoder pre-training integrates the P2T and S2C tasks to better exploit large-scale unlabeled text and speech data, respectively. The P2T task adapts conventional text-infilling by feeding noised phoneme sequences rather than Chinese characters to the encoder and training the decoder to recover the original text, which reduces the discrepancy between modalities. The S2C task pairs unlabeled speech with pseudo-codes so that the decoder also learns from unlabeled audio, strengthening sequence-to-sequence performance.
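As a rough illustration of building P2T pairs from unlabeled text, the sketch below converts characters to pinyin syllables with the pypinyin package (a stand-in grapheme-to-phoneme tool, not necessarily what the paper uses) and applies a simple span-masking corruption; the mask ratio and span lengths are assumptions.

```python
# Illustrative construction of phoneme-to-text (P2T) training pairs.
import random
from pypinyin import lazy_pinyin, Style  # pip install pypinyin

def text_to_phonemes(text):
    # Syllable-level pinyin with tone digits, e.g. "语音" -> ["yu3", "yin1"].
    return lazy_pinyin(text, style=Style.TONE3)

def mask_spans(tokens, mask_ratio=0.3, mask_token="<mask>"):
    # Text-infilling style corruption: random contiguous spans are replaced
    # by a single mask token (span-length choice here is an assumption).
    out, i = [], 0
    while i < len(tokens):
        if random.random() < mask_ratio:
            out.append(mask_token)
            i += random.randint(1, 3)   # skip a short span
        else:
            out.append(tokens[i])
            i += 1
    return out

text = "语音识别"                              # unlabeled Mandarin text
source = mask_spans(text_to_phonemes(text))   # noised phonemes -> encoder input
target = list(text)                           # original characters -> decoder target
```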

In terms of encoder pre-training, the MSP task refines speech representation using phoneme distributions as targets, while the PP task aids in aligning speech with text by predicting phonemes through a CTC loss on paired speech-text data. The inclusion of S2T within pre-training allows for performance evaluation without further fine-tuning, simplifying the validation of pre-training efficacy.
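The phoneme prediction idea can be sketched as a standard CTC loss over encoder frames against phoneme targets from paired data; the vocabulary size, sequence lengths, and blank index below are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch of a CTC-based phoneme prediction (PP) loss in PyTorch.
import torch
import torch.nn as nn

vocab_size = 100             # number of phonemes + 1 for the CTC blank (id 0); assumed
T, B, C = 50, 2, vocab_size  # encoder frames, batch size, classes

# Stand-in for encoder outputs projected to phoneme logits.
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)

# Phoneme targets for the two utterances, padded to a common length.
targets = torch.randint(1, vocab_size, (B, 20), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([20, 15], dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
pp_loss = ctc(log_probs, targets, input_lengths, target_lengths)
```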

Experimental Results and Analysis

Experiments on the AISHELL-1 dataset show that MMSpeech achieves state-of-the-art performance, with a more than 40% relative improvement over other pre-training methods. These outcomes reflect the efficacy of the multi-task framework and the pivotal role of phoneme integration.

Key findings from the ablation study highlight each task's contribution to the framework. The P2T task proves particularly crucial, with results indicating that it cannot be replaced by an external language model. This underscores the effectiveness of leveraging unlabeled text data, particularly in languages with a high incidence of homophones such as Mandarin.

Implications and Future Outlook

The implications of MMSpeech are significant, both practically and theoretically. Practically, it demonstrates a robust framework capable of substantially advancing ASR for ideographic languages such as Mandarin. Theoretically, it establishes the importance of modality-bridging elements like phonemes, an approach that may generalize to other languages whose writing systems are loosely coupled to pronunciation.

Looking ahead, this research encourages exploration of other languages with similar challenges and invites extending multi-modal strategies beyond ASR. Further developments might include testing on more diverse datasets to improve robustness and integrating additional modalities to enhance performance in more varied linguistic settings.

In conclusion, MMSpeech represents a substantial contribution to the field of speech recognition, providing a well-rounded methodological framework that leverages multi-modal and multi-task learning to overcome significant linguistic challenges.
