
Scaling Speech Technology to 1,000+ Languages

Published 22 May 2023 in cs.CL, cs.SD, and eess.AS (arXiv:2305.13516v1)

Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.


Summary

  • The paper presents the MMS project, a novel approach using a massive multilingual dataset to expand speech technology to over 1,000 languages.
  • It leverages self-supervised learning with frameworks like wav2vec 2.0 and XLS-R to enhance automatic speech recognition, language identification, and text-to-speech synthesis.
  • The study reports reduced error rates and robust performance across diverse languages while carefully addressing bias and ethical considerations in dataset use.

Scaling Speech Technology to 1,000+ Languages: An Overview

The paper "Scaling Speech Technology to 1,000+ Languages" introduces the Massively Multilingual Speech (MMS) project, a comprehensive initiative aimed at significantly expanding the language coverage of speech technology. This expansion addresses the limitations of current speech systems, which predominantly focus on a small subset of the world’s 7,000 languages. The MMS project delivers on this objective by introducing a novel dataset and leveraging advancements in self-supervised learning to build robust models for automatic speech recognition (ASR), language identification (LID), and speech synthesis (TTS).

Dataset Creation and Alignment

The MMS dataset includes labeled, paired speech and text for 1,107 languages and unlabeled audio for 3,809 languages. The data are primarily derived from readings of publicly available religious texts, specifically the New Testament, which offers a consistent linguistic structure across numerous languages. A central challenge in this portion of the work is the efficient forced alignment of audio and text, which the authors address with an alignment algorithm that operates effectively at scale. Notably, a special star token is introduced into the alignment vocabulary to absorb spoken audio that has no counterpart in the provided transcript, handling mismatches between recordings and text.
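The alignment step can be illustrated with a minimal CTC forced-alignment sketch: a Viterbi pass over frame-level emission log-probabilities that maps each frame to a token or blank. This is an illustrative reimplementation, not the paper's scalable algorithm; the star token would simply be one extra vocabulary entry whose emissions soak up unmatched audio.

```python
import numpy as np

BLANK = 0  # CTC blank index

def ctc_forced_align(log_probs, tokens):
    """Viterbi forced alignment of `tokens` to `log_probs` (T x V array).

    Returns the best per-frame token sequence (BLANK for silence/transition).
    """
    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    ext = [BLANK]
    for tok in tokens:
        ext += [tok, BLANK]
    T, S = len(log_probs), len(ext)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0][ext[0]]
    score[0, 1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [score[t - 1, s]]              # stay in the same state
            if s >= 1:
                cands.append(score[t - 1, s - 1])  # advance one state
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append(score[t - 1, s - 2])  # skip the blank between tokens
            best = int(np.argmax(cands))
            score[t, s] = cands[best] + log_probs[t][ext[s]]
            back[t, s] = s - best                  # remember predecessor state
    # The path must end on the last token or the trailing blank.
    s = S - 1 if score[T - 1, S - 1] >= score[T - 1, S - 2] else S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t, s]
    return path[::-1]
```

Given frame posteriors and the reference transcript token ids, the returned path yields per-frame labels from which segment boundaries can be read off.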

Self-Supervised Model Pre-training

The authors utilize the wav2vec 2.0 framework, a state-of-the-art approach to self-supervised speech representation learning, extending it through the XLS-R architecture to cover 1,406 languages. Training these models involves a sampling strategy that balances data across both language and dataset dimensions, permitting the inclusion of a wide-ranging multilingual corpus. The resulting pre-trained models outperform previous benchmarks, particularly on under-represented languages.
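Such language balancing is commonly implemented (for instance in XLS-R) as temperature-based resampling, where a language with n hours of data is drawn with probability proportional to n^β for some β in (0, 1], upsampling low-resource languages. The sketch below assumes that scheme; the language codes and hour counts are illustrative only.

```python
import numpy as np

def sampling_probs(hours_per_lang, beta=0.5):
    """Temperature-based sampling: p_l proportional to n_l ** beta.

    beta = 1 reproduces the raw data distribution; beta < 1 flattens it,
    giving low-resource languages a larger share of each training batch.
    """
    n = np.array(list(hours_per_lang.values()), dtype=float)
    p = n ** beta
    return dict(zip(hours_per_lang, p / p.sum()))

# Illustrative: a high-resource and a low-resource language.
probs = sampling_probs({"eng": 900.0, "yor": 9.0}, beta=0.5)
```

With β = 0.5, the low-resource language's sampling share rises well above its raw share of the corpus, which is what lets a single pre-training run cover hundreds of tiny languages without them being drowned out.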

Automatic Speech Recognition

In ASR, the project makes several significant strides. The MMS models trained on 1,107 languages demonstrate a marked reduction in word error rate compared to existing models like Whisper, particularly when decoding is supplemented with n-gram language models. Language-specific adapters allow a single multilingual model to handle this vastly expanded language set without significant performance degradation.
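A bottleneck adapter of this kind can be sketched as a small residual module inserted per language on top of a shared encoder: project down, apply a nonlinearity, project back up, and add the input. The dimensions, initialization, and language codes below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

class Adapter:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    def __init__(self, d_model, d_bottleneck):
        self.W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.W_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, x):
        h = np.maximum(x @ self.W_down, 0.0)  # (frames, d_bottleneck)
        return x + h @ self.W_up              # residual keeps shared features

# One tiny adapter per language, sharing the same frozen encoder output.
adapters = {lang: Adapter(768, 16) for lang in ["eng", "swh", "yor"]}
x = rng.normal(size=(50, 768))                # 50 frames of encoder features
y = adapters["swh"](x)
```

Because only the adapter (and an output head) is language-specific, the per-language parameter cost stays small enough to scale to a thousand languages.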

Language Identification

The expansion of LID capabilities to 4,017 languages underscores the MMS project's contributions. Combining the MMS-lab-U and MMS-unlab data for training yields competitive results in both in-domain and out-of-domain settings when compared against models trained on existing datasets such as FLEURS and VoxLingua-107.
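A minimal LID head might mean-pool the encoder's frame features and apply a softmax classifier over the language inventory. The sketch below assumes this common design rather than the paper's exact head; weights and language codes are placeholders.

```python
import numpy as np

def identify_language(features, W, b, langs):
    """Mean-pool (frames, d) features, then score one class per language."""
    pooled = features.mean(axis=0)        # (d,) utterance embedding
    logits = pooled @ W + b               # (n_langs,)
    z = np.exp(logits - logits.max())     # numerically stable softmax
    probs = z / z.sum()
    return langs[int(np.argmax(probs))], probs
```

In practice W and b would be trained on top of a frozen or fine-tuned wav2vec 2.0 encoder; here they are just inputs so the pooling-and-classify shape is visible.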

Text-to-Speech Synthesis

MMS also advances TTS, training VITS models for 1,107 languages. Despite the resource-intensive nature of typical TTS training setups, the MMS models achieve reasonable efficiency and quality by employing optimized training routines and pre-processing steps, including denoising and pitch-variance filtering to exclude recordings with background music. Evaluation on in-domain MMS-lab data and out-of-domain FLEURS data shows that the models produce intelligible, natural-sounding speech across a wide spectrum of languages.
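The pitch-variance filter can be sketched as a simple threshold on the spread of voiced pitch estimates per recording: singing over background music tends to have much wider pitch excursions than read speech. The threshold value and the use of sample standard deviation here are assumptions for illustration, not the paper's settings.

```python
import statistics

def keep_recording(pitch_track_hz, max_std_hz=60.0):
    """Keep a recording only if its voiced pitch spread looks speech-like.

    pitch_track_hz: per-frame F0 estimates, with 0.0 marking unvoiced frames.
    """
    voiced = [f0 for f0 in pitch_track_hz if f0 > 0.0]
    if len(voiced) < 2:
        return False  # too little voiced audio to judge
    return statistics.stdev(voiced) <= max_std_hz
```

Run over every chapter recording, this kind of gate discards the sung or music-backed portions before TTS training, which matters because the source data is read aloud with occasional musical accompaniment.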

Addressing Bias and Ethical Considerations

The development of the MMS project included assessing and mitigating potential biases inherent in the dataset, particularly gender bias and domain bias from religious texts. The authors' analysis shows that while models trained on the MMS dataset exhibit some bias, the level is comparable to that of models trained on datasets from other domains, such as FLEURS. The ethical implications of using religious data in machine learning are also weighed carefully; the authors argue that such use is consistent with prior practice in the field and cite similar earlier studies.

Conclusion and Implications

The research presented in this paper is a substantial contribution towards democratizing speech technology access across a broader spectrum of languages and cultures. By leveraging both traditional and innovative machine learning techniques, the MMS project not only extends existing technological boundaries but also sets a precedent for future developments in multilingual speech technology. The potential for further scaling, integration of additional speech-related tasks, and the realization of multi-task models presents exciting avenues for continued exploration and development in artificial intelligence.
