Miipher-2: A Universal Speech Restoration Framework for Large-Scale Data
Miipher-2 is a universal speech restoration model designed to address the complexities of restoring speech data at large scale, particularly for cleaning the training data of generative models, including large language models (LLMs). The work is motivated by the need for reliable, high-quality audio when training sophisticated models: traditional collection methods such as web scraping often introduce noisy samples.
Innovation Through Self-Supervised Learning
The Miipher-2 model leverages the Universal Speech Model (USM) as a robust, conditioning-free feature extractor. This sidesteps the need for explicit text or speaker-ID conditioning and enables the model to generalize across more than 300 languages, including those with little available high-quality training data. The model can infer clean USM features even from noisy input, addressing the challenge of handling unseen languages.
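At a high level, the restoration approach described above can be sketched as a three-stage pipeline: extract self-supervised features from noisy audio, predict clean features from them, and vocode the cleaned features back to a waveform. The sketch below is purely illustrative; the function bodies (`extract_features`, `clean_features`, `vocode`) are hypothetical stand-ins, not the actual USM encoder, feature cleaner, or WaveFit vocoder.

```python
import numpy as np

def extract_features(noisy_audio: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen USM encoder: maps raw audio to
    frame-level SSL features (here: a fixed random projection)."""
    frames = len(noisy_audio) // 320              # assume 20 ms hop at 16 kHz
    proj = np.random.default_rng(0).standard_normal((320, 1024))
    return noisy_audio[: frames * 320].reshape(frames, 320) @ proj

def clean_features(noisy_feats: np.ndarray) -> np.ndarray:
    """Stand-in for the feature cleaner that predicts clean USM
    features from noisy ones (here: identity, for illustration)."""
    return noisy_feats

def vocode(feats: np.ndarray) -> np.ndarray:
    """Stand-in for the neural vocoder: maps features back to a
    waveform (here: a fixed linear readout)."""
    readout = np.random.default_rng(1).standard_normal((1024, 320))
    return (feats @ readout).reshape(-1)

def restore(noisy_audio: np.ndarray) -> np.ndarray:
    """Noisy audio -> SSL features -> cleaned features -> waveform."""
    return vocode(clean_features(extract_features(noisy_audio)))

noisy = np.random.default_rng(2).standard_normal(16000)   # 1 s at 16 kHz
restored = restore(noisy)
```

The key point the sketch captures is the division of labor: restoration happens in the frozen SSL feature space, and waveform synthesis is delegated to a separate vocoder.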
Architectural Efficiency and Optimization
A key aspect of Miipher-2 is its computational efficiency, which is crucial given the scale of data involved. Parallel adapters (PAs) replace conventional, heavier feature-cleaner architectures, reducing the memory footprint and accelerating processing. Additionally, the integration of the WaveFit neural vocoder, with optimized memory usage, delivers the efficiency required to process up to a million hours of speech in roughly three days using consumer-grade accelerators.
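The parallel-adapter idea can be illustrated with a small numerical sketch: a lightweight bottleneck branch runs alongside a frozen pretrained layer, and the two outputs are summed; only the adapter weights would be trained. The dimensions and names below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, d_bottleneck, n_frames = 1024, 64, 50

# Frozen pretrained layer (here: a fixed linear map).
W_frozen = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Parallel adapter: down-project, nonlinearity, up-project.
# Only these weights would be updated during adaptation.
W_down = rng.standard_normal((d_model, d_bottleneck)) / np.sqrt(d_model)
W_up = np.zeros((d_bottleneck, d_model))   # zero-init: adapter starts as a no-op

def frozen_layer(x: np.ndarray) -> np.ndarray:
    return x @ W_frozen

def adapter(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W_down, 0.0) @ W_up   # ReLU bottleneck branch

def layer_with_parallel_adapter(x: np.ndarray) -> np.ndarray:
    # The adapter runs in parallel with the frozen layer; outputs are summed,
    # so the layer's output shape is unchanged.
    return frozen_layer(x) + adapter(x)

x = rng.standard_normal((n_frames, d_model))
y = layer_with_parallel_adapter(x)
```

Because the bottleneck is small relative to the model dimension, the trainable parameter count and memory overhead stay modest, which is what makes this form of adaptation attractive at million-hour scale.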
Competitive Performance Metrics
The experimental evaluation demonstrates Miipher-2’s efficacy in speech restoration, delivering performance comparable or superior to existing models in word error rate (WER), speaker similarity, and MOS scores. Importantly, these results hold across languages, including some not represented in the training data, demonstrating the model’s broad applicability.
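Word error rate, the first metric above, is the word-level edit distance between a reference and a hypothesis transcript, normalized by the reference length: WER = (S + D + I) / N. A minimal implementation (not the paper's evaluation code) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / N,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the bat sat")` gives 1/3: one substitution over a three-word reference. In restoration evaluations, the hypothesis typically comes from running an ASR system on the restored audio.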
Practical and Theoretical Implications
The implications of Miipher-2 are significant for both practical applications and theoretical understanding in AI. Practically, the framework enables the cleaning and enhancement of massive speech datasets, which is pivotal for developing text-to-speech and other audio-centric models. Theoretically, it contributes substantially to research on self-supervised learning, underscoring the ability of such models to generalize without explicit conditioning information.
Looking Forward: Future Developments in AI
Miipher-2 sets a foundation for further exploration into speech restoration practices, particularly those requiring efficient processing of large datasets. Future research could investigate extending the methodology for broader audio applications beyond speech, enriching multi-modal generative model training datasets. Moreover, the methodology could inspire innovations in low-resource language support, pushing the boundaries of AI inclusivity.
In summary, Miipher-2 proves to be an effective, efficient model capable of tackling the inherent challenges of large-scale data cleaning, offering promising directions for evolving speech processing technologies and AI applications on a global scale.