- The paper introduces SPIRAL, which leverages a teacher-student framework and an in-utterance contrastive loss to learn perturbation-invariant speech representations.
- It combines convolutional layers, Transformer blocks, and adaptive SpecAugment to enhance efficiency and noise robustness during training.
- Experimental results on standard benchmarks show an 80% reduction in pre-training computational cost for Base models and up to a 13.3% relative word error rate reduction on noisy test data.
An Overview of SPIRAL: Self-supervised Perturbation-Invariant Representation Learning for Speech Pre-Training
The paper presents SPIRAL, a novel approach to speech pre-training designed to learn denoising representations of perturbed data via a teacher-student framework. Traditional speech recognition systems rely on large amounts of labeled data, which can be costly and impractical for low-resource languages or specialized domains. Self-supervised methods that do not require such labels have therefore garnered research attention for their potential to mitigate these challenges.
Methodology
SPIRAL applies a teacher-student architecture in which both networks share the same structure but receive different inputs. The teacher network receives clean speech, while the student network is fed a perturbed version of the same utterance. The student is trained to produce denoised representations that closely match the teacher's by minimizing an in-utterance contrastive loss. Within this framework, the teacher's weights are updated as an exponential moving average of the student's weights, so the prediction target evolves smoothly as training progresses.
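The two ingredients above can be sketched in a few lines. This is a minimal, illustrative version, not the paper's implementation: parameters are flat lists rather than network tensors, and the function names are my own. The key points it shows are (a) the teacher tracking the student via an exponential moving average, and (b) a contrastive loss whose positive is the teacher frame at the same position and whose negatives are teacher frames elsewhere in the *same* utterance.

```python
import math

def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter toward the student's (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def in_utterance_contrastive_loss(student_frames, teacher_frames, temperature=0.1):
    """For each student frame i, the teacher frame at position i is the positive;
    teacher frames at other positions in the same utterance are the negatives."""
    loss = 0.0
    for i, s in enumerate(student_frames):
        logits = [cosine(s, t) / temperature for t in teacher_frames]
        # cross-entropy against target index i (the time-aligned teacher frame)
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_sum)
    return loss / len(student_frames)
```

With aligned student/teacher frames the loss is low; permuting the teacher frames raises it, which is exactly the signal that pushes the student toward the teacher's clean-input representation.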
To prevent potential representation collapse, the authors introduce position randomization, adding random padding to both sides of the teacher's input. This method disrupts positional correlations that the model might exploit, reducing the possibility of trivial solutions such as positional collapse.
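As a rough sketch of that idea (the frame representation and helper name are assumptions, not the paper's code), random amounts of padding are added to both ends of the teacher's input, so a target frame's absolute position no longer lines up trivially with the student frame predicting it:

```python
import random

def randomize_positions(frames, max_pad=10, pad_value=0.0, rng=random):
    """Prepend and append a random number of padding frames to the teacher's
    input, breaking the absolute-position correspondence with the student's
    input. Returns the padded frames and the left offset, which is needed
    later to re-align the teacher's outputs with the student's targets."""
    left = rng.randrange(max_pad + 1)
    right = rng.randrange(max_pad + 1)
    dim = len(frames[0])
    pad = lambda n: [[pad_value] * dim for _ in range(n)]
    return pad(left) + frames + pad(right), left
```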
Architecture and Implementation
The SPIRAL architecture is composed of convolutional layers followed by Transformer blocks, together with projection heads and a predictor. The design employs aggressive temporal down-sampling, which makes training efficient. The authors apply adaptive SpecAugment, a dynamic augmentation technique, as the primary mode of input perturbation, driving the model toward the noise-invariant representations needed for downstream tasks.
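The masking-based perturbation can be illustrated with a small sketch. This is a simplified stand-in for adaptive SpecAugment, not the authors' implementation: the spectrogram is a list of time frames, and "adaptive" is rendered here only as scaling the number of time masks with utterance length; the parameter names are my own.

```python
import random

def adaptive_specaugment(spec, time_mask_ratio=0.05, n_freq_masks=1,
                         max_freq_width=2, rng=random):
    """Zero out random time spans and frequency bands of a spectrogram
    (list of time frames, each a list of frequency bins). The number of
    time masks grows with utterance length, so long and short utterances
    are perturbed proportionally. Operates on a copy."""
    n_time, n_freq = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    n_time_masks = max(1, int(time_mask_ratio * n_time))
    for _ in range(n_time_masks):
        width = rng.randrange(1, max(2, n_time // 10))
        start = rng.randrange(max(1, n_time - width))
        for t in range(start, min(n_time, start + width)):
            out[t] = [0.0] * n_freq          # mask a span of time steps
    for _ in range(n_freq_masks):
        width = rng.randrange(1, max_freq_width + 1)
        start = rng.randrange(max(1, n_freq - width))
        for row in out:
            for f in range(start, min(n_freq, start + width)):
                row[f] = 0.0                 # mask a band of frequency bins
    return out
```

In SPIRAL this masked view feeds the student, while the teacher sees the unmasked input, so the contrastive objective forces the student to be invariant to the masking.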
A salient feature of the model is its ability to conduct multi-condition pre-training by introducing varied noise types in its training data, thereby improving noise robustness compared to models that only use multi-condition training during fine-tuning.
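Multi-condition training amounts to mixing each clean utterance with a randomly chosen noise at a randomly drawn signal-to-noise ratio before it reaches the student. A minimal sketch, assuming 1-D waveform samples and noises at least as long as the utterances (the helper names are hypothetical):

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return [c + scale * n for c, n in zip(clean, noise)]

def multi_condition_batch(clean_utts, noises, snr_range=(0.0, 20.0), rng=random):
    """Perturb each utterance with a random noise at a random SNR; in SPIRAL's
    setup, this perturbed view goes to the student while the teacher still
    receives the clean utterance."""
    return [mix_at_snr(u, rng.choice(noises)[:len(u)], rng.uniform(*snr_range))
            for u in clean_utts]
```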
Experimental Results
The experimental evaluations demonstrate SPIRAL's effectiveness against existing methods such as wav2vec 2.0 and HuBERT. On standard benchmarks including LibriSpeech and Libri-Light, SPIRAL achieves competitive or superior performance with a substantial reduction in pre-training computational cost: 80% for Base models and 65% for Large models. Particularly noteworthy is the model's robustness to noise: multi-condition pre-training yields up to a 13.3% relative word error rate reduction on noisy test data compared to applying multi-condition training only at the fine-tuning stage.
Implications and Future Directions
SPIRAL's methodology highlights the importance of learning noise-robust, perturbation-invariant representations. Its efficient training and high performance make it a formidable framework for real-world applications where noise is prevalent, such as in industrial-scale ASR systems. The reduced reliance on labeled data aligns with the broader movement towards self-supervised learning, which is vital for resource-constrained settings.
Future research could explore the expansion of SPIRAL's principles to modalities beyond speech, such as image or text data, potentially leading to more generalized self-supervised learning frameworks. Additionally, integrating SPIRAL with other unsupervised or semi-supervised learning strategies could further enhance its robustness and adaptability.
Conclusion
SPIRAL offers a promising advancement in the speech recognition field, addressing both efficiency in computation and resilience to noise, without succumbing to representational collapse. Its contribution to minimizing the annotation bottleneck while maintaining high performance makes it a significant stride in the ongoing evolution of ASR systems.