Overview of One-pass Compression of Speech Foundation Models
The paper presents an approach for compressing speech foundation models such as wav2vec2.0 and HuBERT. The research addresses the significant memory and computational demands of these models, which limit their deployment in resource-constrained environments. The proposed method integrates model pruning and parameter update into a single, efficient pass, offering substantial reductions in model size without sacrificing accuracy.
Compression Approach and Results
The method employs sparsity-aware self-pinching gates: compact, layer-level tied gates, each containing a single learnable threshold. These gates enable fine-grained neuron-level pruning during training, optimizing weight magnitudes and the sparse structure simultaneously. Experiments on the LibriSpeech-100hr corpus showed that the approach reduces the parameters of wav2vec2.0-base and HuBERT-large by 65% and 60%, respectively, without a statistically significant increase in word error rate (WER).
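To make the idea concrete, the following is a minimal sketch of a threshold-based soft pruning gate of the kind described above. It is an illustrative interpretation, not the paper's exact formulation: the sigmoid temperature, the function name, and the magnitude-based gating rule are all assumptions. The key property it demonstrates is that a single learnable threshold per layer can smoothly suppress small-magnitude weights while keeping the mask differentiable, so pruning and parameter update can happen in the same training pass.

```python
import numpy as np

def self_pinching_gate(weights, threshold, temperature=0.05):
    """Illustrative soft pruning gate (hypothetical sketch).

    Weights whose magnitude falls below a per-layer learnable
    `threshold` are smoothly gated toward zero via a sigmoid,
    keeping the operation differentiable so the threshold and
    the surviving weights can be trained jointly.
    """
    gate = 1.0 / (1.0 + np.exp(-(np.abs(weights) - threshold) / temperature))
    return weights * gate

# Example: large-magnitude weights pass through, tiny ones are pinched off.
w = np.array([0.9, -0.02, 0.4, 0.001, -0.7])
pruned = self_pinching_gate(w, threshold=0.1)
```

After training, weights whose gated value is effectively zero can be dropped outright, yielding the final sparse model; during training the soft gate lets gradient descent decide both the threshold and which neurons survive.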
The results indicate superior performance compared to previous compression techniques: a WER of 7.05% on the test-clean set at a model compression ratio of 4.26x, achieved with at least 25% less compression time. This highlights the method's effectiveness and efficiency in maintaining accuracy while significantly reducing model size.
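Parameter-reduction percentages and compression ratios are two views of the same quantity, and converting between them helps when comparing the figures above. The helper below is a trivial illustration (the function name is ours): a 65% parameter reduction leaves 35% of the weights, i.e. a 1/0.35 ≈ 2.86x ratio, while the reported 4.26x ratio corresponds to retaining roughly 1/4.26 ≈ 23% of the parameters.

```python
def compression_ratio(reduction_fraction):
    """Convert a fractional parameter reduction to a compression ratio.

    E.g. a 0.65 (65%) reduction leaves 35% of the parameters,
    giving a 1 / 0.35 = ~2.86x compression ratio.
    """
    return 1.0 / (1.0 - reduction_fraction)

def retained_fraction(ratio):
    """Inverse view: a 4.26x ratio retains 1 / 4.26 = ~23% of parameters."""
    return 1.0 / ratio
```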
Implications and Future Directions
The implications of this research are profound for practical deployment of speech recognition systems in constrained settings. By significantly reducing the computational overhead and memory footprint, this method enables more efficient real-time processing on devices with limited resources. The approach sets a precedent for integrating pruning and optimization stages, potentially guiding future model design and refinement strategies in AI.
Theoretically, the integration of pruning and parameter update into a single coherent stage offers a new perspective on model optimization strategies. It balances the trade-off between compression and accuracy without extensive post-training adjustments, providing a framework that can be adapted to various architectures and applications beyond speech recognition.
Future research could explore the application of self-pinching gates to other types of foundation models and investigate the interplay between model architecture and pruning sensitivity to further optimize pruning strategies. Additionally, exploring adaptive thresholds and dynamic sparsity levels may help tailor the compression process to specific model requirements or data characteristics.
Conclusion
This paper contributes a valuable method for advancing the compression of speech foundation models. By focusing on a unified pruning and parameter update approach using sparsity-aware mechanisms, it addresses critical constraints in deploying sophisticated speech models in practical applications. The promising results pave the way for further exploration into efficient model compression, with the potential to influence broader AI model design and optimization strategies.