CoLLD: Contrastive Layer-to-layer Distillation for Compressing Multilingual Pre-trained Speech Encoders
Abstract: Large-scale self-supervised pre-trained speech encoders outperform conventional approaches in speech recognition and translation tasks. Due to the high cost of developing these large models, building new encoders for new tasks and deploying them to on-device applications are infeasible. Prior studies propose model compression methods to address this issue, but those works focus on smaller models and less realistic tasks. Thus, we propose Contrastive Layer-to-layer Distillation (CoLLD), a novel knowledge distillation method to compress pre-trained speech encoders by leveraging masked prediction and contrastive learning to train student models to copy the behavior of a large teacher model. CoLLD outperforms prior methods and closes the gap between small and large models on multilingual speech-to-text translation and recognition benchmarks.
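The abstract describes the core idea of CoLLD: the student receives masked inputs and is trained, layer by layer, to match the teacher's representations through a contrastive objective. Below is a minimal, hypothetical sketch of such a layer-to-layer contrastive (InfoNCE-style) distillation loss. The function names, tensor shapes, temperature, and the choice of negatives (other frames of the same utterance) are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of a layer-to-layer contrastive distillation loss.
# Assumptions: the student sees a masked input, the teacher sees the clean input,
# and for each distilled layer an InfoNCE-style loss pulls each masked student frame
# toward the matching teacher frame, using the other frames as negatives.
import torch
import torch.nn.functional as F


def layer_contrastive_loss(student_layer, teacher_layer, masked_idx, temperature=0.1):
    """student_layer, teacher_layer: (T, D) frame representations of one utterance.
    masked_idx: LongTensor of masked frame positions where the loss is applied."""
    s = F.normalize(student_layer[masked_idx], dim=-1)   # (M, D) anchors (masked frames)
    t = F.normalize(teacher_layer, dim=-1)                # (T, D) candidates
    logits = s @ t.T / temperature                        # (M, T) cosine similarities
    targets = masked_idx                                   # positive = same frame index
    return F.cross_entropy(logits, targets)


def layerwise_distillation_loss(student_layers, teacher_layers, masked_idx):
    # Average the contrastive loss over all distilled (student, teacher) layer pairs.
    losses = [layer_contrastive_loss(s, t, masked_idx)
              for s, t in zip(student_layers, teacher_layers)]
    return torch.stack(losses).mean()
```

As a usage sketch, `student_layers` and `teacher_layers` would be lists of hidden states taken from the layers chosen for distillation, and only the student's parameters receive gradients; the teacher is kept frozen.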