A Comparative Analysis of Task-Agnostic Distillation Methods for Compressing Transformer Language Models
Abstract: Large language models (LLMs) have become a vital component of modern NLP, achieving state-of-the-art performance on a wide variety of tasks. However, they are often too expensive for real-world deployment because of their high inference costs. Knowledge distillation is a promising technique for improving their efficiency while retaining most of their effectiveness. In this paper, we reproduce, compare, and analyze several representative methods for task-agnostic (general-purpose) distillation of Transformer LLMs. The methods we study are Output Distribution (OD) transfer, Hidden State (HS) transfer with various layer mapping strategies, and Multi-Head Attention (MHA) transfer based on MiniLMv2. Through extensive experiments, we evaluate the effectiveness of each method across different student architectures in both monolingual (English) and multilingual settings. Overall, we show that MHA transfer based on MiniLMv2 is generally the best option for distillation and explain the potential reasons behind its success. Moreover, we show that HS transfer remains a competitive baseline, especially under a sophisticated layer mapping strategy, while OD transfer consistently lags behind the other approaches. Findings from this study helped us deploy efficient yet effective student models for latency-critical applications.
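To make the three objectives concrete, the sketch below gives one plausible PyTorch formulation of each distillation loss. The function names, tensor shapes, layer mapping, and relation-head count are illustrative assumptions for exposition, not the exact configuration used in the experiments.

```python
# Minimal sketch (PyTorch) of the three task-agnostic distillation objectives
# compared in this paper. Shapes, the layer mapping, and the hyperparameter
# defaults are illustrative assumptions, not the authors' exact setup.
import torch.nn.functional as F


def od_loss(student_logits, teacher_logits, temperature=2.0):
    """Output Distribution (OD) transfer: KL divergence between the softened
    teacher and student distributions over the masked-token vocabulary."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def hs_loss(student_hidden, teacher_hidden, layer_map, proj):
    """Hidden State (HS) transfer: MSE between student layers and the teacher
    layers they are mapped to. `layer_map` lists (student, teacher) layer-index
    pairs (e.g. a uniform mapping); `proj` is a learned linear projection that
    matches the student hidden size to the teacher's."""
    loss = 0.0
    for s_idx, t_idx in layer_map:
        loss = loss + F.mse_loss(proj(student_hidden[s_idx]),
                                 teacher_hidden[t_idx].detach())
    return loss / len(layer_map)


def mha_relation_loss(student_qkv, teacher_qkv, num_relation_heads=48):
    """MiniLMv2-style MHA transfer: KL divergence between the self-attention
    relations (scaled Q-Q, K-K, and V-V dot products) of one student layer and
    one teacher layer. Re-splitting the vectors into `num_relation_heads`
    relation heads means teacher and student head counts need not match."""

    def relations(x):
        # x: (batch, seq_len, hidden); hidden must divide evenly into relation heads.
        b, s, h = x.shape
        d = h // num_relation_heads
        x = x.reshape(b, s, num_relation_heads, d).transpose(1, 2)  # (b, R, s, d)
        scores = x @ x.transpose(-1, -2) / d ** 0.5                 # (b, R, s, s)
        return F.log_softmax(scores, dim=-1)

    loss = 0.0
    for s_x, t_x in zip(student_qkv, teacher_qkv):  # [Q, K, V] vector pairs
        loss = loss + F.kl_div(relations(s_x), relations(t_x.detach()).exp(),
                               reduction="batchmean")
    return loss / len(student_qkv)
```

Note that MiniLMv2-style transfer typically distills a single upper teacher layer into the last student layer, which helps explain why it is relatively insensitive to mismatches in depth and head count between teacher and student.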
References
- Layer normalization. arXiv preprint arXiv:1607.06450.
- BinaryBERT: Pushing the limit of BERT quantization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4334–4348, Online. Association for Computational Linguistics.
- Matan Ben Noach and Yoav Goldberg. 2020. Compressing pre-trained language models by matrix decomposition. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 884–889, Suzhou, China. Association for Computational Linguistics.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- DRONE: Data-aware low-rank compression for large NLP models. In Advances in Neural Information Processing Systems, volume 34, pages 29321–29334. Curran Associates, Inc.
- Unsupervised cross-lingual representation learning at scale.
- XNLI: Evaluating cross-lingual sentence representations.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations.
- RAIL-KD: RAndom intermediate layer mapping for knowledge distillation. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1389–1400, Seattle, United States. Association for Computational Linguistics.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Annealing knowledge distillation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2493–2504, Online. Association for Computational Linguistics.
- What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.
- Improving task-agnostic BERT distillation with layer mapping search. Neurocomputing, 461:194–203.
- TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, Online. Association for Computational Linguistics.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- I-BERT: Integer-only BERT quantization. In International Conference on Machine Learning, pages 5506–5518. PMLR.
- Revisiting intermediate layer distillation for compressing language models: An overfitting perspective. In Findings of the Association for Computational Linguistics: EACL 2023, pages 158–175, Dubrovnik, Croatia. Association for Computational Linguistics.
- Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10619–10629, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- BERT-EMD: Many-to-many layer mapping for BERT compression with earth mover’s distance. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3009–3018, Online. Association for Computational Linguistics.
- MixKD: Towards efficient distillation of large-scale language models. In International Conference on Learning Representations.
- A global past-future early exit method for accelerating inference of pre-trained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2013–2023, Online. Association for Computational Linguistics.
- Multi-granularity structural knowledge distillation for language model compression. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1001–1011, Dublin, Ireland. Association for Computational Linguistics.
- Rethinking task-specific knowledge distillation: Contextualized corpus as better textbook. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10652–10658, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- FastBERT: a self-distilling BERT with adaptive inference time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6035–6044, Online. Association for Computational Linguistics.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
- Knowledge distillation with reptile meta-learning for pretrained language model compression. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4907–4917, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- LadaBERT: Lightweight adaptation of BERT through hybrid model compression. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3225–3234, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 5191–5198.
- XtremeDistilTransformers: Task transfer for task-agnostic distillation. arXiv preprint arXiv:2106.04563.
- Distilling linguistic context for language model compression. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 364–378, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- ALP-KD: Attention-based layer projection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13657–13665.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8815–8821.
- Densely guided knowledge distillation using multiple teacher assistants. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9395–9404.
- Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4323–4332, Hong Kong, China. Association for Computational Linguistics.
- MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158–2170, Online. Association for Computational Linguistics.
- KroneckerBERT: Significant compression of pre-trained language models through Kronecker decomposition and knowledge distillation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2116–2127, Seattle, United States. Association for Computational Linguistics.
- Multilingual representation distillation with contrastive learning. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1477–1490, Dubrovnik, Croatia. Association for Computational Linguistics.
- BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Neural architecture search for effective teacher-student knowledge transfer in language models. arXiv preprint arXiv:2303.09639.
- Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- SkipBERT: Efficient inference with shallow layer skipping. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7287–7301, Dublin, Ireland. Association for Computational Linguistics.
- MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, Online. Association for Computational Linguistics.
- MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, volume 33, pages 5776–5788. Curran Associates, Inc.
- How to distill your BERT: An empirical study on the impact of weight initialisation and distillation objectives. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1843–1852, Toronto, Canada. Association for Computational Linguistics.
- Why skip if you can combine: A simple knowledge distillation technique for intermediate layers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1016–1021, Online. Association for Computational Linguistics.
- Universal-KD: Attention-based output-grounded intermediate layer knowledge distillation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7649–7661, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Structured pruning learns compact and accurate models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, Dublin, Ireland. Association for Computational Linguistics.
- BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Online. Association for Computational Linguistics.
- Canwen Xu and Julian McAuley. 2023. A survey on model compression and acceleration for pretrained language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10566–10575.
- BERT-of-Theseus: Compressing BERT by progressive module replacing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7859–7869, Online. Association for Computational Linguistics.
- Few-shot task-agnostic neural architecture search for distilling large language models. In Advances in Neural Information Processing Systems, volume 35, pages 28644–28656. Curran Associates, Inc.
- Q8BERT: Quantized 8bit BERT. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pages 36–39. IEEE.
- Adversarial data augmentation for task-specific knowledge distillation of pre-trained transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):11685–11693.
- BERT learns to teach: Knowledge distillation with meta learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7037–7049, Dublin, Ireland. Association for Computational Linguistics.
- Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.