
MiniPLM: Knowledge Distillation for Pre-Training Language Models

Published 22 Oct 2024 in cs.CL (arXiv:2410.17215v3)

Abstract: Knowledge distillation (KD) is widely used to train small, high-performing student LMs using large teacher LMs. While effective in fine-tuning, KD during pre-training faces efficiency, flexibility, and effectiveness issues. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. In this work, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher LM's knowledge. For efficiency, MiniPLM performs offline teacher inference, allowing KD for multiple student LMs without adding training costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the training data difficulty and diversity, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 common downstream tasks, improves language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to larger training scales, evidenced by the scaling curve extrapolation. Further analysis reveals that MiniPLM supports KD across model families and enhances the pre-training data utilization. Our code, data, and models can be found at https://github.com/thu-coai/MiniPLM.

Summary

  • The paper presents a novel offline distillation method that pre-trains student models without the high computational cost of continuous teacher inference.
  • It employs a difference sampling strategy to enhance data diversity and difficulty, overcoming limitations of tokenization dependencies in traditional knowledge distillation.
  • Experimental results across nine NLP tasks show that MiniPLM achieves superior zero-shot performance while significantly reducing computational expenses.

Overview of MiniPLM: Knowledge Distillation for Pre-Training LLMs

In the burgeoning field of artificial intelligence, particularly NLP, LLMs have demonstrated unprecedented efficacy across a spectrum of tasks. These models, exemplified by GPT-3 and similar architectures, have catalyzed significant advances but carry prohibitively large computational requirements. As a countermeasure, Knowledge Distillation (KD) has been explored to transfer the capabilities of these massive models into more economically viable "student" models. The paper "MiniPLM: Knowledge Distillation for Pre-Training Language Models," authored by Yuxian Gu and colleagues, introduces a framework, MiniPLM, that addresses the shortcomings of conventional KD methods in the pre-training phase, improving efficiency, flexibility, and effectiveness.

Challenges in Traditional Knowledge Distillation

Knowledge Distillation typically involves a smaller, computationally inexpensive student model learning behaviors, outputs, and generalized knowledge from a larger teacher model. However, when applied during the pre-training stage, traditional KD approaches face multiple challenges:

  1. Efficiency: Online KD requires running teacher inference throughout training, which incurs excessive computational cost.
  2. Flexibility: Existing methods often depend on matching tokenization between teacher and student models, limiting their application across different model families.
  3. Effectiveness: Teacher-generated training data can lack difficulty and diversity, leading student models to overfit simplistic patterns and hindering generalization across varied downstream tasks.
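To make the efficiency concern concrete, here is a minimal sketch (not the paper's code) of the token-level KL objective that vanilla online KD minimizes. The key cost is that the teacher distribution on the left of the KL must be produced by a forward pass of the large teacher model for every token of every training batch:

```python
import math

def kl_divergence(p_teacher, p_student):
    """Token-level KL(teacher || student) over a next-token vocabulary."""
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

# Toy next-token distributions over a 4-word vocabulary.
# In online KD, p_teacher is recomputed by the large teacher LM
# at every training step -- the expense MiniPLM's offline inference avoids.
p_teacher = [0.7, 0.2, 0.05, 0.05]
p_student = [0.4, 0.3, 0.2, 0.1]

loss = kl_divergence(p_teacher, p_student)  # ≈ 0.21 nats
```

Because MiniPLM moves the teacher's contribution offline (into the data distribution itself), this per-step teacher forward pass disappears from the student's training loop.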

MINIPLM Framework

MiniPLM addresses these deficiencies with a novel approach that refines the distribution of training data via Difference Sampling. Key characteristics of the framework include:

  • Offline Teacher LM Inference: By computing the teacher model's knowledge offline, MiniPLM allows multiple student models to benefit from the distilled knowledge without additional training-time cost.
  • Difference Sampling Strategy: This approach selectively samples instances based on the probabilistic discrepancy between the teacher and a small reference LM, enhancing data difficulty and diversity.
  • Model Architecture Agnosticism: MiniPLM operates purely on the training corpus, which ensures compatibility across varied model architectures and tokenization schemes.
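A minimal sketch of how such ratio-based data refinement might look, assuming per-document log-likelihoods from the teacher and the small reference LM have been precomputed offline. The function name, the weighted-sampling scheme, and the toy scores are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def difference_sample(docs, logp_teacher, logp_ref, k, seed=0):
    """Keep k documents, favoring those the large teacher LM assigns
    higher probability than the small reference LM does."""
    # Sampling score: log p_teacher(x) - log p_ref(x), i.e. the log of
    # the teacher/reference probability ratio for each document.
    scores = [lt - lr for lt, lr in zip(logp_teacher, logp_ref)]
    weights = [math.exp(s) for s in scores]
    rng = random.Random(seed)
    # Weighted sampling without replacement (Efraimidis-Spirakis keys).
    keys = [rng.random() ** (1.0 / w) for w in weights]
    ranked = sorted(range(len(docs)), key=lambda i: keys[i], reverse=True)
    return [docs[i] for i in ranked[:k]]

# Illustrative per-document log-likelihoods, computed once, offline.
docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
logp_teacher = [-10.0, -12.0, -11.0, -15.0]
logp_ref     = [-14.0, -12.5, -11.2, -13.0]
refined = difference_sample(docs, logp_teacher, logp_ref, k=2)
```

Documents like `doc_a`, which the teacher values far more than the small reference model does, are preferentially retained; documents the reference model already handles well (easy, repetitive text) are down-weighted. Any student LM, regardless of tokenizer or architecture, can then pre-train on the refined corpus.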

Experimental Validation and Key Findings

Across a broad set of experiments on nine downstream NLP tasks, the efficacy of MiniPLM is rigorously demonstrated. The framework consistently outperforms traditional KD baselines such as Vanilla KD and SeqKD, achieving superior zero-shot performance, a critical indicator of pre-trained model capability. Notably, MiniPLM reaches the same downstream task accuracy at substantially reduced computational expense. The benefits also extend to large-scale pre-training scenarios, underscoring its scalability and robust performance across varied model families such as Qwen, Llama3.1, and Mamba.

Practical and Theoretical Implications

The practical implications of the MiniPLM framework are substantial. By reducing the computational overhead of distilling from large models while preserving the breadth and depth of knowledge those models capture, MiniPLM paves the way for deploying capable LMs on resource-constrained systems without sacrificing performance. Theoretically, MiniPLM affirms the importance of maintaining data diversity and difficulty during pre-training, which significantly affects a student model's ability to generalize to novel tasks.

Future Prospects

While MiniPLM has clearly demonstrated its merits, KD across diverse LLM architectures remains an open field of inquiry. Future research directions may explore optimizing the size and configuration of teacher models for larger student models, extending the difference-sampled corpus to weak-to-strong generalization strategies, and addressing practical challenges around closed-source models or APIs. Together, these directions could precipitate the next wave of innovation in language modeling and its applications.

Overall, by efficiently bridging the gap between large and small LMs, MiniPLM significantly contributes to the ongoing conversation about scalable, accessible, and computationally efficient artificial intelligence.
