Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data
Abstract: Leading open-source LLMs such as Llama-3.1-405B-Instruct are extremely capable at generating text, answering questions, and solving a variety of natural language understanding tasks. However, they incur higher inference cost and latency than smaller LLMs. Knowledge distillation provides a way to use outputs from these large, capable teacher models to train smaller student models that can serve inference at lower cost and latency while retaining comparable accuracy. We investigate the efficacy of distillation using the Llama-3.1-405B-Instruct teacher and the smaller Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct student models. The contributions of this work are: (a) we evaluate the generalizability of distillation with the above Llama-3.1 teacher-student pairs across different tasks and datasets; (b) we show that using synthetic data during distillation significantly improves the accuracy of the 8B and 70B models, and, when combined with reasoning chains, even matches or surpasses the zero-shot accuracy of the 405B model on some datasets; (c) we empirically show that distillation enables the 8B and 70B models to internalize the 405B model's reasoning ability using only standard fine-tuning (without customizing any loss function), which allows cost- and latency-efficient student-model inference; (d) we identify pitfalls in the evaluation of distillation, and present task-specific evaluation that includes both human and LLM grading as well as ground-truth-based traditional accuracy benchmarks. This methodical study brings out the fundamental importance of synthetic data quality in knowledge distillation, and of combining multiple, task-specific measures of accuracy and quality when assessing the effectiveness of distillation.
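The recipe the abstract describes — prompt the teacher for reasoning chains, then fine-tune the student on the resulting (prompt, completion) pairs with standard supervised fine-tuning — can be sketched as a minimal data-construction pipeline. This is an illustrative sketch, not the paper's implementation: `call_teacher` is a hypothetical stand-in for a chat-completion call to the Llama-3.1-405B-Instruct teacher, and its canned return string exists only so the example runs.

```python
# Sketch of building a synthetic distillation dataset with reasoning chains.
# call_teacher is hypothetical: in practice it would query the 405B teacher
# through whatever inference endpoint is available.

def call_teacher(prompt: str) -> str:
    # Placeholder response; a real call returns the teacher's generated text,
    # including its step-by-step reasoning when the prompt elicits it.
    return "Reasoning: decompose the problem, solve each step.\nAnswer: 42"

def build_record(question: str, with_reasoning: bool = True) -> dict:
    """Create one (prompt, completion) pair for standard fine-tuning."""
    prompt = question
    if with_reasoning:
        # Eliciting a chain of thought lets the student learn the teacher's
        # reasoning process, not just its final answers.
        prompt += "\nThink step by step, then give the final answer."
    return {"prompt": prompt, "completion": call_teacher(prompt)}

def build_dataset(questions: list[str]) -> list[dict]:
    """Synthetic dataset ready for standard SFT (cross-entropy on completions)."""
    return [build_record(q) for q in questions]
```

The key point the abstract makes is that nothing beyond this is required on the training side: the student (8B or 70B) is fine-tuned on such pairs with an ordinary supervised objective, with no custom distillation loss.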