Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data
Abstract: Leading open-source LLMs such as Llama-3.1-405B-Instruct are extremely capable at generating text, answering questions, and solving a variety of natural language understanding tasks. However, they incur higher inference cost and latency than smaller LLMs. Knowledge distillation provides a way to use outputs from these large, capable teacher models to train smaller student models that can serve inference at lower cost and latency while retaining comparable accuracy. We investigate the efficacy of distillation using the Llama-3.1-405B-Instruct teacher and the smaller Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct student models. The contributions of this work are: (a) we evaluate the generalizability of distillation with the above Llama-3.1 teacher-student pairs across different tasks and datasets; (b) we show that using synthetic data during distillation significantly improves the accuracy of the 8B and 70B models, and, when combined with reasoning chains, even matches or surpasses the zero-shot accuracy of the 405B model on some datasets; (c) we empirically show that distillation enables the 8B and 70B models to internalize the 405B model's reasoning ability using only standard fine-tuning (without customizing any loss function), which allows cost- and latency-efficient student-model inference; (d) we identify pitfalls in the evaluation of distillation, and present task-specific evaluation that includes both human and LLM grading as well as ground-truth-based traditional accuracy benchmarks. This methodical study brings out the fundamental importance of synthetic data quality in knowledge distillation, and of combining multiple, task-specific measures of accuracy and quality when assessing the effectiveness of distillation.
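The recipe the abstract describes — prompt the teacher for reasoning chains, then fine-tune the student on the resulting (prompt, completion) pairs with standard supervised fine-tuning — can be sketched as a minimal data-construction pipeline. This is an illustrative sketch, not the paper's implementation: `call_teacher` is a hypothetical stand-in for a chat-completion call to the Llama-3.1-405B-Instruct teacher, and its canned return string exists only so the example runs.

```python
# Sketch of building a synthetic distillation dataset with reasoning chains.
# call_teacher is hypothetical: in practice it would query the 405B teacher
# through whatever inference endpoint is available.

def call_teacher(prompt: str) -> str:
    # Placeholder response; a real call returns the teacher's generated text,
    # including its step-by-step reasoning when the prompt elicits it.
    return "Reasoning: decompose the problem, solve each step.\nAnswer: 42"

def build_record(question: str, with_reasoning: bool = True) -> dict:
    """Create one (prompt, completion) pair for standard fine-tuning."""
    prompt = question
    if with_reasoning:
        # Eliciting a chain of thought lets the student learn the teacher's
        # reasoning process, not just its final answers.
        prompt += "\nThink step by step, then give the final answer."
    return {"prompt": prompt, "completion": call_teacher(prompt)}

def build_dataset(questions: list[str]) -> list[dict]:
    """Synthetic dataset ready for standard SFT (cross-entropy on completions)."""
    return [build_record(q) for q in questions]
```

The key point the abstract makes is that nothing beyond this is required on the training side: the student (8B or 70B) is fine-tuned on such pairs with an ordinary supervised objective, with no custom distillation loss.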