Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Abstract: Deploying LLMs is challenging because they are memory-inefficient and compute-intensive to serve in practical applications. In response, researchers train smaller task-specific models by either finetuning with human labels or distilling with LLM-generated labels. However, both finetuning and distillation require large amounts of training data to achieve performance comparable to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales and uses them as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks. First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs: our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of the available data on a benchmark, whereas the same T5 model trained with standard finetuning struggles to match it even when using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step .
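To make the mechanism concrete, below is a minimal sketch of the multi-task objective the abstract describes: the small model is trained both to predict the task label and to generate the LLM-extracted rationale, with the two sequence-to-sequence losses combined. The task prefixes, the loss weight, and the toy example are illustrative assumptions rather than the released implementation; see the linked repository for the authors' code.

```python
# Sketch of rationale-as-additional-supervision, multi-task training (assumed
# prefixes "[label]" / "[rationale]" and a simple weighted sum of losses; the
# released code at google-research/distilling-step-by-step may differ).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

def multitask_loss(question, label, rationale, rationale_weight=1.0):
    """L = L_label + lambda * L_rationale, each a standard seq2seq cross-entropy."""
    losses = []
    for prefix, target in (("[label]", label), ("[rationale]", rationale)):
        inputs = tokenizer(f"{prefix} {question}", return_tensors="pt")
        targets = tokenizer(target, return_tensors="pt").input_ids
        losses.append(model(**inputs, labels=targets).loss)
    return losses[0] + rationale_weight * losses[1]

# Toy usage: the rationale target is assumed to come from a few-shot prompted LLM.
loss = multitask_loss(
    question="Is 17 a prime number?",
    label="yes",
    rationale="17 has no divisors other than 1 and itself, so it is prime.",
)
loss.backward()
```

At inference time only the "[label]" prefix is used, so the small model pays no extra cost for having been trained to generate rationales.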