GOLD: Generalized Knowledge Distillation via Out-of-Distribution-Guided Language Data Generation
Abstract: Knowledge distillation from LLMs is essential for their efficient deployment. Prior works have proposed generating training data with LLMs to prepare distilled models. We argue that data generation with LLMs is prone to sampling mainly from the center of the original content distribution. This limitation hinders the distilled model from learning the true underlying data distribution and causes it to forget the tails of the distribution (samples with lower probability). To this end, we propose GOLD, a task-agnostic data generation and knowledge distillation framework that employs an iterative out-of-distribution-guided feedback mechanism for the LLM. As a result, the generated data improves the generalizability of distilled models. We also introduce an energy-based OOD evaluation approach to handle noisy generated data. Extensive experiments on 10 classification and sequence-to-sequence tasks in NLP show that GOLD outperforms prior methods and the LLM with average improvements of 5% and 14%, respectively. We further show that the proposed method is applicable to less explored and novel tasks. The code is available.
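To make the energy-based OOD evaluation mentioned in the abstract concrete, the sketch below applies the standard energy score of Liu et al. (2020) to rank LLM-generated samples by how out-of-distribution they look to the current student model. This is a minimal illustration under assumptions, not the paper's exact implementation; the function names (`energy_score`, `select_ood_feedback`), the temperature, and the top-k selection are illustrative.

```python
# Minimal sketch: energy-based OOD scoring of generated samples (assumed setup,
# not the authors' exact method). Lower energy ~ in-distribution; higher energy
# samples could be fed back to the LLM to steer generation toward the tails.
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score E(x) = -T * logsumexp(f(x) / T) over class logits."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def select_ood_feedback(logits: torch.Tensor, texts: list[str], k: int = 8) -> list[str]:
    """Pick the k generated samples the student finds most OOD (highest energy)."""
    energies = energy_score(logits)                      # shape: (num_samples,)
    top = torch.topk(energies, k=min(k, len(texts))).indices
    return [texts[i] for i in top.tolist()]


# Usage sketch: `student_logits` are the student's class logits for each
# generated text; the returned examples would seed the next generation round.
student_logits = torch.randn(32, 4)                      # placeholder logits
generated_texts = [f"sample {i}" for i in range(32)]     # placeholder texts
feedback = select_ood_feedback(student_logits, generated_texts)
```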