PANDA: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation
Abstract: Prompt Transfer (PoT) is a recently proposed approach to improving prompt-tuning, in which the target prompt is initialized with an existing prompt trained on a similar source task. However, this vanilla PoT approach usually achieves sub-optimal performance, because (i) PoT is sensitive to the similarity of the source-target pair, and (ii) directly fine-tuning a prompt initialized from the source prompt on the target task can cause forgetting of the useful general knowledge learned on the source task. To address these issues, we propose a new metric that accurately predicts prompt transferability (for (i)), and a novel PoT approach (namely PANDA) that leverages knowledge distillation to effectively alleviate knowledge forgetting (for (ii)). Extensive and systematic experiments on 189 combinations of 21 source and 9 target datasets across 5 scales of PLMs demonstrate that: 1) our proposed metric predicts prompt transferability well; 2) PANDA consistently outperforms the vanilla PoT approach by 2.3% average score (up to 24.1%) across all tasks and model sizes; and 3) with PANDA, prompt-tuning achieves competitive and even better performance than model-tuning at various PLM scales. Our code is publicly available at https://github.com/WHU-ZQH/PANDA.
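To make the distillation component concrete, below is a minimal PyTorch sketch of one way a prompt-transfer-with-distillation training step could look. It assumes the common formulation in which the frozen PLM conditioned on the (frozen) source prompt acts as the teacher and the trainable target prompt as the student; the `plm(input_ids, prompt=...)` interface, the `kd_weight` and `temperature` hyperparameters, and the KL-based distillation term are our own illustrative assumptions, not necessarily the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F


def pot_kd_step(plm, source_prompt, target_prompt, batch, optimizer,
                temperature=2.0, kd_weight=0.5):
    """One hypothetical training step of prompt transfer with distillation.

    `plm` is a frozen pretrained model assumed to accept prepended soft-prompt
    embeddings via a `prompt=` argument (hypothetical interface). Only the
    `target_prompt` parameters receive gradients, as in prompt-tuning.
    """
    # Teacher pass: frozen PLM conditioned on the frozen source prompt.
    with torch.no_grad():
        teacher_logits = plm(batch["input_ids"], prompt=source_prompt)

    # Student pass: the same frozen PLM conditioned on the trainable
    # target prompt (initialized from the source prompt, per PoT).
    student_logits = plm(batch["input_ids"], prompt=target_prompt)

    # Standard supervised loss on the target task.
    task_loss = F.cross_entropy(student_logits, batch["labels"])

    # Distillation loss: KL divergence between softened teacher and student
    # distributions, discouraging the target prompt from drifting away from
    # the general knowledge captured by the source prompt. The temperature**2
    # factor is the usual Hinton-style scaling that keeps gradient magnitudes
    # comparable across temperatures.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = task_loss + kd_weight * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, setting `kd_weight=0` recovers vanilla PoT (plain fine-tuning of the transferred prompt), which is the baseline the distillation term is meant to improve on.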