
What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition

Published 23 Jan 2024 in cs.CL and cs.AI | arXiv:2401.12756v2

Abstract: The knowledge encapsulated in a model is the core factor determining its final performance on downstream tasks. Much research in NLP has focused on efficient methods for storing and adapting different types of knowledge, e.g., in dedicated modularized structures, and on how to effectively combine these, e.g., by learning additional parameters. However, given the many possible options, a thorough understanding of the mechanisms involved in these compositions is missing, and hence it remains unclear which strategies to utilize. To address this research gap, we propose a novel framework for zero-shot module composition, which encompasses existing and some novel variations for selecting, weighting, and combining parameter modules under a single unified notion. Focusing on the scenario of domain knowledge and adapter layers, our framework provides a systematic unification of concepts, allowing us to conduct the first comprehensive benchmarking study of various zero-shot knowledge composition strategies. In particular, we test two module combination methods and five selection and weighting strategies for their effectiveness and efficiency in an extensive experimental setup. Our results highlight the efficacy of ensembling but also hint at the power of simple though often-ignored weighting methods. Further in-depth analyses allow us to understand the role of weighting vs. top-k selection, and show that, to a certain extent, the performance of adapter composition can even be predicted.


Summary

  • The paper introduces a unified framework for zero-shot knowledge composition that utilizes adapter layers to select, weight, and combine domain-specific knowledge without additional training.
  • It benchmarks five adapter weighting strategies across 21 training and 10 evaluation domains, with ensembling consistently outperforming parameter averaging.
  • The research provides meta-regression analysis and publicly available resources, ensuring reproducibility and guiding future investigations in knowledge modularization and domain adaptation.

Introduction

Pre-trained language models (PLMs) such as GPT and BERT have dramatically advanced the field of NLP, which can be attributed to the vast amount of knowledge encapsulated within their parameters. In pursuit of optimizing the use of PLMs for domain-specific tasks, a considerable amount of research has focused on strategies for knowledge modularization and composition. Particularly in zero-shot settings, the goal is to leverage and combine knowledge from various pre-trained modules to improve performance on target domains without additional training.

A Unified Composition Framework

The paper introduces a novel, comprehensive framework for zero-shot knowledge composition, applicable across various scenarios and modular structures. It centers on adapter layers for domain adaptation and unfolds in three conceptual steps: selecting relevant adapters, weighting them, and performing the final combination. The paper details five scoring strategies for adapter selection and weighting: Uniform, Semantic Sentence Similarity, TF-IDF, Domain Prior, and Entropy. Using these scores, the framework combines modules either by parameter averaging, akin to "model souping," or by output vector ensembling.
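
To make the three steps concrete, here is a minimal, hypothetical Python sketch of the pipeline: relevance scores produced by one of the selection and weighting strategies are normalized into adapter weights, which are then used either to average adapter parameters or to ensemble adapter outputs. The function names and data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weight_adapters(scores, top_k=None, temperature=1.0):
    """Normalize raw domain-relevance scores (e.g. from TF-IDF or
    sentence-embedding similarity) into adapter weights via softmax,
    optionally keeping only the top-k adapters."""
    items = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    names, vals = zip(*items)
    vals = np.asarray(vals, dtype=float) / temperature
    w = np.exp(vals - vals.max())
    w /= w.sum()
    return {name: float(x) for name, x in zip(names, w)}

def average_parameters(adapter_params, weights):
    """Parameter averaging: build a single merged adapter whose tensors
    are the weighted sum of the selected adapters' tensors."""
    merged = {}
    for name, w in weights.items():
        for key, tensor in adapter_params[name].items():
            merged[key] = merged.get(key, 0.0) + w * tensor
    return merged

def ensemble_outputs(adapter_outputs, weights):
    """Output ensembling: run the model once per adapter and combine the
    resulting output vectors (e.g. logits) with the same weights."""
    return sum(w * adapter_outputs[name] for name, w in weights.items())

# Example: three hypothetical domain adapters scored against a target corpus.
scores = {"reviews": 0.8, "news": 0.5, "legal": 0.1}
weights = weight_adapters(scores, top_k=2)
print(weights)  # roughly {'reviews': 0.57, 'news': 0.43}
```

In the paper's terms, parameter averaging corresponds to the "model souping" style of merging, while output ensembling keeps each adapter's forward pass separate; the five scoring strategies differ only in how the relevance scores are produced.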

Benchmarking Composition Strategies

Extensive experiments across 21 training and 10 evaluation domains, involving three distinct models (gpt2-base, gpt2-large, deberta-base), benchmark the composition strategies for zero-shot domain adaptation. Results demonstrate that ensembling typically surpasses parameter averaging in effectiveness. Furthermore, contrary to expectations, corpus-based strategies such as TF-IDF and sentence similarity often outperform more complex model-based approaches for adapter weighting while being more efficient. A meta-regression analysis was also conducted to predict the performance of adapter combinations on unseen domains, which proved partially successful, particularly for specific adapter compositions.
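
The meta-regression idea can be sketched as a standard supervised workflow: describe each adapter composition and target domain with a few features, fit a regressor to the observed zero-shot scores, and check how well held-out compositions are predicted. The features and synthetic data below are placeholders illustrating that workflow under stated assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder meta-dataset: one row per (adapter composition, target domain)
# pair. Plausible features could be corpus similarity between source and
# target domains, the number of adapters selected, and the entropy of the
# adapter weights; the label is the observed zero-shot performance.
n_pairs = 21 * 10          # 21 training x 10 evaluation domains
X = rng.normal(size=(n_pairs, 3))
y = X @ np.array([0.6, -0.2, 0.1]) + rng.normal(scale=0.1, size=n_pairs)

meta_model = Ridge(alpha=1.0)
r2_scores = cross_val_score(meta_model, X, y, scoring="r2", cv=5)
print(f"mean cross-validated R^2: {r2_scores.mean():.2f}")
```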

This study builds on extensive literature concerning knowledge modularization and composition, differentiating itself by providing a unified framework and analysis across various methods. The paper's experimental settings and resources are detailed to ensure reproducibility. For complete transparency and to support further research, the authors have made the code and models publicly available.

Conclusion

In summary, this research presents a unified approach to zero-shot knowledge composition together with a detailed benchmarking study evaluating various strategies. It highlights the efficacy of ensembling over parameter averaging and the surprising effectiveness and efficiency of simple adapter weighting techniques. Through meta-regression, it also opens avenues for predicting the performance of domain adaptation methods, streamlining future explorations. With its publication, the authors encourage further investigations into effective knowledge composition, aiming to further enhance the adaptability and efficiency of NLP systems.
