Investigating Continual Pretraining in Large Language Models: Insights and Implications
Abstract: Continual learning (CL) in LLMs is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and to achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Because existing work concentrates mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect knowledge transfer within these models. Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B-parameter models studied in this work and is superior to standalone domain adaptation; (ii) larger models consistently achieve lower perplexity than smaller ones when continually pretrained on the same corpus; (iii) smaller models are particularly sensitive to continual pretraining, showing the highest rates of both learning and forgetting; (iv) continual pretraining boosts the downstream-task performance of the GPT-2 family; (v) continual pretraining enables LLMs to specialize better when the sequence of domains is semantically similar, whereas randomizing the order of training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
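The evaluation protocol the abstract describes, sequentially pretraining on a stream of domains and re-measuring perplexity on every domain after each stage to expose transfer and forgetting, can be illustrated with a minimal sketch. This is a toy PyTorch model on synthetic "domains" (biased token distributions), not the paper's actual pipeline; all names and hyperparameters here are illustrative assumptions.

```python
# Minimal sketch of continual domain-adaptive pretraining with a toy LM.
# Each synthetic "domain" is a biased token distribution; after training on
# each domain in sequence, we evaluate perplexity on all domains.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM = 32, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)

def make_domain(bias):
    # A "domain" = sequences sampled from a biased unigram distribution.
    probs = torch.softmax(torch.randn(VOCAB) + bias, dim=0)
    return torch.multinomial(probs, 64 * 33, replacement=True).view(64, 33)

def perplexity(model, data):
    with torch.no_grad():
        logits = model(data[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), data[:, 1:].reshape(-1))
    return math.exp(loss.item())

domains = [make_domain(b) for b in (0.0, 2.0, 4.0)]
model = TinyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Sequentially pretrain on each domain; after every stage, evaluate
# perplexity on all domains to expose transfer and forgetting.
history = []
for data in domains:
    for _ in range(200):
        opt.zero_grad()
        logits = model(data[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), data[:, 1:].reshape(-1))
        loss.backward()
        opt.step()
    history.append([perplexity(model, d) for d in domains])

for stage, ppls in enumerate(history):
    print(f"after domain {stage}: " + " ".join(f"{p:.1f}" for p in ppls))
```

Reading `history` row by row gives the stage-by-stage picture: diagonal entries show how well each domain is learned when it is current, while rises in earlier columns at later stages indicate forgetting, mirroring the learning/forgetting rates the paper measures at much larger scale.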