Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training
Abstract: We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Networks typically suffer from catastrophic interference when trained on a sequence of documents; however, we discover a remarkable property of LLMs fine-tuned sequentially in this setting: they exhibit anticipatory behavior, recovering from forgetting on documents before encountering them again. This behavior occurs even though the documents are never presented in context together, and it emerges and becomes more robust as the number of model parameters grows. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parameterized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments.
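The cyclic fine-tuning protocol the abstract describes can be sketched with a toy model. The sketch below is an illustration under stated assumptions, not the paper's setup: it replaces the LLM with a simple parameter vector, each "document" with a random target vector, and the language-modeling loss with mean squared error; the constants (`DIM`, `K`, `CYCLES`, `STEPS_PER_DOC`, `LR`) are arbitrary. It reproduces the protocol's essential structure: documents visited in a fixed, repeated order, with per-document loss recorded after every update so forgetting and recovery curves can be plotted. The anticipatory-recovery phenomenon itself is reported only for sufficiently large LLMs and should not be expected from this toy.

```python
# Sketch of the cyclic (structured non-IID) fine-tuning protocol:
# K "documents" are presented in a fixed, repeating sequence, and the
# loss on every document is logged after each gradient step.
import random

random.seed(0)

DIM, K, CYCLES, STEPS_PER_DOC, LR = 8, 4, 3, 5, 0.2

# Each "document" is a random target vector; the per-document loss is MSE.
docs = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(K)]
theta = [0.0] * DIM  # toy "model": a single parameter vector


def loss(params, doc):
    return sum((p - d) ** 2 for p, d in zip(params, doc)) / DIM


history = []  # history[t][k] = loss on document k after update t
for cycle in range(CYCLES):
    for k in range(K):  # fixed, repeated document order
        for _ in range(STEPS_PER_DOC):
            # SGD step on the current document only (non-IID training).
            theta = [p - LR * 2 * (p - d) / DIM for p, d in zip(theta, docs[k])]
            history.append([loss(theta, doc) for doc in docs])

# Loss on document 0 falls while it is trained, rises while the other
# documents are trained (interference), and falls again on each revisit.
doc0 = [row[0] for row in history]
print(f"loss(doc 0) after its block in cycle 1:     {doc0[STEPS_PER_DOC - 1]:.3f}")
print(f"loss(doc 0) just before revisit in cycle 2: {doc0[K * STEPS_PER_DOC - 1]:.3f}")
print(f"loss(doc 0) after its block in cycle 2:     {doc0[(K + 1) * STEPS_PER_DOC - 1]:.3f}")
```

Plotting `doc0` against the update index yields the sawtooth forgetting/recovery curve the paper analyzes; in the actual experiments, the interesting signal is the loss beginning to drop shortly *before* a document's turn in the cycle.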