Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training

Published 14 Mar 2024 in cs.LG and cs.CL | arXiv:2403.09613v2

Abstract: We explore the training dynamics of neural networks in a structured non-IID setting where documents are presented cyclically in a fixed, repeated sequence. Typically, networks suffer from catastrophic interference when training on a sequence of documents; however, we discover a curious and remarkable property of LLMs finetuned sequentially in this setting: they exhibit anticipatory behavior, recovering from the forgetting on documents before encountering them again. This behavior occurs even though the documents are never presented in context together. The behavior emerges and becomes more robust as the architecture scales up its number of parameters. Through comprehensive experiments and visualizations, we demonstrate a new mechanism by which over-parameterized neural networks can recover from catastrophic interference and uncover new insights into training over-parameterized networks in cyclically structured environments.

Summary

  • The paper demonstrates that structured cyclic training triggers anticipatory recovery in LLMs, effectively mitigating catastrophic interference in sequential document learning.
  • It employs systematic experiments on Pythia models ranging from 160M to 2.8B parameters to quantify recovery through reduced training losses.
  • Findings suggest that training regimes with repeated data exposure can enhance retention, and that the phenomenon extends beyond natural language to vision tasks.

Reawakening Knowledge: Anticipatory Recovery from Catastrophic Interference via Structured Training

Introduction

This work identifies a noteworthy phenomenon in the training dynamics of LLMs under structured training, termed anticipatory recovery, whereby models counteract catastrophic interference, the condition in which learning new information causes previously learned information to be forgotten. The study investigates LLMs subjected to a structured, non-IID, cyclic training regimen, focusing on how the models spontaneously recover knowledge about documents before re-encountering them in the training sequence. This anticipatory behavior strengthens as model size grows, pointing to an emergent property of over-parameterized networks.

Data and Experiment Setup

In our investigation, we employed configurations of the Pythia model ranging from 160M to 2.8B parameters, fine-tuned on the CNN/Daily Mail news dataset. The setup diverges from typical LLM training in two ways: the sequence of documents is repeated in the same order across epochs, and multiple gradient steps are taken on each document, akin to rereading the same set of chapters to deepen understanding. This regimen makes it possible to observe how models come to remember, or "anticipate", information about upcoming documents, serving as a probe of their adaptability to structured cyclic inputs.
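A minimal PyTorch-style sketch of this cyclic fine-tuning loop is given below. The checkpoint name, learning rate, epoch count, and steps per document are illustrative placeholders, not the paper's exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper fine-tunes Pythia models of several sizes.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

documents = ["first article text ...", "second article text ..."]  # fixed, ordered document list
num_epochs = 5          # number of passes through the full document cycle
steps_per_document = 4  # several gradient steps on each document before moving on

model.train()
for epoch in range(num_epochs):
    for doc in documents:  # same fixed order every epoch (cyclic, non-IID)
        batch = tokenizer(doc, return_tensors="pt", truncation=True, max_length=1024)
        for _ in range(steps_per_document):
            out = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        labels=batch["input_ids"])  # causal LM loss on the document
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```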

Emergent Anticipatory Recovery

The study demonstrates that anticipatory recovery is an emergent behavior that is most pronounced in larger LLMs. In the central experiment, the loss on a document decreases before the model is retrained on it; the model effectively recovers knowledge ahead of the scheduled re-exposure. The effect is magnified as parameter count grows, underlining the role of model capacity in facilitating such recovery.
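One way to make this concrete (a hedged sketch, not necessarily the paper's exact metric) is to track a document's evaluation loss between two consecutive visits and compare the peak loss with the loss measured just before the next visit:

```python
import numpy as np

def recovery_score(doc_losses: np.ndarray) -> float:
    """Rough anticipatory-recovery score for a single document.

    `doc_losses` holds the evaluation loss on that document measured after each
    training step between two of its visits in the cycle. If the loss climbs while
    other documents are trained and then falls again *before* the document is
    revisited, the score is positive.
    """
    peak = doc_losses.max()      # worst forgetting during the interval
    final = doc_losses[-1]       # loss recorded just before the next visit
    return float(peak - final)   # > 0 indicates anticipatory recovery
```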

Other Influential Factors

Our analysis also explored a range of factors that affect anticipatory recovery, including training hyperparameters, model architecture variations, and the choice of optimizer. A key finding is that the model's ability to fit each document closely within each epoch strengthens anticipatory recovery. Moreover, perturbations of the data presentation, such as random token masking and shifting of the context window, showed that anticipatory recovery is resilient to minor randomness in how documents are presented, albeit with reduced magnitude.
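The sketch below illustrates the kind of perturbations referred to here; the masking probability and shift range are assumptions for illustration, not the paper's reported values.

```python
import random

def random_mask(token_ids, mask_token_id, mask_prob=0.1):
    """Independently replace a fraction of tokens with a mask token each epoch."""
    return [mask_token_id if random.random() < mask_prob else t for t in token_ids]

def shifted_window(token_ids, window_len=1024, max_shift=32):
    """Take a context window whose start offset jitters slightly from epoch to epoch."""
    start = random.randint(0, max_shift)
    return token_ids[start:start + window_len]
```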

Anticipatory Recovery in Vision Models

Expanding the study's scope beyond LLMs, we examined two computer vision tasks, causal image modeling and image classification, using Image GPT and Vision Transformer (ViT) models respectively. The consistent emergence of anticipatory recovery across these tasks indicates that the phenomenon is not confined to natural language processing and applies more broadly.

Understanding Cyclic Training Dynamics

To probe the mechanics behind anticipatory recovery, we analyzed the training dynamics through weight trajectories, gradient similarities, and activation patterns, and found temporal structure indicative of cyclic learning behavior. In particular, the spiral trajectory observed in weight space, together with the structured evolution of gradients and activations, clarifies how cyclic training shapes the network's learning across epochs.
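As an illustration of one such analysis (a sketch under our own assumptions, not the paper's exact code), pairwise cosine similarities between weight or gradient snapshots taken around the cycle can be computed as follows:

```python
import torch

def snapshot_similarity(snapshots):
    """Pairwise cosine similarity between flattened parameter (or gradient) snapshots.

    `snapshots` is a list of 1-D tensors captured after training on each document
    in the cycle. Block or banded structure in the returned matrix exposes the
    temporal organisation that cyclic training imprints on the weights.
    """
    unit = torch.stack([s / s.norm() for s in snapshots])  # unit-normalise each snapshot
    return unit @ unit.T                                   # [num_snapshots, num_snapshots]

# Collecting one weight snapshot after a document's gradient steps (hypothetical usage):
# snapshot = torch.cat([p.detach().flatten().cpu() for p in model.parameters()])
```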

This work aligns with the broader discourse on cyclic and structured training, online learning, and the emergence of capabilities in large-scale models, extending that discussion to anticipatory behavior as a byproduct of cyclic exposure to data. It also adds nuance to the conversation around catastrophic interference by presenting an instance in which structured repetition yields an emergent mitigation strategy.

Discussion

By spotlighting anticipatory recovery, this study uncovers a novel aspect of LLMs' learning dynamics, emphasizing that beyond sheer capacity, the structured sequencing of training data can elicit sophisticated learning behaviors in neural networks. These findings anchor anticipatory recovery as a potential avenue for mitigating catastrophic interference and suggest pathways for devising training strategies that leverage structured repetition to bolster model performance across tasks.

Moving forward, exploring anticipatory recovery in more complex, hierarchically structured environments, and investigating how to curate optimal learning curricula, are compelling directions for furthering our understanding of LLMs and their emergent properties.
