Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory
Abstract: Increasing the size of a Transformer does not always improve performance, a phenomenon that empirical scaling laws cannot explain. Moreover, a model's improved performance is closely tied to its memorization of the training samples. We present a theoretical framework that sheds light on memorization during the pre-training of Transformer-based LLMs. We model the behavior of Transformers with associative memories using Hopfield networks, such that each Transformer block effectively performs an approximate nearest-neighbor search. In particular, the energy function of modern continuous Hopfield networks provides an explanation for the attention mechanism, which we approximate with a distance-based energy function. Observing that the softmax function is the gradient of the LogSumExp function in the energy, and employing the majorization-minimization technique, we construct a global energy function that captures the layered architecture. We demonstrate a dependency between model size and dataset size for the model to achieve optimal performance, and we show that the achievable cross-entropy loss is bounded from below.
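Two of the abstract's claims are easy to check numerically. Below is a minimal NumPy sketch (not the paper's code; the patterns, query, and the inverse temperature `beta` are illustrative values chosen here): first, that the softmax function is the gradient of LogSumExp, verified by central finite differences; second, that a single modern continuous Hopfield update, which has the same form as an attention step, pulls a query toward the nearest stored pattern, i.e. acts as an approximate nearest-neighbor search.

```python
import numpy as np

def logsumexp(x):
    # numerically stable LogSumExp: log(sum(exp(x_i)))
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# 1) softmax is the gradient of LogSumExp (central-difference check)
x = np.array([0.5, -1.2, 2.0, 0.3])
eps, n = 1e-6, len(x)
grad = np.array([
    (logsumexp(x + eps * np.eye(n)[i]) - logsumexp(x - eps * np.eye(n)[i])) / (2 * eps)
    for i in range(n)
])
print(np.allclose(grad, softmax(x), atol=1e-5))  # True

# 2) one-step modern Hopfield update q -> X^T softmax(beta X q):
#    the query is retrieved as (approximately) its nearest stored pattern
beta = 8.0                                           # inverse temperature (illustrative)
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # stored patterns, one per row
q = np.array([0.9, 0.2])                             # query closest to the first pattern
q_new = softmax(beta * (X @ q)) @ X
print(np.allclose(q_new, X[0], atol=1e-2))  # True: retrieval converges to pattern 0
```

With a large `beta` the softmax weights concentrate on the best-matching pattern, which is why a single attention-style update already behaves like a nearest-neighbor lookup over the stored memories.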