Infinite Limits of Multi-head Transformer Dynamics
Abstract: In this work, we analyze various scaling limits of the training dynamics of transformer models in the feature-learning regime. We identify the set of parameterizations that admit well-defined infinite-width and infinite-depth limits while allowing the attention layers to update throughout training, a relevant notion of feature learning in these models. We then use tools from dynamical mean field theory (DMFT) to analyze several infinite limits (infinite key/query dimension, infinite heads, and infinite depth), which have different statistical descriptions depending on which limit is taken and how the attention layers are scaled. We provide numerical evidence of convergence to these limits and discuss how the parameterization qualitatively influences the learned features.
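To make the notion of "how attention layers are scaled" concrete, below is a minimal JAX sketch of multi-head self-attention in which the attention logits are divided by the key/query dimension d_head (the scaling commonly used to keep logits O(1) as d_head grows in feature-learning, muP-style parameterizations) rather than the standard sqrt(d_head), and head outputs are averaged so the layer output stays O(1) as the number of heads grows. This is not the paper's exact parameterization; all function names, shapes, and scaling choices are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's parameterization): multi-head
# self-attention with a switchable logit scaling, 1/d_head vs. 1/sqrt(d_head).
import jax
import jax.numpy as jnp


def init_attention_params(key, d_model, n_heads, d_head):
    """Per-head query/key/value/output weights with entries scaled by 1/sqrt(fan_in)."""
    kq, kk, kv, ko = jax.random.split(key, 4)
    return {
        "W_q": jax.random.normal(kq, (n_heads, d_model, d_head)) / jnp.sqrt(d_model),
        "W_k": jax.random.normal(kk, (n_heads, d_model, d_head)) / jnp.sqrt(d_model),
        "W_v": jax.random.normal(kv, (n_heads, d_model, d_head)) / jnp.sqrt(d_model),
        "W_o": jax.random.normal(ko, (n_heads, d_head, d_model)) / jnp.sqrt(d_head),
    }


def attention(params, x, logit_scale="1/d"):
    """Multi-head self-attention on x of shape (seq_len, d_model).

    logit_scale="1/d" keeps logits O(1) as d_head grows (feature-learning scaling);
    "1/sqrt(d)" is the standard transformer choice.
    """
    q = jnp.einsum("td,hde->hte", x, params["W_q"])
    k = jnp.einsum("td,hde->hte", x, params["W_k"])
    v = jnp.einsum("td,hde->hte", x, params["W_v"])
    d_head = q.shape[-1]
    denom = d_head if logit_scale == "1/d" else jnp.sqrt(d_head)
    logits = jnp.einsum("hte,hse->hts", q, k) / denom
    weights = jax.nn.softmax(logits, axis=-1)
    heads = jnp.einsum("hts,hse->hte", weights, v)
    # Average over heads (rather than concatenate) so the output stays O(1)
    # as the number of heads is taken to infinity.
    return jnp.einsum("hte,hed->td", heads, params["W_o"]) / heads.shape[0]


# Usage: run one forward pass and compare the two logit scalings at a fixed width.
key = jax.random.PRNGKey(0)
params = init_attention_params(key, d_model=64, n_heads=8, d_head=32)
x = jax.random.normal(jax.random.PRNGKey(1), (16, 64))
out = attention(params, x, logit_scale="1/d")
print(out.shape)  # (16, 64)
```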