How Smooth Is Attention?
Abstract: Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties, which are key to analyzing robustness and expressive power, remains incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and of layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor, and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound that are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
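The local Lipschitz constant studied in the abstract can be probed numerically: at a given input, it is the spectral norm of the Jacobian of the attention map. Below is a minimal illustrative sketch (not the paper's experimental code) for unmasked single-head self-attention with randomly drawn weights `Wq`, `Wk`, `Wv` and random inputs; the dimension `d = 8` and the choice of sequence lengths are assumptions made for illustration only.

```python
# Minimal sketch: estimate the local Lipschitz constant of single-head
# self-attention at a random input as the spectral norm of its Jacobian,
# for several sequence lengths n. Weights and inputs are illustrative.
import jax
import jax.numpy as jnp

d = 8  # embedding dimension (illustrative choice)
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
Wq = jax.random.normal(kq, (d, d)) / jnp.sqrt(d)
Wk = jax.random.normal(kk, (d, d)) / jnp.sqrt(d)
Wv = jax.random.normal(kv, (d, d)) / jnp.sqrt(d)

def self_attention(x_flat, n):
    """Unmasked single-head self-attention acting on a flattened (n, d) input."""
    X = x_flat.reshape(n, d)
    scores = (X @ Wq) @ (X @ Wk).T / jnp.sqrt(d)
    A = jax.nn.softmax(scores, axis=-1)   # row-stochastic attention matrix
    return (A @ X @ Wv).reshape(-1)

for n in (4, 16, 64):
    x = jax.random.normal(jax.random.PRNGKey(n), (n * d,))
    J = jax.jacfwd(lambda z: self_attention(z, n))(x)  # (n*d, n*d) Jacobian
    lip_local = jnp.linalg.norm(J, ord=2)              # largest singular value
    print(f"n = {n:3d}   local Lipschitz estimate ~ {float(lip_local):.3f}")
```

Repeating this over many random inputs in a fixed compact set gives an empirical picture of how the local Lipschitz constant grows with $n$, which is the quantity the $\sqrt{n}$ upper bound and the mean-field regime in the abstract are about.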