How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse

Published 3 Apr 2024 in cs.LG, cs.AI, and cs.CL | arXiv:2404.02690v2

Abstract: Sparse attention is a technique that approximates standard attention computation with sub-quadratic complexity by selectively ignoring smaller entries of the attention matrix during the softmax computation. Variants of this technique, such as KV-cache pruning, sparsity-based fast attention, and the Sparse Transformer, have been widely used for efficient LLM deployment. Despite this widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with exact attention has remained elusive. This work aims to $\textbf{bridge this gap by examining the inherent sparsity of standard attention}$. Our theoretical framework yields several new key insights:

$\bullet$ Attention is $n^{C}$-sparse: keeping only the largest $\Omega(n^{C})$ of the $n$ entries in each row is sufficient for sparse attention to approximate the exact attention matrix with loss that decreases as $n$ grows. Here, $n$ is the input length and $C \in (0, 1)$ is a constant.

$\bullet$ Stable $o(\log(n))$-sparse attention, which approximates the attention computation using fewer than $\log(n)$ entries, may not be feasible, since the approximation error will persist at a minimum of $\Omega(1)$.

$\bullet$ An adaptive window size of $\alpha \cdot n^{C}$ (with $\alpha \in \mathbb{R}$) for efficient attention methods, rather than a fixed one, is guaranteed to be more accurate and more efficient for inference over flexible context lengths.
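
As a purely illustrative aid (not code from the paper), the sketch below approximates exact softmax attention by keeping only the top $\alpha \cdot n^{C}$ scores in each query row and masking the rest before the softmax, in the spirit of the abstract's adaptive window-size claim. All function names, the NumPy implementation choices, the Gaussian test data, and the constants $C = 0.5$, $\alpha = 2$ are assumptions made for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exact_attention(Q, K, V):
    # Standard attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def sparse_attention(Q, K, V, C=0.5, alpha=2.0):
    # Keep only the top (alpha * n^C) scores per query row, mask the rest
    # to -inf before the softmax, then aggregate V as usual.
    n, d = K.shape
    k = min(n, max(1, int(np.ceil(alpha * n ** C))))
    scores = Q @ K.T / np.sqrt(d)
    top_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, top_idx,
                      np.take_along_axis(scores, top_idx, axis=-1), axis=-1)
    return softmax(masked) @ V

# Toy comparison on random Gaussian inputs (illustrative only).
rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = rng.standard_normal((3, n, d))
err = np.abs(exact_attention(Q, K, V) - sparse_attention(Q, K, V)).max()
print(f"max elementwise error with window 2 * n^0.5: {err:.4f}")
```

On synthetic inputs like these, the gap between the two outputs typically shrinks as $\alpha$ or $C$ increases (a larger per-row window); the exact error values depend on the input distribution and are not a substitute for the paper's formal bounds.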
