On the Long Range Abilities of Transformers
Abstract: Despite their dominance in modern deep learning and, in particular, NLP, transformer architectures exhibit sub-optimal performance on long-range tasks compared to recent layers designed specifically for this purpose. In this work, drawing inspiration from key attributes of long-range layers, such as state-space layers, linear RNN layers, and global convolution layers, we demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena (LRA) benchmark, thus narrowing the gap with these specialized layers. We identify two key principles for long-range tasks: (i) an inductive bias towards smoothness, and (ii) locality. As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters. Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies.
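The abstract does not spell out the exact modification, so the following is only a minimal sketch of how a parameter-free locality bias and score smoothing could be folded into standard scaled dot-product attention. The decay rate, the uniform smoothing kernel, and the function name `long_range_attention` are illustrative assumptions, not the authors' formulation.

```python
# Illustrative sketch only: the concrete locality bias and smoothing scheme
# below are assumptions chosen to mirror the two principles named in the
# abstract (smoothness and locality), with no trainable parameters added.
import torch
import torch.nn.functional as F

def long_range_attention(q, k, v, decay=0.02, smooth_width=3):
    """Scaled dot-product attention with two parameter-free modifications:
    a locality bias on the scores and smoothing of the scores over keys.
    q, k, v: (batch, seq_len, dim) tensors.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, L, L)
    B, L = scores.size(0), scores.size(-1)

    # (ii) Locality: subtract a bias proportional to the distance |i - j|
    # between query and key positions (an ALiBi-style penalty).
    pos = torch.arange(L, device=scores.device)
    dist = (pos[None, :] - pos[:, None]).abs().float()     # (L, L)
    scores = scores - decay * dist

    # (i) Smoothness: average each query's scores over neighboring keys
    # with a fixed uniform kernel (no trainable parameters).
    kernel = torch.ones(1, 1, smooth_width, device=scores.device) / smooth_width
    scores = F.conv1d(scores.reshape(B * L, 1, L), kernel,
                      padding=smooth_width // 2).reshape(B, L, L)

    attn = scores.softmax(dim=-1)
    return attn @ v

# Example usage on random inputs.
q = k = v = torch.randn(2, 128, 64)
out = long_range_attention(q, k, v)   # shape (2, 128, 64)
```

Both modifications operate only on the attention score matrix, which is why, as the abstract notes, they add negligible computation and no trainable parameters on top of standard attention.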