Cached Transformers: Improving Transformers with Differentiable Memory Cache
Abstract: This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, enlarging the receptive field of attention and allowing long-range dependencies to be explored. By using a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and generalizes to a broader range of scenarios.
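The sketch below illustrates how a gated recurrent token cache could be combined with self-attention, in the spirit of the GRC attention described above. It is a minimal, single-head formulation with illustrative choices the abstract does not specify: the class name GRCAttention, the pooling-based summary of current tokens, the GRU-style sigmoid gate, and the learned sigmoid ratio mixing the cached and current attention branches are all assumptions, not the paper's exact design.

```python
# Minimal sketch of gated-recurrent-cache (GRC) style attention in PyTorch.
# Single-head, simplified; gating and cache details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRCAttention(nn.Module):
    """Self-attention augmented with a recurrent token cache (sketch)."""

    def __init__(self, dim: int, cache_len: int = 64):
        super().__init__()
        self.dim = dim
        self.cache_len = cache_len
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Gate deciding how much of the old cache to overwrite (hypothetical parameterization).
        self.gate = nn.Linear(2 * dim, dim)
        # Learned scalar controlling the mix of cached vs. current attention.
        self.mix = nn.Parameter(torch.zeros(1))
        self.cache = None  # lazily initialized per batch size

    def _attend(self, x_q, x_kv):
        # Scaled dot-product attention of queries x_q over keys/values x_kv.
        q, k, v = self.q(x_q), self.k(x_kv), self.v(x_kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x):  # x: (batch, seq, dim)
        b = x.size(0)
        if self.cache is None or self.cache.size(0) != b:
            self.cache = x.new_zeros(b, self.cache_len, self.dim)

        # Summarize current tokens to the cache length and update the cache
        # with a GRU-like interpolation; the update is differentiable within this step.
        summary = F.adaptive_avg_pool1d(x.transpose(1, 2), self.cache_len).transpose(1, 2)
        g = torch.sigmoid(self.gate(torch.cat([self.cache, summary], dim=-1)))
        new_cache = (1.0 - g) * self.cache + g * summary

        # Attend to both the current tokens and the updated cache, then mix
        # the two branches with a learned sigmoid ratio.
        out_self = self._attend(x, x)
        out_cache = self._attend(x, new_cache)
        lam = torch.sigmoid(self.mix)
        out = lam * out_cache + (1.0 - lam) * out_self

        # Detach before storing so the computation graph does not grow across steps.
        self.cache = new_cache.detach()
        return out


if __name__ == "__main__":
    layer = GRCAttention(dim=64, cache_len=16)
    tokens = torch.randn(2, 128, 64)
    print(layer(tokens).shape)  # torch.Size([2, 128, 64])
```

Because the cache is carried across forward passes, tokens from earlier inputs remain visible to later ones, which is how such a mechanism can extend the effective receptive field beyond the current sequence.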