Cached Transformers: Improving Transformers with Differentiable Memory Cache
Abstract: This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, enlarging the receptive field of attention and allowing long-range dependencies to be explored. By using a recurrent gating unit to continuously update the cache, our model achieves significant advancements in six language and vision tasks, including language modeling, machine translation, ListOps, image classification, object detection, and instance segmentation. Furthermore, our approach surpasses previous memory-based techniques in tasks such as language modeling and generalizes to a broader range of scenarios.
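The sketch below illustrates how a gated recurrent token cache could be combined with self-attention, in the spirit of the GRC attention described above. It is a minimal, single-head formulation with illustrative choices the abstract does not specify: the class name GRCAttention, the pooling-based summary of current tokens, the GRU-style sigmoid gate, and the learned sigmoid ratio mixing the cached and current attention branches are all assumptions, not the paper's exact design.

```python
# Minimal sketch of gated-recurrent-cache (GRC) style attention in PyTorch.
# Single-head, simplified; gating and cache details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GRCAttention(nn.Module):
    """Self-attention augmented with a recurrent token cache (sketch)."""

    def __init__(self, dim: int, cache_len: int = 64):
        super().__init__()
        self.dim = dim
        self.cache_len = cache_len
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Gate deciding how much of the old cache to overwrite (hypothetical parameterization).
        self.gate = nn.Linear(2 * dim, dim)
        # Learned scalar controlling the mix of cached vs. current attention.
        self.mix = nn.Parameter(torch.zeros(1))
        self.cache = None  # lazily initialized per batch size

    def _attend(self, x_q, x_kv):
        # Scaled dot-product attention of queries x_q over keys/values x_kv.
        q, k, v = self.q(x_q), self.k(x_kv), self.v(x_kv)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)
        return attn @ v

    def forward(self, x):  # x: (batch, seq, dim)
        b = x.size(0)
        if self.cache is None or self.cache.size(0) != b:
            self.cache = x.new_zeros(b, self.cache_len, self.dim)

        # Summarize current tokens to the cache length and update the cache
        # with a GRU-like interpolation; the update is differentiable within this step.
        summary = F.adaptive_avg_pool1d(x.transpose(1, 2), self.cache_len).transpose(1, 2)
        g = torch.sigmoid(self.gate(torch.cat([self.cache, summary], dim=-1)))
        new_cache = (1.0 - g) * self.cache + g * summary

        # Attend to both the current tokens and the updated cache, then mix
        # the two branches with a learned sigmoid ratio.
        out_self = self._attend(x, x)
        out_cache = self._attend(x, new_cache)
        lam = torch.sigmoid(self.mix)
        out = lam * out_cache + (1.0 - lam) * out_self

        # Detach before storing so the computation graph does not grow across steps.
        self.cache = new_cache.detach()
        return out


if __name__ == "__main__":
    layer = GRCAttention(dim=64, cache_len=16)
    tokens = torch.randn(2, 128, 64)
    print(layer(tokens).shape)  # torch.Size([2, 128, 64])
```

Because the cache is carried across forward passes, tokens from earlier inputs remain visible to later ones, which is how such a mechanism can extend the effective receptive field beyond the current sequence.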