
Attention Cache in Transformers

Updated 4 February 2026
  • Attention cache is a mechanism that stores and reuses key and value projections in transformer models, reducing redundant O(N^2) computations to O(N) during inference.
  • Various strategies like multi-query attention, token pruning, and low-rank compression optimize the cache to manage memory bottlenecks and maintain model accuracy.
  • Design trade-offs include balancing latency and memory usage, ensuring kernel compatibility, and adapting caching strategies for long-context and multimodal tasks.

An attention cache refers to any system or mechanism for storing and reusing the key and value projections necessary for transformer-style attention during inference or generation, thereby reducing computational redundancy and significantly improving efficiency. Attention caches are central to nearly all efficient LLM, vision-LLM (VLM), and diffusion model serving stacks, both as a major memory bottleneck and as a target of aggressive algorithmic optimization. The growing prevalence of long-context and multimodal transformers, and the surge in demand for interactive (low-latency) or high-throughput LLM deployment, have spurred a broad array of research into learning- and inference-time cache management, compression, and sharing strategies.

1. Fundamental Role of the Attention Cache in Transformers

In transformer architectures—including LLMs, masked autoregressive models, VLMs, and diffusion models—self-attention at each layer and token requires computing affinities between the query Q and a set of keys K and values V derived from all previously seen tokens. Naïve inference recomputes K and V for every past position at every step, incurring O(N^2) computation and memory. The attention cache eliminates this redundancy by storing the K, V projections for each position, layer, and, in many models, each head, so subsequent attention can be performed by a single matrix multiplication against the (growing) cached matrices. This reduces per-step compute to O(N) but creates a new bottleneck: cache size scales linearly with the sequence length and number of layers/heads, creating prohibitive memory demands for long contexts or large batches (Brandon et al., 2024).
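The mechanism above can be sketched for a single head in a few lines of NumPy. This is an illustrative toy (function and variable names are our own, not from any cited paper): each decode step appends the new token's K/V row to the cache and attends one query against everything cached, so the per-step cost is linear in the current length.

```python
import numpy as np

def attend_with_cache(q, k_new, v_new, cache):
    """One decode step: append the new token's K/V projections to the
    cache, then attend the single query against all cached positions."""
    cache["K"] = np.concatenate([cache["K"], k_new[None, :]])  # (T+1, d_k)
    cache["V"] = np.concatenate([cache["V"], v_new[None, :]])  # (T+1, d_k)
    scores = cache["K"] @ q / np.sqrt(q.shape[-1])             # (T+1,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                   # softmax
    return weights @ cache["V"]                                # (d_k,)

d_k = 4
cache = {"K": np.zeros((0, d_k)), "V": np.zeros((0, d_k))}
rng = np.random.default_rng(0)
for _ in range(3):  # three decode steps; each costs O(T), not O(T^2)
    out = attend_with_cache(rng.normal(size=d_k),
                            rng.normal(size=d_k),
                            rng.normal(size=d_k), cache)
```

A real implementation stores one such pair of arrays per layer and per KV head, which is exactly where the linear-in-T memory growth comes from.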

2. Anatomy and Constraints of Attention Cache Design

The canonical attention cache consists of, at each layer ℓ and for each attention head h, arrays K^(ℓ,h) ∈ R^{T×d_k} and V^(ℓ,h) ∈ R^{T×d_k}, where T is the context length and d_k the head dimension. In multi-head attention (MHA), one stores H distinct K, V arrays per layer, whereas in Multi-Query Attention (MQA) or Grouped-Query Attention (GQA), several heads share K, V, reducing the number of stored arrays (Brandon et al., 2024). Under standard settings, total cache size is:

Bytes_total = S × B × L × n_kv × d_head × 2 × (P/8)

where S is the sequence length, B the batch size, L the number of layers, n_kv the number of distinct KV heads, d_head the head dimension, and P the precision in bits (Brandon et al., 2024).
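A quick sanity check of this formula, using illustrative numbers chosen by us (32 layers, 8 KV heads, head dimension 128, fp16 — resembling a GQA model of the Llama-3-8B class, not figures from the text):

```python
def kv_cache_bytes(seq_len, batch, layers, n_kv_heads, d_head, precision_bits):
    """Total KV-cache size per the formula above: the factor 2 accounts
    for storing both K and V; P/8 converts bits to bytes."""
    return seq_len * batch * layers * n_kv_heads * d_head * 2 * (precision_bits / 8)

gib = kv_cache_bytes(seq_len=8192, batch=1, layers=32,
                     n_kv_heads=8, d_head=128, precision_bits=16) / 2**30
print(f"{gib:.2f} GiB")  # prints "1.00 GiB"
```

Note the linear dependence on every factor: doubling the batch or the context doubles the cache, which is why S and B dominate serving-time memory planning.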

Constraints emerge from GPU/TPU RAM size, bandwidth, and latency, as well as the need to support low prefill latency (time to first token), high throughput, and minimal cache staleness for generation or retrieval.

3. Strategies for Attention Cache Reduction and Management

Numerous, often complementary, lines of research address the memory and compute demands of the attention cache:

3.1. Head and Layer Sharing

  • Multi-Query Attention (MQA) / Grouped-Query Attention (GQA): Sharing K, V across query groups or across all heads reduces the cache width, often with minimal accuracy loss (Brandon et al., 2024).
  • Cross-Layer Attention (CLA): Adjacent layers share K, V (e.g., every two layers use the same cache). This reduces cache depth and, in combination with MQA, achieves up to 2× cache reduction at negligible performance cost (Brandon et al., 2024).
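The head-sharing idea can be made concrete with a minimal GQA sketch (our own toy, single token-batch, no masking): H query heads are partitioned into G groups, and each group attends against one shared cached K/V head, so the cache stores G rather than H head arrays — an H/G width reduction.

```python
import numpy as np

def gqa_attention(Q, K, V):
    """Grouped-query attention: H query heads share G < H cached KV heads.
    Q: (H, T, d); K, V: (G, T, d). The cache holds only G head arrays."""
    H, G = Q.shape[0], K.shape[0]
    assert H % G == 0
    group = H // G
    out = np.empty_like(Q)
    for h in range(H):
        g = h // group                       # map query head -> shared KV head
        s = Q[h] @ K[g].T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)        # row-wise softmax
        out[h] = w @ V[g]
    return out

rng = np.random.default_rng(0)
H, G, T, d = 8, 2, 5, 4
out = gqa_attention(rng.normal(size=(H, T, d)),
                    rng.normal(size=(G, T, d)),
                    rng.normal(size=(G, T, d)))
```

MQA is the special case G = 1; CLA applies the same sharing idea across adjacent layers rather than across heads within a layer.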

3.2. Token Pruning and Importance-based Eviction

  • Expected Attention (EA): Uses a closed-form score, derived from the distribution of future queries, to rank KV pairs for pruning without retraining or access to all future attention matrices (Devoto et al., 1 Oct 2025).
  • Self-Attention Guided Eviction (SAGE-KV): After a forward prefill pass, uses the model's own final attention snapshot to select the most relevant tokens to keep in the cache; combines the performance of dynamic methods with the low overhead of static pruning (Wang et al., 11 Mar 2025).
  • Adaptive Holistic Attention (AhaKV): Corrects for positional bias and leverages both recent attention and V-vector norms to select tokens, leading to unbiased token retention (Gu et al., 4 Jun 2025).
  • Task-Aware (Task-KV): Dynamically allocates more cache to attention heads identified as semantically heterogeneous, less to those aggregating or focusing, improving performance at fixed memory (He et al., 25 Jan 2025).
  • Head- and Block-wise Adaptive Paging (KV-Compress): Evicts cache blocks at the (layer, head) granularity, enabling fine control over memory usage and throughput (Rehg, 2024).
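A generic score-based eviction step — in the spirit of the attention-snapshot methods above, but not the exact rule of any one paper — can be sketched as: rank cached positions by the aggregated attention mass they received, keep the top-k, and drop the rest while preserving original order.

```python
import numpy as np

def evict_by_attention(K, V, attn_snapshot, keep):
    """Keep only the `keep` most-attended cache positions.
    attn_snapshot: (T,) aggregate attention mass each position received
    (e.g., summed over recent queries); a generic illustrative criterion."""
    idx = np.sort(np.argsort(attn_snapshot)[-keep:])  # top-k, original order
    return K[idx], V[idx], idx

rng = np.random.default_rng(0)
T, d = 10, 4
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
scores = rng.random(T)                       # stand-in attention snapshot
K2, V2, kept = evict_by_attention(K, V, scores, keep=4)
```

The methods above differ chiefly in how `attn_snapshot` is defined (expected future queries, final prefill attention, bias-corrected scores, per-head budgets) and at what granularity eviction happens (token, head, or block).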

3.3. Quantization, Low-Rank, and Latent Compression

  • Eigen Attention / SALS: Projects K, V into a learned low-rank subspace, drastically shrinking cache size and reducing compute, with minimal accuracy loss if low-rank structure is preserved pre-RoPE (Saxena et al., 2024, Mu et al., 28 Oct 2025).
  • KVSink: Identifies and preserves "attention sink" tokens (tokens disproportionately attended in future steps, often with extreme activations) to avoid outlier-induced quantization error (Su et al., 6 Aug 2025).
  • Hybrid Sparsity and Shared Cache (HySparse): Alternates sparse and dense layers, using full attention layers as oracles to determine important tokens and share their KV cache across subsequent sparse layers, often achieving 10× memory savings (Gao et al., 3 Feb 2026).
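The low-rank idea can be illustrated with a small sketch (ours, not the cited papers' training procedure — methods like Eigen Attention learn the projection offline, whereas here SVD on the keys stands in for it): cache the r-dimensional projections C = K·P instead of the full d-dimensional keys, and compute attention logits in the subspace.

```python
import numpy as np

def compress_cache(K, rank):
    """Low-rank cache compression sketch: find an r-dim basis for the keys
    via SVD and store the projected cache C (T, r) instead of K (T, d)."""
    _, _, Vt = np.linalg.svd(K, full_matrices=False)
    P = Vt[:rank].T                 # (d, r) orthonormal projection basis
    return K @ P, P                 # compressed cache, basis

def approx_scores(q, C, P):
    # logits in the subspace: (q P)(K P)^T approximates q K^T
    return (q @ P) @ C.T

rng = np.random.default_rng(0)
T, d, r = 16, 8, 4
K = rng.normal(size=(T, r)) @ rng.normal(size=(r, d))   # exactly rank-r keys
C, P = compress_cache(K, rank=r)
q = rng.normal(size=d)
scores = approx_scores(q, C, P)     # matches q @ K.T when rank suffices
```

When the true keys are (near) rank-r, the subspace logits match the full logits, which is the regime these methods target; applying RoPE after, rather than before, the projection matters because rotation can destroy the low-rank structure.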

3.4. Multimodal, Masked, and Structured Attention Caching

  • Cache-Aware Attention (MARché): Differentiates between active and cached tokens during generation, only recomputing K, V for those likely to need updates, with selective refresh triggered by changes in attention (e.g., via attention scores from new tokens) (Jiang et al., 22 May 2025).
  • Prompt Cache: For LLMs with templated or repetitive prefixes, enables modular caching and reuse of attention states across different prompts, dramatically reducing prefill latency (Gim et al., 2023).
  • Spatial-Temporal Sparse Attention (PureKV): For VLLMs and long video inputs, combines lower-layer importance estimation and masking for consistent cache compression compatible with efficient attention kernels (Jiang et al., 29 Oct 2025).
  • Temporal Correspondence (TempCache): In diffusion/world models, groups near-duplicate keys across frames, bounding cache growth and sparsifying attention for long-range video synthesis (Samuel et al., 2 Feb 2026).
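The prefix-reuse pattern behind Prompt Cache can be sketched as a content-addressed store (a hypothetical interface of our own; real systems key on token-tree structure and attach the states inside the serving engine): KV states for a shared prompt prefix are computed once, and later requests with the same prefix skip that portion of prefill.

```python
import hashlib

class PrefixCache:
    """Toy prompt-prefix cache: memoize prefill results by prefix content."""
    def __init__(self, prefill_fn):
        self.prefill_fn = prefill_fn   # tokens -> KV state (model-specific)
        self.store = {}

    def get_kv(self, prefix_tokens):
        key = hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()
        if key not in self.store:                    # prefill only on miss
            self.store[key] = self.prefill_fn(prefix_tokens)
        return self.store[key]

calls = []
cache = PrefixCache(lambda toks: (calls.append(1), list(toks))[1])
template = [1, 2, 3, 4]                  # e.g., a shared system prompt
cache.get_kv(template)
cache.get_kv(template)                   # second request reuses cached state
```

The prefill function runs once for the two identical requests; the time-to-first-token savings grow with the length of the shared prefix.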

4. Application-Specific Insights: Code Generation, Vision, and Diffusion

Several recent works exploit structural sparsity patterns unique to their domain:

  • Anchor Attention (AnchorCoder): In code LLMs, attention becomes near-sparse after the first layer, with most attention mass focused on line-break tokens ("anchors"). By compressing context into anchor points and revisiting early-layer anchors with cross-layer attention, the cache footprint can be reduced by ≥70% with negligible loss in code generation quality (Zhang et al., 2024).
  • Patch-driven Relational Cache (PRGA): In cache-based few-shot vision tasks, intra-image patch dependencies are distilled into the cache via graph attention, enriching the representational diversity of support set keys for robust retrieval (Ahmad et al., 13 Dec 2025).

5. Practical Trade-Offs, Limitations, and Directions

Key limitations and trade-offs across methods include:

  • Latency vs. Memory: Cache quantization and eviction reduce RAM but can introduce unfavorable accuracy or latency trade-offs, especially in tasks requiring the retention of long-range dependencies (e.g., code generation, retrieval-augmented generation) (Zhang et al., 2024, Devoto et al., 1 Oct 2025).
  • Kernel Compatibility: Strategies compatible with efficient attention implementations (e.g., FlashAttention, block-sparse) are favored for practical deployment and often require cross-layer, RoPE-agnostic, or index-based importance estimation (Jiang et al., 29 Oct 2025, Mu et al., 28 Oct 2025).
  • Adaptive and Domain-Aware Caching: Many approaches (Task-KV, Anchor Attention) require task- or domain-specific anchor selection, semantic separation, or recurrence analysis, limiting direct off-the-shelf applicability (Zhang et al., 2024, He et al., 25 Jan 2025).
  • Interaction With Training: Most cache optimizations are post-training and inference-only, but learned (e.g., low-rank) projections and cache-adapted adapters can offer gains if integrated into pre-training or fine-tuning (Saxena et al., 2024, Ahmad et al., 13 Dec 2025).
  • Extreme Compression: Methods such as HCAttention and HySparse demonstrate that, even at 10×–12× cache compression, task accuracy may be preserved in specific regimes, but diminishing returns and bottlenecks (e.g., PCIe bandwidth, intra-cache drift) surface at higher ratios (Yang et al., 26 Jul 2025, Gao et al., 3 Feb 2026).

6. Experimental Benchmarks and Outcomes

Empirical validation across large benchmarks supports the broad applicability of attention cache optimizations:

| Method | Key Result / Benchmark | Compression / Speedup | Accuracy Impact |
|---|---|---|---|
| Expected Attention (Devoto et al., 1 Oct 2025) | LongBench, RULER, Aime25, MATH-500 | 2–12× cache reduction | <1% drop (50%–90% removal) |
| SAGE-KV (Wang et al., 11 Mar 2025) | LongBench, Llama3.1-8B, Qwen2.5-7B | 2–4× cache reduction | <0.5 pt drop, better than static |
| AnchorCoder (Zhang et al., 2024) | HumanEval+, MBPP, CodeLlama-7B | ≥70% cache drop | Pass@1: 31.7% (vs. 31.1% dense) |
| HCAttention (Yang et al., 26 Jul 2025) | Llama-3-8B, LongBench, 4M tok, A100 | 4×–8× cache reduction | <1–1.7% task loss |
| HySparse (Gao et al., 3 Feb 2026) | 7B/80B MoE, MMLU, GSM8K | 4×–10× KV mem | Outperforms full/sparse SWA |
| PureKV (Jiang et al., 29 Oct 2025) | VideoLLaMA2, Qwen2.5-VL-7B, MVBench | 5× cache, 3× speed | <5% ROUGE drop |
| CacheFormer (Singh et al., 18 Apr 2025) | WikiText-103, enwik8 | >8–10% PPL drop | Outperforms Long-Short/Performer/etc. |

7. Outlook: Open Directions and Synthesis

The evolving taxonomy of attention cache methods reflects a broader shift: from treating the cache as a passive buffer toward recognizing its information-theoretic, architectural, and workload-dependent properties. Key open directions include: optimizing cache and attention jointly with quantization, developing more adaptive and input-aware token selection heuristics, integrating cache logic with model parallelism and system-level resource schedulers, and exploring cache coherence for non-autoregressive, encoder–decoder, and retrieval-augmented generation. The attention cache, as a unifying abstraction, sits at the nexus of efficiency, accuracy, and scalability for all modern transformer inference workloads.
