Unified compression of attention weights addressing all major memory bottlenecks

Develop compressed, parameter-efficient representations of transformer attention weights that simultaneously address all three major memory bottlenecks: (i) training-time memory from optimizer states and gradients, (ii) inference-time KV-cache size that scales with the number of heads and head dimension, and (iii) GPU cache pressure during attention computation in kernels such as flash-attention.
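To make the second bottleneck concrete, a back-of-envelope calculation shows how KV-cache size scales linearly with the number of heads and the head dimension. The model shape and precision below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope KV-cache sizing (illustrative numbers, not from the paper):
# per token, each layer stores one key and one value vector for every head.
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # factor 2 accounts for storing both keys and values; fp16 assumed
    return 2 * n_layers * n_heads * d_head * seq_len * batch * bytes_per_elem

# A hypothetical 32-layer, 32-head model with d_head=128 at 4096 tokens:
gb = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=4096, batch=1) / 2**30
print(round(gb, 2))  # -> 2.0 (GiB); doubles if heads or head dim double
```

Halving either the number of heads or the head dimension halves this cache, which is why compressed attention-weight representations that shrink these dimensions directly reduce inference memory.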

Background

The paper identifies three distinct sources of memory cost in transformer models: optimizer states and gradients during training, the key–value cache during autoregressive inference, and on-chip cache pressure during attention computation (e.g., in flash-attention kernels). While many methods reduce one or two of these costs, the authors note that creating a single compressed attention-weight representation that tackles all three simultaneously has been difficult.

Motivated by this challenge, the paper proposes Tucker Attention, a tensor-based factorization that aims to reduce parameter counts and KV-cache requirements while remaining compatible with flash-attention and RoPE. This problem statement reflects the broader, ongoing objective of unifying these memory-efficiency goals within a single attention-weight representation.
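The following sketch illustrates the general idea of Tucker-compressing attention weights; it is a minimal NumPy illustration of a truncated Tucker decomposition via higher-order SVD (HOSVD), not the paper's exact construction. The tensor shape (heads × model dim × head dim) and the ranks are assumptions for the example:

```python
import numpy as np

# Illustrative sketch: stack per-head projection matrices into a 3-way tensor
# (n_heads x d_model x d_head) and compress it with a truncated Tucker
# decomposition computed by HOSVD. Shapes and ranks are assumed, not from the paper.
rng = np.random.default_rng(0)
n_heads, d_model, d_head = 8, 64, 16
W = rng.standard_normal((n_heads, d_model, d_head))  # stacked attention weights

def mode_unfold(T, mode):
    # Flatten tensor T into a matrix with the given mode as rows.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    # Multiply matrix M into tensor T along the given mode.
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

ranks = (4, 16, 8)  # assumed truncation ranks per mode
factors = []
for mode, r in enumerate(ranks):
    U, _, _ = np.linalg.svd(mode_unfold(W, mode), full_matrices=False)
    factors.append(U[:, :r])  # leading left singular vectors of each unfolding

# Core tensor: contract W with the transposed factors along every mode.
core = W
for mode, U in enumerate(factors):
    core = mode_multiply(core, U.T, mode)

# Approximate reconstruction from the compressed representation.
W_hat = core
for mode, U in enumerate(factors):
    W_hat = mode_multiply(W_hat, U, mode)

full_params = W.size
tucker_params = core.size + sum(U.size for U in factors)
print(full_params, tucker_params)  # compressed storage is much smaller
```

Storing only the small core tensor plus three factor matrices, rather than the dense weight tensor, is how a Tucker-style factorization cuts parameter counts; the paper additionally targets KV-cache and kernel-level cache pressure, which this toy example does not model.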

References

These memory costs manifest at multiple stages: during training through optimizer states and gradients, during inference through the KV-cache (whose size scales with the number of heads and head dimension), and during attention computation through GPU cache pressure in kernels like flash-attention. Designing compressed, parameter-efficient representations of attention weights that address all three bottlenecks simultaneously remains an open challenge.

Tucker Attention: A generalization of approximate attention mechanisms (2603.30033 - Klein et al., 31 Mar 2026) in Section 1, Introduction