Unified compression of attention weights addressing all major memory bottlenecks
Develop compressed, parameter-efficient representations of transformer attention weights that simultaneously address all three major memory bottlenecks: (i) training-time memory from optimizer states and gradients, (ii) inference-time KV-cache size that scales with the number of heads and head dimension, and (iii) GPU cache pressure during attention computation in kernels such as flash-attention.
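One common route to a compressed, parameter-efficient representation is a low-rank factorization of a projection matrix. The sketch below is illustrative only: the dimensions, rank, and the claim of "32x" are example numbers chosen here, not values from the text, and the Adam detail is standard background, not something the statement above prescribes.

```python
import numpy as np

# Illustrative: approximate a dense attention projection W (d x d)
# with a low-rank product A @ B, a parameter-efficient representation.
# d_model and rank are example values, not taken from the text.
d_model, rank = 4096, 64
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_model))   # dense projection: d^2 params
A = rng.standard_normal((d_model, rank))      # factor 1: d * r params
B = rng.standard_normal((rank, d_model))      # factor 2: r * d params

dense_params = W.size                         # d^2
lowrank_params = A.size + B.size              # 2 * d * r

# Adam keeps two extra states (m, v) per trainable parameter, so
# optimizer-state memory shrinks by the same ratio as the param count.
print(dense_params / lowrank_params)          # prints 32.0 for these sizes
```

Because gradients and optimizer states are allocated per trainable parameter, the same ratio applies to the training-time bottleneck (i) above; whether the factorized weights preserve model quality is a separate empirical question.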
These memory costs manifest at multiple stages: during training through optimizer states and gradients, during inference through the KV-cache (whose size scales with the number of heads and head dimension), and during attention computation through GPU cache pressure in kernels like flash-attention. Designing compressed, parameter-efficient representations of attention weights that address all three bottlenecks simultaneously remains an open challenge.
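The KV-cache scaling mentioned above can be made concrete with a back-of-envelope calculation. The model sizes below are illustrative choices, not tied to any specific architecture, and the helper function is defined here for the sketch:

```python
# Back-of-envelope estimate of inference-time KV-cache size, showing how
# it scales with the number of heads and head dimension. All sizes are
# illustrative examples; bytes_per_el=2 assumes fp16/bf16 storage.

def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, batch, bytes_per_el=2):
    """KV-cache size in bytes: 2 cached tensors (K and V) per layer."""
    return 2 * n_layers * batch * seq_len * n_heads * head_dim * bytes_per_el

full = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128,
                      seq_len=4096, batch=1)
# A compressed representation that caches rank-32 per-head latents
# instead of full 128-dim keys/values cuts the cache proportionally.
compressed = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=32,
                            seq_len=4096, batch=1)

print(f"full KV-cache:    {full / 2**30:.2f} GiB")        # 2.00 GiB
print(f"rank-32 latents:  {compressed / 2**30:.2f} GiB")  # 0.50 GiB
```

The formula also makes the kernel-level pressure (bottleneck iii) visible: per attention step, the working set that must stream through GPU caches grows with `n_heads * head_dim`, so shrinking the effective head dimension helps both inference memory and kernel cache behavior.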