Effective one-shot KV context compaction

Determine efficient single-pass key–value (KV) cache compaction procedures for Transformer-based autoregressive language models: procedures that reduce the size of the KV cache while preserving downstream model behavior when the compacted prefix is later concatenated with uncompacted and future tokens.
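The preservation requirement can be made concrete with a toy check: run single-head attention for new queries over the full original prefix plus new tokens, then over a compacted prefix plus the same new tokens, and compare outputs. A minimal sketch, assuming nothing from the paper — all names are illustrative, and simple truncation stands in for a real compaction procedure:

```python
import numpy as np

def attn(Q, K, V):
    # single-head scaled dot-product attention over a KV cache
    s = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d = 16
K_pre, V_pre = rng.normal(size=(128, d)), rng.normal(size=(128, d))  # cached prefix
K_new, V_new = rng.normal(size=(8, d)), rng.normal(size=(8, d))      # future tokens
Q_new = rng.normal(size=(8, d))                                      # their queries

# hypothetical compacted prefix (truncation here, as a naive stand-in)
K_c, V_c = K_pre[:32], V_pre[:32]

out_full = attn(Q_new, np.vstack([K_pre, K_new]), np.vstack([V_pre, V_new]))
out_comp = attn(Q_new, np.vstack([K_c, K_new]), np.vstack([V_c, V_new]))
rel_err = np.linalg.norm(out_full - out_comp) / np.linalg.norm(out_full)
```

A compaction procedure counts as behavior-preserving to the extent that `rel_err` stays small for the queries the model will actually issue after compaction.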

Background

The paper addresses the growing memory bottleneck in Transformer-based LLMs caused by the KV cache, especially in long-context settings. Common practices such as summarization or token dropping are lossy and can harm downstream performance, motivating more principled compaction methods.

Cartridges provide high-quality latent-space compaction but require expensive gradient-based optimization. The authors propose Attention Matching as a faster alternative that approximates the attention outputs and attention mass of the original cache, aiming to preserve model behavior after compaction. Despite these advances, the general problem of achieving effective compaction in a single pass without degrading downstream behavior is explicitly stated as open.
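The idea of compacting a cache while approximating its attention outputs can be sketched with a simple heuristic. The following is NOT the paper's Attention Matching algorithm — it is a k-means-style merge, under the assumption that tokens receiving high attention mass from a set of probe queries should dominate their cluster's centroid:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head scaled dot-product attention over a KV cache."""
    s = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def compact_kv(K, V, m, probes, iters=10, seed=0):
    """Merge n cached (key, value) pairs down to m centroids.

    Illustrative heuristic: cluster keys by dot-product similarity,
    weighting each token by the attention mass it receives from the
    probe queries, so attention outputs are roughly preserved.
    """
    rng = np.random.default_rng(seed)
    n, d = K.shape
    # attention mass each cached token receives from the probes
    s = probes @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    mass = w.sum(axis=0)                      # (n,), strictly positive
    Kc = K[rng.choice(n, size=m, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(K @ Kc.T, axis=1)  # nearest centroid by dot product
        for j in range(m):
            sel = assign == j
            if sel.any():
                Kc[j] = np.average(K[sel], axis=0, weights=mass[sel])
    assign = np.argmax(K @ Kc.T, axis=1)
    Vc = np.zeros((m, V.shape[1]))
    for j in range(m):
        sel = assign == j
        if sel.any():
            Vc[j] = np.average(V[sel], axis=0, weights=mass[sel])
    return Kc, Vc

# toy cache: 128 tokens with head dim 16, compacted to 16 pairs
rng = np.random.default_rng(1)
K, V = rng.normal(size=(128, 16)), rng.normal(size=(128, 16))
probes = rng.normal(size=(8, 16))
Kc, Vc = compact_kv(K, V, m=16, probes=probes)
rel_err = (np.linalg.norm(softmax_attention(probes, K, V)
                          - softmax_attention(probes, Kc, Vc))
           / np.linalg.norm(softmax_attention(probes, K, V)))
```

This single pass needs no gradient-based optimization, which is the trade-off the paper targets; how well any such procedure preserves behavior once future tokens arrive is exactly the open question.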

References

Effective context compaction—reducing KV cache size in a single pass while preserving downstream model behavior—remains an important open problem.

Fast KV Compaction via Attention Matching (2602.16284, Zweiger et al., 18 Feb 2026), Section 1 (Introduction)