
KVzap: Adaptive KV Cache Pruning for LLMs

Updated 14 January 2026
  • KVzap is a lightweight, input-adaptive key-value caching method for transformer-based LLMs that accelerates inference by compressing cache representations.
  • It uses learned surrogate models to approximate oracle scores, enabling 2–4× compression with at most 0.2% downstream accuracy loss.
  • KVzap introduces minimal compute overhead and seamlessly supports both prefilling and decoding stages, compatible with optimized attention kernels.

KVzap is a lightweight, input-adaptive key-value (KV) cache pruning method for transformer-based LLM inference that addresses the memory and bandwidth bottlenecks introduced by growing context lengths. By learning a surrogate model to approximate the state-of-the-art KVzip oracle, KVzap predicts importance scores for each key–value pair using the model’s hidden states, thereby achieving 2–4× KV cache compression with negligible (≤0.2%) loss in downstream accuracy across long-context reasoning and QA benchmarks. KVzap introduces minimal (<1.1% of a transformer layer) compute overhead, operates in both prefilling and decoding stages, and is compatible with highly optimized attention kernels such as FlashAttention2 and PagedAttention (Jegou et al., 12 Jan 2026).

1. Methodological Foundations

The core objective in KV cache pruning is to compress past sequence representations along the time axis, T, to accelerate attention mechanisms while minimizing degradation in model faithfulness. KVzap is fundamentally designed to:

  • Approximate the optimal but impractical KVzip oracle by training compact surrogates (one per layer and head), thereby making score computation tractable for both prefilling and autoregressive decoding without double-prefill or extra cache access.
  • Achieve a target compression ratio (CR) between 2–4× by pruning low-importance KV pairs as measured via surrogate scores.
  • Maintain model compatibility and phase-agnostic operation, supporting both the full-context-prefilling and single-step-decoding regimes found in practical deployments.

2. Algorithmic Pipeline

The KVzap methodology consists of several critical components:

2.1 KVzip and KVzip+ Oracles

Original KVzip scoring at position i is defined as the maximum attention weight placed on position i when the model is made to repeat its own prompt:

s_i = \max_{j \in \text{RepeatPrompt}} a_{ji}

KVzip+ augments this by scaling the attention with the output value’s norm relative to its hidden state:

s_i^+ = \max_{j \in \text{RepeatPrompt}} a_{ji} \cdot \frac{\| W_O v_i \|}{\| h_j \|}

Exact calculation of s_i^+ is prohibitively costly during decoding because it requires a second forward pass over the prompt.
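The oracle score above can be sketched numerically. The following minimal single-head illustration (my own sketch, not the paper's implementation; the function name and shapes are assumptions) computes the KVzip+-style score given attention weights and value vectors from a repeated-prompt pass:

```python
import numpy as np

def kvzip_plus_scores(attn, values, W_O, hidden):
    """Sketch of KVzip+-style oracle scoring for one head.

    attn:   (T_repeat, T) attention weights a_ji from the repeated-prompt pass
    values: (T, D_v) value vectors v_i
    W_O:    (D_model, D_v) output projection for this head
    hidden: (T_repeat, D_model) hidden states h_j of the attending positions
    """
    v_norm = np.linalg.norm(values @ W_O.T, axis=-1)     # ||W_O v_i||, shape (T,)
    h_norm = np.linalg.norm(hidden, axis=-1)             # ||h_j||, shape (T_repeat,)
    scaled = attn * (v_norm[None, :] / h_norm[:, None])  # a_ji * ||W_O v_i|| / ||h_j||
    return scaled.max(axis=0)                            # max over j, one score per position
```

The max over attending positions j matches the formula above; in practice this is exactly the double-pass cost that the surrogate in the next subsection avoids.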

2.2 Surrogate Modeling

To circumvent expensive oracle computation, KVzap trains small, per-(layer, head) surrogates f_{ℓ,h} mapping hidden states h_t ∈ ℝ^{D_h} to approximations of log s⁺_{t,h}. Two architectures are supported:

  • KVzap-Linear: A single projection matrix W ∈ ℝ^{D_h × H} generates all head scores.
  • KVzap-MLP: A two-layer MLP with a hidden size of D_h/8 and GELU activation, outputting H scores.

Training uses 1.2 million hidden state–score pairs generated from diverse domains (English, multilingual text, code, math); validation R² scores per head on held-out data fall in the 0.63–0.77 range.
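The two surrogate architectures can be sketched as follows (a minimal numpy sketch with untrained stand-in weights and biases omitted; the paper trains one surrogate per layer and head, and the function names here are mine):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def kvzap_linear(h, W):
    """Linear surrogate: hidden states (T, D_h) -> per-head log-scores (T, H)."""
    return h @ W                      # W: (D_h, H)

def kvzap_mlp(h, W1, W2):
    """Two-layer MLP surrogate with a D_h/8 bottleneck and GELU."""
    return gelu(h @ W1) @ W2          # W1: (D_h, D_h//8), W2: (D_h//8, H)
```

Both variants emit one score per head from the same hidden state, which is what allows per-head pruning decisions at negligible cost.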

2.3 Threshold-Based Pruning

At inference, after the QKV projection in each layer, the surrogate produces a score matrix S ∈ ℝ^{T × H}. Given a fixed global threshold τ, only positions with S[t, h] ≥ τ are retained per head; the rest are pruned. To safeguard local recency information, the last w = 128 tokens are always retained (sliding window). This adaptively tunes the cache compression ratio to each prompt's "information density."

Prefilling pseudocode:

import numpy as np

def compress(hidden_states, keys, values, kvzap_model, tau, window=128):
    scores = kvzap_model(hidden_states)   # importance scores, shape (T, H)
    scores[-window:] = np.inf             # always preserve the most recent tokens
    keep_mask = scores >= tau             # boolean mask, shape (T, H)
    new_keys = keys[keep_mask]            # per-head selection (result is ragged
    new_values = values[keep_mask]        # across heads, stored flattened)
    return new_keys, new_values

2.4 Decoding Adaptation

During decoding, pruning applies as in prefilling but operates on a rolling buffer. Each generation step uses updated surrogate scores from the new hidden state to decide which prior KV pairs to evict, keeping the last w tokens unpruned. Because surrogates consume only already-available hidden states, no additional cache reads are required.

3. Comparison to Prior KV Cache Pruning Methods

A head-to-head comparison highlights KVzap's improvements in speed, compatibility, and fidelity.

| Method | Prefill Passes | Decoding-Compatible | Overhead | Max CR | Accuracy Loss |
|---|---|---|---|---|---|
| KVzip | 2× (double-length) | No | High | — | ≈0% |
| KVzip+ (oracle) | 2 | No | Very High | — | ≈0% |
| ExpectedAttn | 1 | Yes | Medium | ≈2× | 1–3% |
| DuoAttention | 1 | Yes | Medium | ≈2× | 2–5% |
| Compactor | 1 | Yes | Medium | ≈2.5× | 1–3% |
| KVzap-Linear | 1 | Yes | ≈0.02% FLOPs | — | ≤0.2% |
| KVzap-MLP | 1 | Yes | ≈1.1% FLOPs | 3.5× | ≤0.2% |

KVzap eliminates the double-prefill overhead of KVzip, allows pruning during decoding, enables adaptive CR per prompt, and introduces negligible extra computation, enabling direct integration with high-efficiency attention kernels.

4. Implementation and Deployment Considerations

4.1 Prefilling Pipeline

Tokenized input is embedded and fed through the transformer layers. At each layer ℓ:

  • QKV projections and attention yield the hidden state h^ℓ.
  • kvzap_model_ℓ produces importance scores.
  • KV pairs with scores at or above the threshold τ are retained, along with the last w tokens of the sliding window.

4.2 Decoding Pipeline

For steps t = 1…T_dec:

  • Compute the query Q^t for the current hidden state.
  • Retrieve the pruned caches K_pruned, V_pruned for each layer and head.
  • Compute attention and the output hidden state h^t.
  • Update the per-head score buffer using kvzap_model_ℓ(h^t) and prune as necessary.
  • Generate the next token.
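The eviction step above can be sketched for a single head as follows (a minimal sketch; the function name, argument shapes, and rolling-buffer layout are my assumptions, not the paper's API):

```python
import numpy as np

def decode_step(h_t, k_t, v_t, cache_k, cache_v, cache_scores,
                kvzap_model, tau, window=128):
    """One decoding step for a single head: append the new KV pair,
    score it from the already-available hidden state, and evict entries
    below tau that fall outside the recency window."""
    cache_k = np.concatenate([cache_k, k_t[None]])        # (T+1, D_head)
    cache_v = np.concatenate([cache_v, v_t[None]])
    score_t = kvzap_model(h_t)                            # score for the new token
    cache_scores = np.concatenate([cache_scores, [score_t]])

    T = len(cache_scores)
    keep = cache_scores >= tau
    keep[max(0, T - window):] = True                      # never evict recent tokens
    return cache_k[keep], cache_v[keep], cache_scores[keep]
```

Note that no cache reads beyond the normal attention access are needed: the score comes from h^t alone.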

4.3 Surrogate Model Storage and Compute

KVzap-Linear stores L weight matrices of size (D_h × H) in fp16 or int8; KVzap-MLP stores 2L matrices of sizes (D_h × D_h/8) and (D_h/8 × H). Relative FLOPs overhead per transformer layer:

| Model | KVzap-MLP | KVzap-Linear |
|---|---|---|
| Qwen3-8B | 1.09% | 0.02% |
| Llama-3.1-8B | 0.96% | 0.02% |
| Qwen3-32B | 0.67% | 0.01% |
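These figures can be sanity-checked with a back-of-envelope FLOPs model. The accounting below (4·d² for the attention projections plus 2·d·d_ff for the feed-forward block, multiply-adds counted as 2 FLOPs) is my own assumption, not the paper's, so it reproduces the table only to order of magnitude:

```python
def surrogate_overhead(d_model, n_heads, d_ff, variant="linear"):
    """Rough per-token FLOPs ratio of a KVzap surrogate vs. one transformer
    layer. Back-of-envelope only; ignores attention itself and norms."""
    layer = 2 * (4 * d_model**2 + 2 * d_model * d_ff)
    if variant == "linear":
        surrogate = 2 * d_model * n_heads                 # one (D_h x H) matmul
    else:  # two-layer MLP with a d_model/8 bottleneck
        hidden = d_model // 8
        surrogate = 2 * (d_model * hidden + hidden * n_heads)
    return surrogate / layer
```

With Qwen3-8B-like dimensions this yields fractions of a percent for the linear variant and on the order of 1% for the MLP, consistent with the table's scale.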

5. Empirical Results Across Models and Tasks

Summary results demonstrate that KVzap maintains faithfulness and achieves competitive or state-of-the-art cache compression.

| Model | KVzap Variant | τ | RULER 4k | RULER 16k | LongBench | AIME25 pass@4 | Avg. CR |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | MLP | –4 | 95.32→95.09 (0.74) | 92.99→92.78 (0.72) | 46.74→46.49 (0.66) | 0.77→0.77 (0.75) | 3.5× |
| Llama-3.1-8B | Linear | –7 | 95.69→95.55 (0.68) | 93.42→93.29 (0.70) | 45.25→44.65 (0.62) | — | 3.0× |
| Qwen3-32B | MLP | –4 | 95.65→95.95 (0.68) | 95.19→94.96 (0.65) | 50.56→50.40 (0.57) | 0.83→0.87 (0.60) | 2.7× |

(Numbers "A→B (x)" show baseline → pruned accuracy, with x the fraction of cache kept.)

  • On RULER (4k/16k tokens), both MLP and Linear surrogates maintain ≥95% accuracy up to 3–4× CR, matching or slightly exceeding the KVzip+ oracle.
  • LongBench results show near-baseline performance (±0.2%) up to 2–3× CR, with adaptive CR in denser, real-world contexts.
  • On AIME25 reasoning, the MLP variant preserves pass@1/4 rates even when discarding over 50% of the cache.
  • A 3× cache reduction translates directly into lower memory traffic and higher decoding throughput in long-context generation, provided the attention kernel supports the pruned layout.
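To make the memory claim concrete, here is the standard KV cache size arithmetic applied to an illustrative long-context configuration (the layer/head counts below are assumed for this sketch, not taken from the paper):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """Total KV cache size: keys + values, across all layers and KV heads
    (fp16 by default, hence 2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Illustrative config: 36 layers, 8 grouped KV heads, head dim 128,
# 32k-token context, fp16.
full_bytes = kv_cache_bytes(36, 8, 128, 32_768)   # 4.5 GiB exactly
pruned_bytes = full_bytes // 3                    # ~3x compression as above
```

At this scale, 3× compression frees roughly 3 GiB per sequence, which is where the throughput gains in batched serving come from.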

6. Insights, Ablations, and Leaderboard Placement

  • Threshold vs. Top-k Pruning: A fixed τ introduces per-prompt compression variability (±20%) aligned with the information density of each prompt. This thresholding approach outperforms fixed top-k selection across heads and layers.
  • Sliding Window Ablation: Disabling recent-token preservation (w = 0) results in dramatic accuracy loss (28% on LongBench). Setting w = 128 recovers near-baseline performance (62.5%), with no substantial gains beyond w = 512.
  • Surrogate Robustness: Despite moderate R² values (0.63–0.77), pruning is highly robust due to the stability of the importance ranking.
  • Leaderboard Standing: On the NVIDIA KVpress leaderboard, KVzap variants offer the best trade-off between cache compression and RULER-4k accuracy among 15+ published methods, matching results formerly attainable only by non-deployable, oracle-style approaches.
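The threshold-vs-top-k distinction can be illustrated with synthetic scores (the score distributions below are stand-ins, not real model outputs): a fixed threshold keeps more of an information-dense prompt and less of a redundant one, while top-k keeps the same fraction of both.

```python
import numpy as np

def keep_fraction_threshold(scores, tau):
    """Fraction of KV pairs retained under a fixed global threshold."""
    return float((scores >= tau).mean())

def keep_fraction_topk(scores, k):
    """Top-k retains exactly k entries, independent of prompt content."""
    return min(k, len(scores)) / len(scores)

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 1.0, size=2000)     # stand-in: information-dense prompt
sparse = rng.normal(-2.0, 1.0, size=2000)   # stand-in: redundant prompt
```

With τ = –1, the threshold keeps roughly 84% of the dense prompt's scores but only about 16% of the sparse one's, which is precisely the per-prompt adaptivity described above.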

7. Significance in Production LLM Systems

KVzap fulfills key deployment criteria for LLM inference: rapid pruning based on information-aware surrogates, compatibility with decoding and streaming workloads, kernel-friendliness, and negligible reduction in model quality. By leveraging learned approximations of oracle-style scoring, it allows for significant memory savings and inference acceleration in practical, large-scale deployments without requiring model retraining or substantial engineering overhead (Jegou et al., 12 Jan 2026).
