
KVzap: Adaptive KV Cache Pruning for LLMs

Updated 14 January 2026
  • KVzap is a lightweight, input-adaptive key-value caching method for transformer-based LLMs that accelerates inference by compressing cache representations.
  • It uses learned surrogate models to approximate oracle scores, enabling 2–4× compression with at most 0.2% downstream accuracy loss.
  • KVzap introduces minimal compute overhead and seamlessly supports both prefilling and decoding stages, compatible with optimized attention kernels.

KVzap is a lightweight, input-adaptive key-value (KV) cache pruning method for transformer-based LLM inference that addresses the memory and bandwidth bottlenecks introduced by growing context lengths. By learning a surrogate model to approximate the state-of-the-art KVzip oracle, KVzap predicts importance scores for each key–value pair using the model’s hidden states, thereby achieving 2–4× KV cache compression with negligible (≤0.2%) loss in downstream accuracy across long-context reasoning and QA benchmarks. KVzap introduces minimal (<1.1% of a transformer layer) compute overhead, operates in both prefilling and decoding stages, and is compatible with highly optimized attention kernels such as FlashAttention2 and PagedAttention (Jegou et al., 12 Jan 2026).

1. Methodological Foundations

The core objective in KV cache pruning is to compress past sequence representations along the time axis, T, to accelerate attention mechanisms while minimizing degradation in model faithfulness. KVzap is fundamentally designed to:

  • Approximate the optimal but impractical KVzip oracle by training compact surrogates (one per layer and head), thereby making score computation tractable for both prefilling and autoregressive decoding without double-prefill or extra cache access.
  • Achieve a target compression ratio (CR) between 2–4× by pruning low-importance KV pairs as measured via surrogate scores.
  • Maintain model compatibility and phase-agnostic operation, supporting both the full-context-prefilling and single-step-decoding regimes found in practical deployments.

2. Algorithmic Pipeline

The KVzap methodology consists of several critical components:

2.1 KVzip and KVzip+ Oracles

Original KVzip scoring at position i is defined as the maximum attention weight placed on position i when the model is made to repeat its own prompt:

s_i = \max_{j \in \text{RepeatPrompt}} a_{ji}

KVzip+ augments this by scaling the attention with the output value’s norm relative to its hidden state:

s_i^+ = \max_{j \in \text{RepeatPrompt}} a_{ji} \cdot \frac{\| W_O v_i \|}{\| h_j \|}

Exact calculation of s_i^+ is prohibitively costly during decoding because it requires a second forward pass over the prompt.
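The oracle score above can be sketched numerically. The following minimal single-head illustration (my own sketch, not the paper's implementation; the function name and shapes are assumptions) computes the KVzip+-style score given attention weights and value vectors from a repeated-prompt pass:

```python
import numpy as np

def kvzip_plus_scores(attn, values, W_O, hidden):
    """Sketch of KVzip+-style oracle scoring for one head.

    attn:   (T_repeat, T) attention weights a_ji from the repeated-prompt pass
    values: (T, D_v) value vectors v_i
    W_O:    (D_model, D_v) output projection for this head
    hidden: (T_repeat, D_model) hidden states h_j of the attending positions
    """
    v_norm = np.linalg.norm(values @ W_O.T, axis=-1)     # ||W_O v_i||, shape (T,)
    h_norm = np.linalg.norm(hidden, axis=-1)             # ||h_j||, shape (T_repeat,)
    scaled = attn * (v_norm[None, :] / h_norm[:, None])  # a_ji * ||W_O v_i|| / ||h_j||
    return scaled.max(axis=0)                            # max over j, one score per position
```

The max over attending positions j matches the formula above; in practice this is exactly the double-pass cost that the surrogate in the next subsection avoids.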

2.2 Surrogate Modeling

To circumvent expensive oracle computation, KVzap trains small, per-(layer, head) surrogates f_{ℓ,h} mapping hidden states h_t ∈ ℝ^{D_h} to approximations of log s⁺_{t,h}. Two architectures are supported:

  • KVzap-Linear: A single projection matrix W ∈ ℝ^{D_h × H} generates all head scores.
  • KVzap-MLP: A two-layer MLP with a hidden size of D_h/8 and GELU activation, outputting H scores.

Training uses 1.2 million hidden state–score pairs generated from diverse domains (English, multilingual text, code, math); validation R² scores per head on held-out data fall in the 0.63–0.77 range.
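The two surrogate architectures can be sketched as follows (a minimal numpy sketch with untrained stand-in weights and biases omitted; the paper trains one surrogate per layer and head, and the function names here are mine):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def kvzap_linear(h, W):
    """Linear surrogate: hidden states (T, D_h) -> per-head log-scores (T, H)."""
    return h @ W                      # W: (D_h, H)

def kvzap_mlp(h, W1, W2):
    """Two-layer MLP surrogate with a D_h/8 bottleneck and GELU."""
    return gelu(h @ W1) @ W2          # W1: (D_h, D_h//8), W2: (D_h//8, H)
```

Both variants emit one score per head from the same hidden state, which is what allows per-head pruning decisions at negligible cost.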

2.3 Threshold-Based Pruning

At inference, after the QKV projection in each layer, the surrogate produces a score matrix S ∈ ℝ^{T × H}. Given a fixed global threshold τ, only positions with S[t, h] ≥ τ are retained per head; the rest are pruned. To safeguard local recency information, the last w = 128 tokens are always retained (sliding window). This adaptively tunes the cache compression ratio to each prompt's "information density."

Prefilling pseudocode:

import numpy as np

def compress(hidden_states, keys, values, kvzap_model, tau, window=128):
    scores = kvzap_model(hidden_states)   # importance scores, shape (T, H)
    scores[-window:] = np.inf             # always preserve the most recent tokens
    keep_mask = scores >= tau             # boolean mask, shape (T, H)
    new_keys = keys[keep_mask]            # per-head selection (result is ragged
    new_values = values[keep_mask]        # across heads, stored flattened)
    return new_keys, new_values

2.4 Decoding Adaptation

During decoding, pruning applies as in prefilling but operates on a rolling buffer. Each generation step uses updated surrogate scores from the new hidden state to decide which prior KV pairs to evict, keeping the last w tokens unpruned. Because surrogates consume only already-available hidden states, no additional cache reads are required.

3. Comparison to Prior KV Cache Pruning Methods

A head-to-head comparison highlights KVzap's improvements in speed, compatibility, and fidelity.

| Method | Prefill Passes | Decoding-Compatible | Overhead | Max CR | Accuracy Loss |
|---|---|---|---|---|---|
| KVzip | 2× (double-length) | No | High | — | ≈0% |
| KVzip+ (oracle) | 2 | No | Very High | — | ≈0% |
| ExpectedAttn | 1 | Yes | Medium | ≈2× | 1–3% |
| DuoAttention | 1 | Yes | Medium | ≈2× | 2–5% |
| Compactor | 1 | Yes | Medium | ≈2.5× | 1–3% |
| KVzap-Linear | 1 | Yes | ≈0.02% FLOPs | — | ≤0.2% |
| KVzap-MLP | 1 | Yes | ≈1.1% FLOPs | 3.5× | ≤0.2% |

KVzap eliminates the double-prefill overhead of KVzip, allows pruning during decoding, enables adaptive CR per prompt, and introduces negligible extra computation, enabling direct integration with high-efficiency attention kernels.

4. Implementation and Deployment Considerations

4.1 Prefilling Pipeline

Tokenized input is embedded and fed through the transformer layers. At each layer ℓ:

  • QKV projections and attention yield the hidden state h^ℓ.
  • kvzap_model_ℓ produces importance scores.
  • KV pairs with scores at or above the threshold τ are retained, along with the last w tokens of the sliding window.

4.2 Decoding Pipeline

For steps t = 1…T_dec:

  • Compute the query Q^t for the current hidden state.
  • Retrieve the pruned caches K_pruned, V_pruned for each layer and head.
  • Compute attention and the output hidden state h^t.
  • Update the per-head score buffer using kvzap_model_ℓ(h^t) and prune as necessary.
  • Generate the next token.
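The eviction step above can be sketched for a single head as follows (a minimal sketch; the function name, argument shapes, and rolling-buffer layout are my assumptions, not the paper's API):

```python
import numpy as np

def decode_step(h_t, k_t, v_t, cache_k, cache_v, cache_scores,
                kvzap_model, tau, window=128):
    """One decoding step for a single head: append the new KV pair,
    score it from the already-available hidden state, and evict entries
    below tau that fall outside the recency window."""
    cache_k = np.concatenate([cache_k, k_t[None]])        # (T+1, D_head)
    cache_v = np.concatenate([cache_v, v_t[None]])
    score_t = kvzap_model(h_t)                            # score for the new token
    cache_scores = np.concatenate([cache_scores, [score_t]])

    T = len(cache_scores)
    keep = cache_scores >= tau
    keep[max(0, T - window):] = True                      # never evict recent tokens
    return cache_k[keep], cache_v[keep], cache_scores[keep]
```

Note that no cache reads beyond the normal attention access are needed: the score comes from h^t alone.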

4.3 Surrogate Model Storage and Compute

KVzap-Linear stores L weight matrices of size (D_h × H) in fp16 or int8; KVzap-MLP stores 2L matrices of sizes (D_h × D_h/8) and (D_h/8 × H). Relative FLOPs overhead per transformer layer:

| Model | KVzap-MLP | KVzap-Linear |
|---|---|---|
| Qwen3-8B | 1.09% | 0.02% |
| Llama-3.1-8B | 0.96% | 0.02% |
| Qwen3-32B | 0.67% | 0.01% |
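These figures can be sanity-checked with a back-of-envelope FLOPs model. The accounting below (4·d² for the attention projections plus 2·d·d_ff for the feed-forward block, multiply-adds counted as 2 FLOPs) is my own assumption, not the paper's, so it reproduces the table only to order of magnitude:

```python
def surrogate_overhead(d_model, n_heads, d_ff, variant="linear"):
    """Rough per-token FLOPs ratio of a KVzap surrogate vs. one transformer
    layer. Back-of-envelope only; ignores attention itself and norms."""
    layer = 2 * (4 * d_model**2 + 2 * d_model * d_ff)
    if variant == "linear":
        surrogate = 2 * d_model * n_heads                 # one (D_h x H) matmul
    else:  # two-layer MLP with a d_model/8 bottleneck
        hidden = d_model // 8
        surrogate = 2 * (d_model * hidden + hidden * n_heads)
    return surrogate / layer
```

With Qwen3-8B-like dimensions this yields fractions of a percent for the linear variant and on the order of 1% for the MLP, consistent with the table's scale.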

5. Empirical Results Across Models and Tasks

Summary results demonstrate that KVzap maintains faithfulness and achieves competitive or state-of-the-art cache compression.

| Model | KVzap Variant | τ | RULER 4k | RULER 16k | LongBench | AIME25 pass@4 | Avg. CR |
|---|---|---|---|---|---|---|---|
| Qwen3-8B | MLP | –4 | 95.32→95.09 (0.74) | 92.99→92.78 (0.72) | 46.74→46.49 (0.66) | 0.77→0.77 (0.75) | 3.5× |
| Llama-3.1-8B | Linear | –7 | 95.69→95.55 (0.68) | 93.42→93.29 (0.70) | 45.25→44.65 (0.62) | — | 3.0× |
| Qwen3-32B | MLP | –4 | 95.65→95.95 (0.68) | 95.19→94.96 (0.65) | 50.56→50.40 (0.57) | 0.83→0.87 (0.60) | 2.7× |

(Numbers "A→B (x)" show baseline → pruned accuracy, with x the fraction of cache kept.)

  • On RULER (4k/16k tokens), both MLP and Linear surrogates maintain ≥95% accuracy up to 3–4× CR, matching or slightly exceeding the KVzip+ oracle.
  • LongBench results show near-baseline performance (±0.2%) up to 2–3× CR, with adaptive CR in denser, real-world contexts.
  • On AIME25 reasoning, the MLP variant preserves pass@1/4 rates even when discarding over 50% of the cache.
  • A 3× cache reduction translates directly into lower memory traffic and higher decoding throughput in long-context generation, provided the attention kernel supports the pruned layout.
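To make the memory claim concrete, here is the standard KV cache size arithmetic applied to an illustrative long-context configuration (the layer/head counts below are assumed for this sketch, not taken from the paper):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2):
    """Total KV cache size: keys + values, across all layers and KV heads
    (fp16 by default, hence 2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem

# Illustrative config: 36 layers, 8 grouped KV heads, head dim 128,
# 32k-token context, fp16.
full_bytes = kv_cache_bytes(36, 8, 128, 32_768)   # 4.5 GiB exactly
pruned_bytes = full_bytes // 3                    # ~3x compression as above
```

At this scale, 3× compression frees roughly 3 GiB per sequence, which is where the throughput gains in batched serving come from.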

6. Insights, Ablations, and Leaderboard Placement

  • Threshold vs. Top-k Pruning: A fixed τ introduces per-prompt compression variability (±20%) aligned with the information density of each prompt. This thresholding approach outperforms fixed top-k selection across heads and layers.
  • Sliding Window Ablation: Disabling recent-token preservation (w = 0) results in dramatic accuracy loss (28% on LongBench). Setting w = 128 recovers near-baseline performance (62.5%), with no substantial gains beyond w = 512.
  • Surrogate Robustness: Despite moderate R² values (0.63–0.77), pruning is highly robust due to the stability of the importance ranking.
  • Leaderboard Standing: On the NVIDIA KVpress leaderboard, KVzap variants offer the best trade-off between cache compression and RULER-4k accuracy among 15+ published methods, matching results formerly attainable only by non-deployable, oracle-style approaches.
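The threshold-vs-top-k distinction can be illustrated with synthetic scores (the score distributions below are stand-ins, not real model outputs): a fixed threshold keeps more of an information-dense prompt and less of a redundant one, while top-k keeps the same fraction of both.

```python
import numpy as np

def keep_fraction_threshold(scores, tau):
    """Fraction of KV pairs retained under a fixed global threshold."""
    return float((scores >= tau).mean())

def keep_fraction_topk(scores, k):
    """Top-k retains exactly k entries, independent of prompt content."""
    return min(k, len(scores)) / len(scores)

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 1.0, size=2000)     # stand-in: information-dense prompt
sparse = rng.normal(-2.0, 1.0, size=2000)   # stand-in: redundant prompt
```

With τ = –1, the threshold keeps roughly 84% of the dense prompt's scores but only about 16% of the sparse one's, which is precisely the per-prompt adaptivity described above.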

7. Significance in Production LLM Systems

KVzap fulfills key deployment criteria for LLM inference: rapid pruning based on information-aware surrogates, compatibility with decoding and streaming workloads, kernel-friendliness, and negligible reduction in model quality. By leveraging learned approximations of oracle-style scoring, it allows for significant memory savings and inference acceleration in practical, large-scale deployments without requiring model retraining or substantial engineering overhead (Jegou et al., 12 Jan 2026).
