Cross-Layer KV Cache Reuse Techniques
- Cross-layer KV cache reuse is a set of techniques for sharing and compressing key–value caches across transformer layers to reduce memory cost and redundant computation.
- Methods such as fusion, singular value decomposition, and quantization exploit inter-layer similarities to boost throughput and enable long-context inference.
- System-level implementations extend these approaches across multi-query and multi-agent scenarios, optimizing cache placement to achieve significant speedup and memory reduction.
Cross-layer KV cache reuse encompasses a class of techniques for reducing the memory footprint and redundant computation in transformer-based LLMs by enabling the sharing, merging, compression, or reconstruction of key–value (KV) cache states across multiple transformer layers or across queries. This paradigm leverages both empirical inter-layer similarity and application-level redundancy, accelerating inference, increasing throughput, and directly enabling deployment in long-context and multi-agent regimes. Methods range from architectural strategies that reconstruct or fuse KV states at top layers, to algorithmic post-training approaches—such as singular value decomposition or quantization—and system-level interventions that manage cache placement and movement.
1. Motivations and Principles of Cross-Layer KV Cache Reuse
Transformer inference typically requires maintaining per-layer KV caches for all past tokens, resulting in memory costs scaling linearly with the number of layers and context length. For long contexts, the KV cache alone may exceed the model's parameter footprint. Layer-wise reuse seeks to address two distinct sources of inefficiency:
- Redundant computation across queries or tasks: Applications such as retrieval-augmented generation, multi-turn chat, or multi-agent pipelines frequently replay the same documents or textual segments, resulting in unnecessary context encoding (Yang et al., 21 Feb 2025, Ye et al., 14 Oct 2025).
- Structural redundancy across layers: Adjacent layers often encode the same hidden information in sufficiently similar subspaces, allowing KV cache merging or sharing (Lin et al., 3 Dec 2025, Chang et al., 24 Mar 2025, Wang et al., 22 Aug 2025).
Reuse is effective when attention relationships or latent representations remain stable across layers and queries. It is further strengthened by exploiting domain-specific patterns (e.g., chain-of-thought repetition in mathematical reasoning (Chen et al., 29 Jul 2025)) and by leveraging static or dynamic analysis to determine which layers or blocks can share cache without quality loss.
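The layer-similarity criterion can be made concrete with a simple diagnostic: compute cosine similarity between adjacent layers' flattened key caches and flag pairs above a threshold as sharing candidates. The sketch below is illustrative, not any paper's published procedure; the `shareable_pairs` helper and the 0.9 threshold are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def shareable_pairs(layer_keys, threshold=0.9):
    """Return indices i where layer i+1's keys are similar enough
    to layer i's to be candidates for cache sharing."""
    return [i for i in range(len(layer_keys) - 1)
            if cosine(layer_keys[i], layer_keys[i + 1]) >= threshold]

# Toy caches: layers 0 and 1 nearly parallel, layer 2 pointing elsewhere.
keys = [[1.0, 0.0, 1.0], [0.9, 0.1, 1.1], [0.0, 1.0, -1.0]]
print(shareable_pairs(keys))  # only the (0, 1) pair qualifies
```

Real systems apply this kind of analysis per head or per block, and often combine it with accuracy probes before committing to a sharing configuration.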
2. Architectural and Algorithmic Methods
Formal cross-layer KV reuse schemes fall into four main categories:
(a) Cache Reconstruction and Fusion
Architectural approaches reconstruct upper-layer caches as functions of lower/middle-layer representations, achieving substantial reductions. FusedKV introduces per-channel, learnable fusion of the most informative bottom and middle layer KV caches for top layers, preserving relative positional encoding by operating on post-RoPE keys. Its variant FusedKV-Lite applies direct asymmetric sharing: top-layer values are copied from the bottom layer, while keys are taken from the middle layer (Lin et al., 3 Dec 2025).
K^{(i)} = w_{i,1} ⊙ K^{(1)} + w_{i,n} ⊙ K^{(n)}
V^{(i)} = v_{i,1} ⊙ V^{(1)} + v_{i,n} ⊙ V^{(n)}

Here ⊙ denotes element-wise (per-channel) multiplication, and w_{i,·}, v_{i,·} are learnable fusion weights combining the bottom (layer 1) and middle (layer n) source caches for top layer i.
Fusion preserves distinct per-layer features; empirically, FusedKV achieves up to 50 % cache reduction with validation perplexity no worse than full-cache models.
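The per-channel fusion rule can be sketched in a few lines, ignoring RoPE handling, weight training, and tensor batching; plain Python lists stand in for tensors, and the toy weights are assumptions:

```python
def fuse_kv(cache_bottom, cache_mid, w_bottom, w_mid):
    """Per-channel fusion of two source-layer caches into one
    reconstructed top-layer cache (one weight pair per channel)."""
    return [
        [wb * b + wm * m for wb, wm, b, m in zip(w_bottom, w_mid, tok_b, tok_m)]
        for tok_b, tok_m in zip(cache_bottom, cache_mid)
    ]

# Two tokens, three channels per token.
k_bottom = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
k_mid    = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
w_b, w_m = [0.5, 0.5, 1.0], [0.5, 0.5, 0.0]
print(fuse_kv(k_bottom, k_mid, w_b, w_m))  # → [[0.5, 1.5, 3.0], [2.5, 2.5, 6.0]]
```

The FusedKV-Lite variant corresponds to the degenerate case where each weight vector is all-ones for one source layer and all-zeros for the other.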
(b) SVD-Based Cross-Layer Parameter Sharing
CommonKV and xKV apply singular value decomposition (SVD) over concatenated key/value projection matrices of adjacent layers or larger layer groups. This exposes a shared low-rank latent subspace, into which token representations are projected before being linearly mapped back to per-layer keys/values (Wang et al., 22 Aug 2025, Chang et al., 24 Mar 2025).
- Grouping: Layers partitioned into groups; group SVD yields shared basis.
- Latent cache: Tokens are stored in latent space; layer-specific keys/values are reconstructed via linear mapping.
- Adaptive merging: CommonKV incorporates cosine similarity and Fisher information to dynamically allocate merging budgets across groups.
Empirical compression rates reach 50–98 % (when combined with quantization/eviction) with minimal accuracy drop.
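The latent-cache mechanics can be sketched as below, assuming the shared basis and per-layer up-projections were obtained offline (e.g., via SVD of concatenated projection matrices); the matrices here are toy values, not learned ones:

```python
def matmul(A, B):
    """Naive matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Shared rank-2 latent basis for a group of layers (d_model=3 -> rank=2),
# assumed precomputed offline from an SVD of concatenated key projections.
basis = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
up_maps = {0: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # latent -> layer-0 keys
           1: [[0.5, 0.0, 0.0], [0.0, 2.0, 0.0]]}   # latent -> layer-1 keys

tokens = [[2.0, 4.0, 9.0]]            # one token's hidden state
latent_cache = matmul(tokens, basis)  # store only the rank-2 latent
k_layer1 = matmul(latent_cache, up_maps[1])
print(latent_cache, k_layer1)  # → [[2.0, 4.0]] [[1.0, 8.0, 0.0]]
```

Only the latent cache is kept in memory; per-layer keys/values are rematerialized on demand, which is where the 50–98 % compression comes from.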
(c) Quantization and Cross-Layer Integer Sharing
XQuant partitions layers into groups and applies quantization with data-free calibration, storing only one shared integer cache per group and reconstructing per-layer values through layer-dependent zero-point and scale parameters (Yang et al., 13 Oct 2025).
- Sub-1.4 bit storage: Effective bit-width is below 1.4 with maintained or improved accuracy versus full-precision and other quantization baselines.
- Accelerated variant: Quantizes only the dominant layer in each group; subordinate layers reuse via parameterized dequantization.
Group sizes beyond two yield diminishing returns; empirical benchmarks favor G=2.
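A minimal sketch of the shared-integer-cache idea follows; the scale/zero-point values are illustrative assumptions, and XQuant's actual data-free calibration is more involved:

```python
def quantize(x, scale, zero):
    """Map floats to integers under an affine scale/zero-point."""
    return [round(v / scale + zero) for v in x]

def dequantize(q, scale, zero):
    """Recover approximate floats from the shared integer cache."""
    return [(v - zero) * scale for v in q]

# One shared integer cache per layer group: quantize the dominant layer,
# then reconstruct each member layer with its own scale / zero-point.
dominant = [0.4, -0.2, 0.8]
shared_ints = quantize(dominant, scale=0.2, zero=0)   # [2, -1, 4]

layer_params = {0: (0.2, 0), 1: (0.1, 1)}             # per-layer (scale, zero)
recon = {l: dequantize(shared_ints, s, z) for l, (s, z) in layer_params.items()}
print(shared_ints, recon[1])
```

Storing one integer tensor per group plus a handful of per-layer scalars is what drives the effective bit-width below the per-layer quantization floor.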
(d) Proxy Layer Sharing and Layer Mapping
YOCO, CLA, and unified frameworks (e.g., Pizza/Sandwich/Lasagna configurations) model sharing as a mapping kv: layer → source layer, with proxy layers borrowing keys/values from nearby layers. Moderate reduction (up to 2x) closely approaches baseline accuracy with up to 1.5x throughput improvement; more substantial reduction requires pairing proxies with upper layers, at moderate prefill and training overhead (Wu et al., 2024, Lin et al., 3 Dec 2025).
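The kv: layer → source-layer mapping amounts to a small indirection table in front of the cache store. The sketch below uses one illustrative grouping rule (the first layer of each consecutive group acts as the proxy), not any specific published configuration:

```python
def make_layer_map(n_layers, group):
    """Map each layer to the source layer whose KV cache it reuses:
    the first layer of each consecutive group acts as the proxy."""
    return {l: (l // group) * group for l in range(n_layers)}

class SharedKVCache:
    def __init__(self, n_layers, group=2):
        self.kv_map = make_layer_map(n_layers, group)
        self.store = {}                   # caches kept only for proxy layers

    def put(self, layer, kv):
        if self.kv_map[layer] == layer:   # only proxy layers write
            self.store[layer] = kv

    def get(self, layer):
        return self.store[self.kv_map[layer]]

cache = SharedKVCache(n_layers=4, group=2)
cache.put(0, "kv0"); cache.put(1, "kv1_ignored"); cache.put(2, "kv2")
print(cache.get(1), cache.get(3))  # layers 1 and 3 reuse layers 0 and 2
```

Pizza/Sandwich/Lasagna-style configurations differ only in how `kv_map` is constructed (which layers act as sources, and for whom).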
3. System and Serving Layer Techniques
(a) Context-Level Cache Orchestration
KVLink, LMCache, AdaptCache, and KVComm extend KV cache sharing to the system level, enabling efficient orchestration across queries, sessions, and storage hierarchies (Yang et al., 21 Feb 2025, Cheng et al., 8 Oct 2025, Feng et al., 28 Aug 2025, Ye et al., 14 Oct 2025).
- Indexed KV cache pooling: LMCache exposes a standardized API for lookup, pinning, movement, and compression across GPU/CPU/NVMe/network layers, supporting both prefix reuse and prefill-decode disaggregation.
- Enterprise deployment: LMCache and AdaptCache optimize the physical placement and compression of KV entries across DRAM/SSD tiers using greedy multi-choice knapsack formulations, achieving up to 15x higher throughput and 56 % lower delay.
- Cross-query/context alignment: KVLink independently precomputes document-level KV caches and at inference concatenates, realigns positional encodings, and injects link tokens to restore self-attention. KVComm interpolates context offsets using online anchor pools for multi-agent systems, enabling up to 7.8x speedup with up to 95 % reuse rate.
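The offset bookkeeping behind concatenating independently precomputed document caches can be sketched as follows. Note this shows only position realignment; KVLink's actual mechanism additionally adjusts RoPE encodings and injects link tokens, neither of which is modeled here:

```python
def realign(doc_caches):
    """Concatenate independently precomputed per-document caches,
    shifting each document's token positions by a running offset so
    the merged cache looks like one contiguous context."""
    merged, offset = [], 0
    for cache in doc_caches:                 # cache: list of (pos, kv) pairs
        merged.extend((pos + offset, kv) for pos, kv in cache)
        offset += len(cache)
    return merged

doc_a = [(0, "a0"), (1, "a1")]   # each doc encoded with positions from 0
doc_b = [(0, "b0"), (1, "b1"), (2, "b2")]
print(realign([doc_a, doc_b]))
# → [(0, 'a0'), (1, 'a1'), (2, 'b0'), (3, 'b1'), (4, 'b2')]
```

Because each document's cache is computed once and reused under any ordering, the expensive context-encoding step is amortized across queries.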
Table: System-level KV cache techniques
| Framework | Reuse Domain | Compression | Notable Metric |
|---|---|---|---|
| KVLink | Retrieved docs | Yes | TTFT ↓96 %, accuracy +4 % |
| LMCache | Engines/queries | Yes | Throughput ↑15 × |
| AdaptCache | Storage tiers | Yes | Delay ↓2.4 × @same quality |
| KVComm | Multi-agent | No | TTFT ↓7.8 × |
(b) Infilling and Interactive Prompt Transformation
EFIM refines fill-in-the-middle prompt formats for cross-request KV reuse in interactive infilling, preserving both prefix and suffix caches across sessions by carefully transforming prompts and training on fragment-tokenized data. This nearly doubles throughput (↓52% latency, ↑98% throughput) with high infilling accuracy (Guo et al., 28 May 2025).
4. Specialized Algorithms: Memory-Efficient Inference and Eviction
(a) Collaborative Filtering and Block-Level Zero-Copy
MemShare targets large reasoning models by using a collaborative filtering algorithm: it first filters by step-level lexical similarity, then performs block-level Euclidean distance matching across layers to enable zero-copy pointer-sharing in paged memory attention. This achieves up to 85 % throughput increase with 20–40 % cache memory reduction (Chen et al., 29 Jul 2025).
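The two-stage match can be sketched roughly as below; the Jaccard lexical filter, both thresholds, and the toy 2-dimensional KV blocks are illustrative assumptions, not the paper's exact formulation:

```python
import math

def lexical_overlap(a, b):
    """Jaccard overlap between two steps' token sets (stage-1 filter)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def find_reusable_block(new_step, new_kv, pool, lex_thr=0.5, dist_thr=0.1):
    """Two-stage match: cheap lexical filter first, then KV-block
    distance. On a hit, return the pooled block so the caller can
    point-share it instead of copying."""
    for step, kv in pool:
        if (lexical_overlap(new_step, step) >= lex_thr
                and euclidean(new_kv, kv) <= dist_thr):
            return kv
    return None

pool = [("add 2 and 3", [0.1, 0.2]), ("multiply 4 by 5", [0.9, 0.8])]
hit = find_reusable_block("add 2 and 3 again", [0.1, 0.21], pool)
print(hit)  # matches the first pooled block
```

The lexical pre-filter keeps the expensive distance comparison off the common path, which is what makes block matching cheap enough to run during decoding.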
(b) Adaptive Layer-Wise Preference and Eviction
CAKE allocates cache slots to layers based on spatial and temporal attention metrics (entropy, variance), solves the allocation as a proportional cake-slicing optimization, and cascades eviction across layers during prefill and decoding. This adaptive cross-layer strategy consistently outperforms uniform or fixed-pattern baselines under tight budgets, yielding up to 10× speedup at 128K context (Qin et al., 16 Mar 2025).
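The proportional, preference-weighted budget split can be sketched as a largest-remainder allocation; the scores and rounding rule here are illustrative, not CAKE's exact optimization:

```python
def allocate_budget(layer_scores, total_slots):
    """Split a global cache budget across layers in proportion to each
    layer's preference score (e.g., attention entropy + variance)."""
    total = sum(layer_scores)
    alloc = [int(total_slots * s / total) for s in layer_scores]
    # Hand out slots lost to integer truncation, largest remainder first.
    by_remainder = sorted(
        range(len(alloc)),
        key=lambda i: total_slots * layer_scores[i] / total - alloc[i],
        reverse=True)
    for i in by_remainder[: total_slots - sum(alloc)]:
        alloc[i] += 1
    return alloc

scores = [4.0, 1.0, 3.0]              # per-layer entropy/variance scores
print(allocate_budget(scores, 10))    # → [5, 1, 4]
```

Layers whose attention is diffuse (high entropy) or volatile (high variance) receive more slots, while peaky layers can be evicted aggressively without quality loss.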
5. Practical Implementations and Trade-offs
- Preprocessing and online cost: SVD-based approaches can amortize decomposition cost offline; greedy system optimizers update on insertion or under overload conditions (AdaptCache, CommonKV).
- Decoding throughput: Proxy layer mapping, adaptive block-reuse, and fused reconstructions generally avoid any runtime penalty, yielding marked improvements in TTFT and overall system throughput.
- Accuracy versus memory: Aggressive cache sharing or quantization (>60% reduction, <1.2 bits) may incur task- or model-dependent performance drop. Most techniques enable tuning of merging rank, quantization bits, or mapping strategy to optimize the memory–accuracy frontier.
Table: Empirical throughput and memory savings
| Method | Compression Rate | Throughput Gain | Accuracy Loss |
|---|---|---|---|
| FusedKV | 50 % | ~2 × | ≤0.5 % |
| xKV | 6.8 × | 2–3 × | –2.7 % to +2.7 % |
| MemShare | 20–40 % | 50–85 % | <2.5 % |
| AdaptCache | – | 1.43–2.4 × (TTFT ↓) | 0–15 % (user-set) |
| EFIM | – | 98 % | None |
6. Limitations, Hybridization, and Research Directions
- Applicability: Some methods require architectural modification and/or retraining (YOCO, CLA, FusedKV). Others are training-free (CommonKV, xKV, XQuant, AdaptCache).
- Orthogonality: Many cross-layer techniques (CommonKV, KVSharer, KV-CAR) can be combined with intra-layer quantization, head reduction, or eviction for synergistic gains (Wang et al., 22 Aug 2025, Yang et al., 2024, Roy et al., 7 Dec 2025).
- Granularity: Block-level sharing (MemShare, paged attention) operates at 128 tokens/granule; head-level (KV-CAR) at individual attention heads; group-level (xKV/CommonKV) at clusters of layers.
- Open challenges: Developing adaptive, context-sensitive mappings; unifying system-level orchestration across dynamic hardware/cache constraints; extending robust cross-layer reuse to multi-modal, retrieval, or streaming frameworks.
7. Concluding Insights
Cross-layer KV cache reuse represents a convergence of model design, algorithmic optimization, and system engineering to address the core bottleneck of large-context transformer inference. By exploiting inter-layer, inter-query, and inter-agent redundancy, and combining fusion, singular value decomposition, quantization, block-level pointer reuse, and adaptive system placement, these methods collectively deliver order-of-magnitude improvements in speed and memory—often without significant loss in accuracy or generality. The modularity and composability of recent approaches such as CommonKV, KVLink, LMCache, and xKV strongly suggest a path toward unified cache-aware inference fabrics for next-generation LLM serving and deployment.
Key References:
KVLink (Yang et al., 21 Feb 2025); EFIM (Guo et al., 28 May 2025); LMCache (Cheng et al., 8 Oct 2025); MemShare (Chen et al., 29 Jul 2025); FusedKV (Lin et al., 3 Dec 2025); XQuant (Yang et al., 13 Oct 2025); CommonKV (Wang et al., 22 Aug 2025); CAKE (Qin et al., 16 Mar 2025); KVSharer (Yang et al., 2024); xKV (Chang et al., 24 Mar 2025); CLLA (Yang et al., 2024); KVCOMM (Ye et al., 14 Oct 2025); AdaptCache (Feng et al., 28 Aug 2025); KV-CAR (Roy et al., 7 Dec 2025); Systematic Framework (Wu et al., 2024).