
EpiCache Framework: Memory-Efficient KV Cache Management

Updated 17 February 2026
  • EpiCache is a training-free key-value cache management framework that combines block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation.
  • It caps GPU memory growth in long-context conversational QA while maintaining over 95% baseline accuracy under 4–6× compression.
  • The framework significantly reduces decoding latency and memory usage, outperforming contemporary cache strategies in multi-turn dialogue settings.

EpiCache is a training-free key-value (KV) cache management framework designed to enable efficient long-context conversational question answering (LongConvQA) with LLMs under strict, constant memory budgets. The framework addresses the unbounded linear growth of KV caches during extended multi-turn conversations by combining three core innovations: block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation. EpiCache maintains near-full-KV accuracy (retaining up to 95% of baseline accuracy under 4–6× compression) while substantially reducing GPU memory usage and decoding latency (Kim et al., 22 Sep 2025).

1. Problem Formulation: KV Cache Growth in LongConvQA

Autoregressive LLMs use a KV cache at each transformer layer to store past attention keys and values, so that for a dialogue of N tokens, L layers, and H heads, the cache scales as \|KV\| = L \times H \times N. In LongConvQA settings, where the dialogue length N becomes very large (hundreds of turns or multi-day sessions), the memory required for storing KV caches quickly exhausts available GPU RAM: M_{\text{peak}} = O(N). Traditional approaches like post-prefill eviction or query-dependent cache compression either induce unbounded peak memory or collapse relevant context, leading to degraded multi-turn accuracy. The goal is to satisfy, under a strict per-layer token budget M, that for all queries q_i,

f_{\text{LLM}}(q_i \mid KV_H^C) \approx f_{\text{LLM}}(q_i \mid KV_H), \quad \text{with} \quad \|KV_H^C\| \leq L \times H \times M,

throughout prefill and decode, thus bounding peak memory across arbitrarily long conversations (Kim et al., 22 Sep 2025).
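To make the O(N) growth concrete, the sketch below estimates raw KV cache size as a function of dialogue length. The layer count, KV-head count, head dimension, and fp16 precision are illustrative assumptions for a LLaMA-3.2-3B-like configuration, not figures taken from the paper.

```python
# Rough KV cache size: 2 tensors (keys and values) per layer,
# each of shape (H heads, N tokens, head_dim), stored in fp16 (2 bytes).
# Config values below are assumed, roughly LLaMA-3.2-3B-like.
def kv_cache_bytes(n_tokens, n_layers=28, n_heads=8, head_dim=128, bytes_per=2):
    return 2 * n_layers * n_heads * n_tokens * head_dim * bytes_per

for n in (4_000, 32_000, 100_000):
    print(f"N={n:>7}: {kv_cache_bytes(n) / 1e9:.2f} GB")
```

The linear dependence on N is what EpiCache replaces with a constant bound in M.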

2. EpiCache Framework Overview

EpiCache constrains KV cache growth using a combination of block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation:

  • Block-wise Prefill Eviction processes the conversation in blocks of size B, evicting cache entries at each block boundary so the cache never exceeds M + B entries.
  • Episodic KV Compression segments the conversation into E topical episodes via clustering, building a separate compressed KV cache for each.
  • Adaptive Layer-wise Budget Allocation assigns the memory budget per layer according to measured sensitivity to key eviction, preserving representational integrity where it is most critical.

This ensemble of techniques allows EpiCache to cap GPU memory, preserve topic-relevant context, and satisfy the requirements of multi-turn dialogue systems without retraining (Kim et al., 22 Sep 2025).

3. Key Algorithms and Techniques

3.1 Block-wise Prefill Eviction

Block-wise prefill divides the input into non-overlapping blocks of B tokens. After each block is prefilled, token importance scores s_i are computed (method in Section 3.2.2), and only the top-M tokens by s_i are retained; the rest are evicted. The process guarantees that the peak cache never exceeds M + B, regardless of the total dialogue history length.

Pseudocode summary:

Input: history H, total tokens N, budget M, block size B
KV_cache ← ∅
for t = 1 to N step B:
    prefill tokens x_{t:t+B-1}, appending their entries to KV_cache
    compute importance s_i for each token in KV_cache
    evict all but the top-M tokens by s_i
end for

This method ensures a flat GPU memory profile as N increases, essential for deployment on resource-constrained hardware (Kim et al., 22 Sep 2025).
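The loop above can be sketched in Python. The importance scores are supplied as plain numbers here, standing in for the attention-derived s_i, so this only illustrates the eviction bookkeeping and the M + B peak bound, not the scoring itself.

```python
import heapq

def blockwise_prefill(scores, M, B):
    """Simulate block-wise prefill eviction over precomputed token
    importance scores (a stand-in for the attention-based s_i).
    Returns the retained token indices and the peak cache size seen."""
    cache = []   # (score, token_index) pairs currently in the KV cache
    peak = 0
    for start in range(0, len(scores), B):
        block = [(scores[i], i) for i in range(start, min(start + B, len(scores)))]
        cache.extend(block)               # prefill the next block
        peak = max(peak, len(cache))      # cache holds at most M + B entries here
        cache = heapq.nlargest(M, cache)  # evict all but the top-M tokens
    return sorted(i for _, i in cache), peak

kept, peak = blockwise_prefill(scores=[0.1 * i for i in range(20)], M=4, B=5)
print(kept, peak)  # peak never exceeds M + B = 9
```

With monotonically increasing scores the final cache holds the last four tokens, and the peak stays at M + B even though 20 tokens were processed.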

3.2 Episodic KV Compression

3.2.1 Clustering into Episodes

Dialogue history H is partitioned into K segments \{S_k\}, each spanning w_{\text{embed}} utterances. Segment embeddings e_k are computed using a lightweight encoder (e.g., MiniLM, Qwen3). K-means clusters \{C_e\}_{e=1}^{E} are formed, and each cluster’s medoid is selected as the representative prompt P_e.
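A minimal sketch of this segmentation step, using a toy k-means in NumPy over precomputed segment embeddings; the deterministic initialization and the fixed iteration count are simplifications for illustration, not the paper's procedure.

```python
import numpy as np

def cluster_episodes(seg_embeddings, E, iters=20):
    """Toy k-means over segment embeddings (deterministic, evenly spaced
    seeds). Returns labels, centroids, and per-cluster medoid indices;
    the medoid stands in for the representative prompt P_e."""
    X = np.asarray(seg_embeddings, dtype=float)
    centroids = X[np.linspace(0, len(X) - 1, E).astype(int)].copy()
    for _ in range(iters):
        # assign each segment to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for e in range(E):
            if (labels == e).any():
                centroids[e] = X[labels == e].mean(axis=0)
    medoids = []
    for e in range(E):
        members = np.where(labels == e)[0]
        if len(members) == 0:
            continue
        d = ((X[members] - centroids[e]) ** 2).sum(-1)
        medoids.append(int(members[np.argmin(d)]))  # member closest to centroid
    return labels, centroids, medoids

labels, centroids, medoids = cluster_episodes(
    [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]], E=2)
```

In practice the embeddings would come from the encoder, and the medoid segment's text serves as P_e.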

3.2.2 Episode-Specific Eviction

For each episode e, block-wise prefill is rerun over the full history, appending P_e after each block to bias attention scoring toward the episode’s topic. Token importance s_i is computed as

s_i = \max_{t \in \text{patch}} \mathrm{Attn}(x_t \rightarrow x_i),

ensuring selection of the tokens most relevant to the corresponding episode. The top-M tokens are retained to compose the episode’s compressed cache KV^C_e.
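The scoring rule can be sketched over a raw attention matrix. The single-head matrix and the patch_rows argument (the query positions of the appended episode prompt P_e) are illustrative assumptions about the interface, not the paper's implementation.

```python
import numpy as np

def episode_importance(attn, patch_rows):
    """s_i = max over query positions t in the appended episode prompt
    ("patch") of Attn(x_t -> x_i). `attn` is a (T_q, T_k) attention
    matrix for one head, for illustration."""
    return attn[patch_rows].max(axis=0)

def episode_cache_indices(attn, patch_rows, M):
    """Indices of the top-M tokens by episode importance, in order."""
    s = episode_importance(attn, patch_rows)
    return np.sort(np.argsort(s)[-M:])

attn = np.array([[0.1, 0.2, 0.7],   # row 0: a history token's attention
                 [0.6, 0.3, 0.1]])  # row 1: an episode-prompt token
print(episode_cache_indices(attn, patch_rows=[1], M=2))
```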

3.2.3 Online Retrieval and Decoding

At inference, each query q_i is embedded (u_i = f_{\text{embed}}(q_i)) and matched to the closest episode centroid. The associated compressed cache KV^C_{\hat{e}} is loaded onto the GPU and used during decoding.
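Centroid matching can be sketched as a nearest-neighbor lookup; the cosine similarity metric and function name are assumptions for illustration (the paper's exact similarity measure may differ).

```python
import numpy as np

def select_episode(query_embedding, centroids):
    """Match a query embedding u_i to the nearest episode centroid by
    cosine similarity; the decoder then loads that episode's compressed
    cache KV^C_e."""
    u = np.asarray(query_embedding, dtype=float)
    C = np.asarray(centroids, dtype=float)
    sims = (C @ u) / (np.linalg.norm(C, axis=1) * np.linalg.norm(u) + 1e-12)
    return int(np.argmax(sims))

centroids = [[1.0, 0.0], [0.0, 1.0]]   # toy episode centroids
print(select_episode([0.9, 0.1], centroids))  # matches episode 0
```

Because consecutive queries usually stay on topic, the selected episode (and its cache) changes infrequently, which keeps this retrieval step cheap.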

3.3 Adaptive Layer-wise Budget Allocation

Layer sensitivity to key eviction is determined by measuring representational drift:

s_l = 1 - \frac{1}{H \cdot N} \sum_{h=1}^{H} \sum_{i=1}^{N} \cos\left(K^{\text{full}}_{l,h,i}, K^{\text{block}}_{l,h,i}\right),

where K^{\text{full}}_{l,h,i} and K^{\text{block}}_{l,h,i} denote key states without and with block-prefill eviction, respectively. Sensitivities are sharpened via an exponent \alpha (\alpha \approx 1.1–1.3), and the total KV token budget is allocated as

M_l = \frac{s_l^\alpha}{\sum_{j=1}^{L} s_j^\alpha} \cdot (M \times L),

assigning larger capacities to layers with higher sensitivity and thus maintaining QA performance (Kim et al., 22 Sep 2025).
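The allocation formula maps directly to a few lines of NumPy. Rounding M_l to integer token counts, which a real implementation would need, is omitted in this sketch.

```python
import numpy as np

def allocate_layer_budgets(sensitivities, M, alpha=1.2):
    """M_l = s_l^alpha / sum_j s_j^alpha * (M * L): distribute the total
    model-wide token budget M * L across the L layers in proportion to
    their sharpened eviction sensitivities s_l."""
    s = np.asarray(sensitivities, dtype=float) ** alpha
    return s / s.sum() * (M * len(s))

# With alpha = 1 the split is exactly proportional to the sensitivities.
budgets = allocate_layer_budgets([0.1, 0.2, 0.3, 0.4], M=100, alpha=1.0)
print(budgets)  # [ 40.  80. 120. 160.], summing to M * L = 400
```

Larger alpha concentrates more of the budget on the most sensitive layers while keeping the total fixed at M * L.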

4. Theoretical Guarantees and Empirical Results

  • Memory Guarantee: Peak KV cache size is always \leq M + B, ensuring constant GPU memory usage O(M) independent of conversation length.
  • Empirical Accuracy Bound: Under compression ratios of 4–6× (M \approx B \approx N/6), EpiCache achieves >95% of full-KV QA accuracy across benchmarks.
  • Latency and Memory Reduction: On LLaMA-3.2-3B (at M = 4K), per-turn latency is reduced by up to 2.4× (from 68.9 ms to 28.1 ms), and peak memory by up to 3.5× (from 28.4 GB to 8.2 GB).
  • Scalability: For contexts up to 100K tokens (M = 6K), EpiCache approaches uncompressed performance, outperforming StreamingLLM, SnapKV, InfiniPot, KeyDiff, and KVzip-adapted baselines, which degrade at scale.

The following table summarizes benchmark results for EpiCache (M = 4K, LLaMA-3.2-3B):

Benchmark | Accuracy Gain over Baseline | Compression Ratio | Latency Reduction | Memory Reduction
Realtalk, LoCoMo, LongMemEval | +20–40% absolute | 4–6× | up to 2.4× | up to 3.5×

(Kim et al., 22 Sep 2025)

5. Implementation Details and Practical Considerations

  • Block Size B: Larger B yields faster coverage at the expense of higher temporary memory; smaller B minimizes the peak but slows prefill.
  • Episode Count E: A higher E (more episodes) can increase recall under tight memory but raises offline cache storage requirements.
  • Embedding Strategies: Accuracy is not highly sensitive to the embedding model; compact encoders suffice for practical segmentation.
  • Retrieval Overhead: Embedding, centroid matching, and KV loading contribute <5% of per-turn latency, as episode switches are infrequent.
  • Limitations: The episode count E is fixed and clustering is offline; there is no support for dynamic E or compressed/quantized KV storage. Performance depends on clustering quality and the choice of w_{\text{embed}}.

6. Comparison with Contemporary Baselines

Contemporary baselines such as StreamingLLM, SnapKV, InfiniPot, KeyDiff, KVzip, and context manipulation techniques (e.g., PrefixLM, position-independent caching as in EPIC (Hu et al., 2024)) also target memory-constrained LLM inference. Unlike query-dependent strategies that collapse all context to a single query (often harming multi-turn coherence), EpiCache preserves topic- and episode-relevant context and amortizes cache management across episodes, achieving superior QA accuracy and resource efficiency.

7. Limitations and Future Directions

Current constraints of EpiCache include the static selection of episode count EE, reliance on offline clustering, and the lack of quantized/compact KV representations. Future research directions proposed include dynamic adaptation of EE, integration of KV-quantization schemes, and improvements in conversational segmentation. These extensions may further reduce memory requirements and enable finer-grained, context-adaptive compression without sacrificing QA fidelity (Kim et al., 22 Sep 2025). A plausible implication is that incorporating quantization or dynamic clustering could push the efficiency-accuracy frontier further in long-context LLM serving.
