EpiCache Framework: Memory-Efficient KV Cache Management
- EpiCache is a training-free key-value cache management framework that combines block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation.
- It caps GPU memory growth in long-context conversational QA while retaining up to 95% of baseline accuracy under 4–6× compression.
- The framework significantly reduces decoding latency and memory usage, outperforming contemporary cache strategies in multi-turn dialogue settings.
EpiCache is a training-free Key-Value (KV) cache management framework designed to enable efficient, long-context conversational question answering (LongConvQA) with LLMs under strict and constant memory budgets. The framework addresses the unbounded, linear growth of KV caches during multi-turn, extended conversations by combining three core innovations: block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation. EpiCache maintains near-full-KV accuracy—retaining up to 95% of baseline accuracy under 4–6× compression—while drastically reducing GPU memory and decoding latency (Kim et al., 22 Sep 2025).
1. Problem Formulation: KV Cache Growth in LongConvQA
Autoregressive LLMs use a KV cache at each transformer layer to store past attention keys and values, such that for a dialogue of $N$ tokens, $L$ layers, and $H$ heads of dimension $d_{\text{head}}$, the cache scales as $O(N \cdot L \cdot H \cdot d_{\text{head}})$. In LongConvQA settings, where the dialogue length $N$ becomes very large (hundreds of turns or multi-day sessions), the memory required for storing KV caches quickly exhausts available GPU RAM. Traditional approaches like post-prefill eviction or query-dependent cache compression either induce unbounded peak memory or collapse relevant context, leading to degraded multi-turn accuracy. The goal is to satisfy, under a strict per-layer token budget $M$, the constraint
$$|\mathrm{KV}_\ell(q)| \le M \quad \text{for every layer } \ell \text{ and query } q$$
throughout prefill and decode, thus bounding peak memory across arbitrarily long conversations (Kim et al., 22 Sep 2025).
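To make the scaling concrete, here is a rough estimate of KV cache memory under fp16 storage; the layer, head, and head-dimension values below are illustrative assumptions, not the paper's configuration:

```python
# Toy KV cache size estimate: 2 tensors (K and V) * N tokens * L layers
# * H heads * d_head dims * 2 bytes (fp16). The model dimensions are
# illustrative assumptions, not the exact configuration from the paper.
def kv_cache_bytes(n_tokens, n_layers=28, n_heads=8, d_head=128, bytes_per=2):
    return 2 * n_tokens * n_layers * n_heads * d_head * bytes_per

full = kv_cache_bytes(100_000)   # unbounded: grows linearly with dialogue length
capped = kv_cache_bytes(4_000)   # a fixed per-layer token budget caps this
print(full / 2**30, capped / 2**30)  # sizes in GiB
```

The linear dependence on `n_tokens` is exactly the growth that the per-layer budget $M$ is meant to cap.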
2. EpiCache Framework Overview
EpiCache constrains KV cache growth using a combination of block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation:
- Block-wise Prefill Eviction processes the conversation in blocks of size $B$, evicting cache entries at each block boundary to keep the cache size at most $M$.
- Episodic KV Compression segments the conversation into topical episodes via clustering, building separate compressed KV caches for each.
- Adaptive Layer-wise Budget Allocation assigns memory budget per layer according to measured sensitivity to key eviction, preserving representational integrity where most critical.
This ensemble of techniques allows EpiCache to cap GPU memory, preserve topic-relevant context, and satisfy the requirements of multi-turn dialogue systems without retraining (Kim et al., 22 Sep 2025).
3. Key Algorithms and Techniques
3.1 Block-wise Prefill Eviction
Block-wise prefill divides the input into non-overlapping blocks of $B$ tokens. After each block's prefill, token importance scores $s_i$ are computed (method in Section 4), and only the top-$M$ tokens by $s_i$ are retained; the rest are evicted. The process guarantees that the peak cache never exceeds $M + B$ tokens per layer, regardless of the total dialogue history length.
Pseudocode summary:
```
Input: history H, total tokens N, budget M, block size B
KV_cache ← ∅
for t in 1…N step B:
    prefill B tokens x_{t:t+B-1}, update KV_cache
    compute s_i for tokens in KV_cache
    keep top-M tokens by s_i
end for
```
This method ensures a flat GPU memory profile as the dialogue length $N$ increases, essential for deployment on resource-constrained hardware (Kim et al., 22 Sep 2025).
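A minimal runnable sketch of the loop above, with a static per-token score standing in for the attention-derived $s_i$ (an assumption for illustration):

```python
def blockwise_prefill_evict(token_scores, budget_m, block_b):
    """Process tokens in blocks of block_b; at each block boundary,
    evict down to the budget_m highest-scoring cached tokens.
    token_scores is a stand-in for the attention-based importance s_i."""
    cache = []  # list of (token_index, score); peak length <= budget_m + block_b
    n = len(token_scores)
    for start in range(0, n, block_b):
        cache += [(i, token_scores[i]) for i in range(start, min(start + block_b, n))]
        cache.sort(key=lambda t: t[1], reverse=True)
        cache = cache[:budget_m]  # eviction at the block boundary
    return sorted(i for i, _ in cache)

# With monotonically increasing scores, the most recent tokens survive:
kept = blockwise_prefill_evict(list(range(100)), budget_m=10, block_b=20)
print(kept)  # indices of the 10 retained tokens
```

Note that the cache length never exceeds `budget_m + block_b` inside the loop, matching the flat-memory guarantee.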
3.2 Episodic KV Compression
3.2.1 Clustering into Episodes
Dialogue history is partitioned into segments, each spanning a fixed number of utterances. Segment embeddings are computed using a lightweight encoder (e.g., MiniLM, Qwen3). $K$-Means clusters are formed, and each cluster's medoid is selected as the representative prompt $p_k$.
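The segmentation step can be sketched with plain Lloyd's k-means plus medoid selection; the encoder is assumed external, so any $(n, d)$ embedding matrix works here, and the evenly spaced initialization is an illustrative choice for reproducibility:

```python
import numpy as np

def kmeans_medoids(embeddings, k, iters=20):
    """Cluster segment embeddings (n, d) with plain Lloyd's k-means and
    return (labels, medoids), where each medoid is the index of the real
    segment closest to its cluster centroid."""
    init = np.linspace(0, len(embeddings) - 1, k).astype(int)
    centroids = embeddings[init].astype(float).copy()
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = embeddings[labels == j].mean(axis=0)
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    labels = dists.argmin(axis=1)
    medoids = [int(np.argmin(np.where(labels == j, dists[:, j], np.inf)))
               for j in range(k)]
    return labels, medoids
```

Choosing the medoid (a real segment) rather than the mean centroid gives an actual utterance sequence that can serve as the representative prompt.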
3.2.2 Episode-Specific Eviction
For each episode $k$, block-wise prefill is rerun over the full history, appending the episode's representative prompt $p_k$ after each block to bias attention scoring toward the episode's topic. Token importance $s_i$ is computed from the attention the appended prompt pays to each cached token, ensuring selection of tokens most relevant to the corresponding episode. The top-$M$ tokens are retained to compose the episode's compressed cache.
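A toy sketch of episode-biased scoring, under the assumption that $s_i$ aggregates the softmax attention of the appended prompt over cached tokens (the paper's exact aggregation may differ):

```python
import numpy as np

def episode_scores(cached_keys, prompt_queries):
    """Score each cached token by the mean softmax attention the appended
    episode prompt pays to it (single head; the mean-over-prompt-positions
    aggregation is an assumption for this sketch).
    cached_keys: (n, d); prompt_queries: (p, d). Returns s of shape (n,)."""
    d = cached_keys.shape[1]
    logits = prompt_queries @ cached_keys.T / np.sqrt(d)   # (p, n) scaled dot-product
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                # row-wise softmax
    return attn.mean(axis=0)
```

Tokens whose keys align with the episode prompt's queries receive the largest $s_i$ and therefore survive eviction.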
3.2.3 Online Retrieval and Decoding
At inference, each query is embedded with the same encoder and matched to the closest episode centroid. The associated compressed cache is loaded into GPU memory and used during decoding.
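The retrieval step reduces to nearest-centroid matching; a minimal sketch using cosine similarity:

```python
import numpy as np

def retrieve_episode(query_emb, centroids):
    """Return the index of the episode whose centroid is most similar
    (cosine) to the query embedding; that episode's compressed KV cache
    is then loaded for decoding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))
```

In a serving loop this index would select among the precomputed per-episode caches, e.g. `caches[retrieve_episode(embed(query), centroids)]`.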
3.3 Adaptive Layer-wise Budget Allocation
Layer sensitivity to key eviction is determined by measuring representational drift:
$$\delta_\ell = \frac{\lVert \tilde{K}_\ell - K_\ell \rVert_F}{\lVert K_\ell \rVert_F},$$
where $\tilde{K}_\ell$ and $K_\ell$ denote key states with and without block-prefill eviction, respectively. Sharpening sensitivities via an exponent $\gamma$ (e.g., $\gamma = 1.3$), the total KV token budget $M_{\text{total}}$ is allocated as
$$M_\ell = M_{\text{total}} \cdot \frac{\delta_\ell^{\gamma}}{\sum_{\ell'} \delta_{\ell'}^{\gamma}},$$
assigning larger capacities to layers with higher sensitivity and thus maintaining QA performance (Kim et al., 22 Sep 2025).
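A sketch of sensitivity-proportional allocation; the exponent value and the rounding repair are illustrative choices:

```python
import numpy as np

def allocate_budgets(sensitivities, total_budget, gamma=1.3):
    """Distribute total_budget KV tokens across layers proportionally to
    sensitivity**gamma, so drift-sensitive layers get larger caches."""
    w = np.asarray(sensitivities, dtype=float) ** gamma
    shares = w / w.sum()
    budgets = np.floor(shares * total_budget).astype(int)
    budgets[np.argmax(shares)] += total_budget - budgets.sum()  # absorb rounding slack
    return budgets
```

The rounding repair simply hands the leftover tokens to the most sensitive layer so the per-layer budgets always sum exactly to the total.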
4. Theoretical Guarantees and Empirical Results
- Memory Guarantee: Peak KV cache size is bounded by the fixed per-layer budget $M$ (plus one block of $B$ tokens during prefill), ensuring constant GPU memory usage independent of conversation length.
- Empirical Accuracy Bound: Under compression ratios of 4–6×, EpiCache achieves up to 95% of full-KV QA accuracy across benchmarks.
- Latency and Memory Reduction: On LLaMA-3.2-3B, per-turn latency is reduced by up to 2.4× (from $68.9$ ms to $28.1$ ms), and peak memory by up to 3.5× (from $28.4$ GB to $8.2$ GB).
- Scalability: For contexts up to $100$K tokens, EpiCache approaches uncompressed performance, outperforming StreamingLLM, SnapKV, InfiniPot, KeyDiff, and KVzip-adapted baselines, which degrade at scale.
The following table summarizes benchmark results for EpiCache (LLaMA-3.2-3B):
| Benchmark | Accuracy Gain over Baseline | Compression Ratio | Latency Reduction (×) | Memory Reduction (×) |
|---|---|---|---|---|
| Realtalk, LoCoMo, LongMemEval | +20–40% absolute | 4–6× | up to 2.4 | up to 3.5 |
5. Implementation Details and Practical Considerations
- Block Size $B$: Larger $B$ yields faster coverage at the expense of higher temporary memory; smaller $B$ minimizes peak memory but slows prefill.
- Episode Count $K$: Higher $K$ (more episodes) can increase recall under tight memory but raises offline cache storage requirements.
- Embedding Strategies: Accuracy is not highly sensitive to the embedding model; compact encoders suffice for practical segmentation.
- Retrieval Overhead: Embedding, centroid matching, and KV loading contribute only a small fraction of per-turn latency, as episode switches are infrequent.
- Limitations: Fixed episode count $K$ and offline clustering; no support for dynamic or compressed/quantized KV storage. Performance depends on clustering quality and the selection of $K$.
6. Related Work
Contemporary baselines such as StreamingLLM, SnapKV, InfiniPot, KeyDiff, KVzip, and context manipulation techniques (e.g., PrefixLM, position-independent caching as in EPIC (Hu et al., 2024)) also target memory-constrained LLM inference. Unlike query-dependent strategies that collapse all context to a single query (often harming multi-turn coherence), EpiCache preserves topic- and episode-relevant context and amortizes cache management across episodes, achieving superior QA accuracy and resource efficiency.
7. Limitations and Future Directions
Current constraints of EpiCache include the static selection of the episode count $K$, reliance on offline clustering, and the lack of quantized/compact KV representations. Proposed future directions include dynamic adaptation of $K$, integration of KV-quantization schemes, and improvements in conversational segmentation. These extensions may further reduce memory requirements and enable finer-grained, context-adaptive compression without sacrificing QA fidelity (Kim et al., 22 Sep 2025). A plausible implication is that incorporating quantization or dynamic clustering could push the efficiency-accuracy frontier further in long-context LLM serving.