EpiCache Framework: Memory-Efficient KV Cache Management

Updated 17 February 2026
  • EpiCache is a training-free key-value cache management framework that combines block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation.
  • It caps GPU memory growth in long-context conversational QA while maintaining over 95% baseline accuracy under 4–6× compression.
  • The framework significantly reduces decoding latency and memory usage, outperforming contemporary cache strategies in multi-turn dialogue settings.

EpiCache is a training-free Key-Value (KV) cache management framework designed to enable efficient, long-context conversational question answering (LongConvQA) with LLMs under strict and constant memory budgets. The framework addresses the unbounded, linear growth of KV caches during multi-turn, extended conversations by combining three core innovations: block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation. EpiCache maintains near-full-KV accuracy—retaining up to 95% of baseline accuracy under 4–6× compression—while drastically reducing GPU memory and decoding latency (Kim et al., 22 Sep 2025).

1. Problem Formulation: KV Cache Growth in LongConvQA

Autoregressive LLMs use a KV cache at each transformer layer to store past attention keys and values, so that for a dialogue of $N$ tokens, $L$ layers, and $H$ heads, the cache scales as $\|KV\| = L \times H \times N$. In LongConvQA settings, where the dialogue length $N$ becomes very large (hundreds of turns or multi-day sessions), the memory required for storing KV caches quickly exhausts available GPU RAM: $M_{\text{peak}} \approx O(N)$. Traditional approaches like post-prefill eviction or query-dependent cache compression either induce unbounded peak memory or collapse relevant context, leading to degraded multi-turn accuracy. The goal is to satisfy, under a strict per-layer token budget $M$, that for all queries $q_i$,

$$f_{\text{LLM}}(q_i \mid KV_H^C) \approx f_{\text{LLM}}(q_i \mid KV_H), \quad \text{with} \quad \|KV_H^C\| \leq L \times H \times M,$$

throughout prefill and decode, thus bounding peak memory across arbitrarily long conversations (Kim et al., 22 Sep 2025).
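
To make the linear growth concrete, the sketch below estimates cache size in bytes. It goes beyond the entry count $L \times H \times N$ by also multiplying in a per-head dimension, a factor of 2 for keys plus values, and a byte width. The model shape (28 layers, 8 KV heads, head dimension 128, 16-bit values) and the token counts are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to hold keys and values for `num_tokens` cached tokens."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128

# Cache for a full 100k-token history versus a fixed budget of M = 4000 tokens.
full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=100_000)
capped = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, num_tokens=4_000)

print(f"full history : {full / 2**30:.2f} GiB")
print(f"budget M=4000: {capped / 2**30:.2f} GiB")
```

The first number grows with every new turn, while the second stays fixed however long the conversation runs, which is the property the budget constraint above is meant to enforce.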

2. EpiCache Framework Overview

EpiCache constrains KV cache growth using a combination of block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation:

  • Block-wise Prefill Eviction processes the conversation in blocks of size $B$, evicting cache entries to maintain size $M$ at each block boundary.
  • Episodic KV Compression segments the conversation into a fixed number of topical episodes via clustering, building separate compressed KV caches for each.
  • Adaptive Layer-wise Budget Allocation assigns memory budget per layer according to measured sensitivity to key eviction, preserving representational integrity where most critical.

This ensemble of techniques allows EpiCache to cap GPU memory, preserve topic-relevant context, and satisfy the requirements of multi-turn dialogue systems without retraining (Kim et al., 22 Sep 2025).

3. Key Algorithms and Techniques

3.1 Block-wise Prefill Eviction

Block-wise prefill divides the input into non-overlapping blocks of $B$ tokens. After each block is prefilled, token importance scores are computed (see Section 3.2.2), and only the top-$M$ tokens by score are retained; the rest are evicted. The process guarantees that the peak cache size stays within a fixed bound set by $M$ and $B$, regardless of the total dialogue history length.

In summary: for each block, prefill its $B$ tokens, score the cached tokens, and evict all but the top $M$ before moving to the next block.

This method ensures a flat GPU memory profile as $N$ increases, which is essential for deployment on resource-constrained hardware (Kim et al., 22 Sep 2025).
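
The block-wise loop can be illustrated with a small simulation. This is a minimal sketch rather than the authors' implementation: the cache is a plain list of words, `blockwise_prefill` and `toy_score` are hypothetical names, and the word-length score is a stand-in for real attention-based importance.

```python
import heapq


def blockwise_prefill(tokens, block_size, budget, score_fn):
    """Process `tokens` block by block, keeping at most `budget` entries between blocks.

    Returns the retained tokens (in original order) and the peak cache size observed.
    """
    cache = []  # (position, token) pairs currently held
    peak = 0
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # "Prefill" the block: its entries join the cache before any eviction.
        cache.extend((start + i, tok) for i, tok in enumerate(block))
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Keep the `budget` highest-scoring entries, then restore position order.
            kept = heapq.nlargest(budget, cache, key=lambda e: score_fn(e[0], e[1]))
            cache = sorted(kept)
    return [tok for _, tok in cache], peak


def toy_score(position, token):
    # Assumption for illustration only: longer words count as more important.
    return len(token)


history = ("the user said they adopted a greyhound named Biscuit "
           "and later asked about veterinary insurance options").split()
kept, peak = blockwise_prefill(history, block_size=4, budget=6, score_fn=toy_score)
print(kept)
print("peak cache size:", peak)
```

In this toy setup the cache briefly holds the retained entries plus one incoming block before eviction, so its peak depends on the budget and the block size but not on the length of `history`.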

3.2 Episodic KV Compression

3.2.1 Clustering into Episodes

Dialogue history $H$ is partitioned into segments, each containing a group of utterances. Segment embeddings are computed using a lightweight encoder (e.g., MiniLM, Qwen3). K-Means clustering then groups the embeddings into episodes, and each cluster's medoid is selected as the representative prompt for that episode.
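
The clustering step can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions: the 2-D "embeddings" are hand-written stand-ins for encoder outputs, the initial centroids are fixed for determinism, and `assign`, `kmeans`, and `medoid` are hypothetical helper names rather than functions from the paper's code.

```python
import math


def assign(points, centroids):
    """Group point indices by their nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for idx, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        clusters[nearest].append(idx)
    return clusters


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's k-means; `centroids` supplies the initial centers."""
    dims = len(points[0])
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            [sum(points[i][d] for i in members) / len(members) for d in range(dims)]
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids, assign(points, centroids)


def medoid(points, members):
    """Index of the member with the smallest total distance to the other members."""
    return min(members, key=lambda i: sum(math.dist(points[i], points[j]) for j in members))


# Toy segments with made-up 2-D embeddings (a real encoder would produce these).
segments = ["adopting a dog", "dog food brands", "vet visit",
            "flight to Lisbon", "hotel booking", "Lisbon restaurants"]
embeddings = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.0), (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]

centroids, clusters = kmeans(embeddings, centroids=[embeddings[0], embeddings[3]])
representatives = [segments[medoid(embeddings, members)] for members in clusters]
print(clusters)
print(representatives)
```

Using the medoid rather than the centroid means the representative is an actual segment of dialogue, which is what allows it to serve as a prompt.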

3.2.2 Episode-Specific Eviction

For each episode, block-wise prefill is rerun over the full history, appending the episode's representative prompt after each block to bias attention scoring toward the episode's topic. Token importance is then computed from these prompt-conditioned attention scores, ensuring selection of the tokens most relevant to the corresponding episode. The top-$M$ tokens are retained to compose the episode's compressed cache.
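
One way to turn prompt-conditioned attention into a selection rule is sketched below. The aggregation used here (a cached token's score is the largest attention weight it receives from any appended-prompt token) is an assumption made for illustration and may differ from the paper's scoring formula; `select_by_prompt_attention` and the toy attention matrix are hypothetical.

```python
def select_by_prompt_attention(attn, budget):
    """Pick the `budget` cached tokens that the appended prompt attends to most.

    attn[p][t] is the attention weight prompt token p places on cached token t.
    Returns the kept token indices in position order, plus the per-token scores.
    """
    num_cached = len(attn[0])
    # Assumed aggregation: take the maximum weight over all prompt tokens.
    scores = [max(row[t] for row in attn) for t in range(num_cached)]
    top = sorted(range(num_cached), key=lambda t: scores[t], reverse=True)[:budget]
    return sorted(top), scores


# Toy attention weights: 2 prompt tokens attending over 5 cached tokens.
attn = [
    [0.05, 0.40, 0.05, 0.30, 0.20],
    [0.10, 0.10, 0.50, 0.10, 0.20],
]
kept, scores = select_by_prompt_attention(attn, budget=3)
print(kept, scores)
```

Because the prompt differs per episode, the same history yields a different set of retained tokens for each episode's cache.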

3.2.3 Online Retrieval and Decoding

At inference, each query $q_i$ is embedded and matched to the closest episode centroid. The associated compressed cache is loaded into GPU memory and used during decoding.
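
A minimal sketch of the retrieval step follows. It assumes cosine similarity as the notion of "closest" (the distance metric is not specified above), uses strings as stand-ins for compressed caches, and counts cache switches in place of real host-to-GPU transfers. `EpisodeCacheStore` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def match_episode(query_embedding, centroids):
    """Index of the centroid most similar to the query embedding."""
    return max(range(len(centroids)), key=lambda e: cosine(query_embedding, centroids[e]))


class EpisodeCacheStore:
    """Holds one compressed cache per episode and tracks which one is active."""

    def __init__(self, centroids, caches):
        self.centroids = centroids
        self.caches = caches
        self.active = None
        self.loads = 0

    def cache_for(self, query_embedding):
        episode = match_episode(query_embedding, self.centroids)
        if episode != self.active:
            # Stand-in for copying the episode's cache onto the GPU.
            self.active = episode
            self.loads += 1
        return self.caches[episode]


store = EpisodeCacheStore(
    centroids=[(0.9, 0.1), (0.1, 0.9)],
    caches=["<pets cache>", "<travel cache>"],
)
answers = [store.cache_for(q) for q in [(1.0, 0.0), (0.8, 0.3), (0.0, 1.0)]]
print(answers, "loads:", store.loads)
```

Consecutive queries on the same topic reuse the active cache, so a load is only paid when the conversation switches episodes.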

3.3 Adaptive Layer-wise Budget Allocation

Layer sensitivity to key eviction is determined by measuring representational drift, that is, how far each layer's key states under block-prefill eviction deviate from the key states obtained without eviction. The sensitivities are sharpened by raising them to an exponent, and the total KV token budget is then divided across layers according to the sharpened values, assigning larger capacities to layers with higher sensitivity and thus maintaining QA performance (Kim et al., 22 Sep 2025).
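
The allocation idea can be sketched as a proportional split. This assumes each layer receives a share of the total budget proportional to its sensitivity raised to an exponent, which is one natural reading of the description above and not necessarily the paper's exact formula. The function name, the sensitivity values, and the exponent values are all hypothetical.

```python
def allocate_layer_budgets(sensitivities, total_budget, alpha):
    """Split `total_budget` tokens across layers in proportion to sensitivity**alpha."""
    weights = [s ** alpha for s in sensitivities]
    norm = sum(weights)
    raw = [total_budget * w / norm for w in weights]
    budgets = [int(r) for r in raw]  # round down first
    # Hand any leftover tokens to the layers with the largest fractional parts,
    # so the per-layer budgets always sum to exactly `total_budget`.
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


# Made-up sensitivities for a 4-layer model with 1000 tokens per layer on average.
sensitivities = [1, 2, 3, 4]
flat = allocate_layer_budgets(sensitivities, total_budget=4000, alpha=1.0)
sharp = allocate_layer_budgets(sensitivities, total_budget=4000, alpha=2.0)
print("alpha=1:", flat)
print("alpha=2:", sharp)
```

A larger exponent concentrates more of the budget on the most sensitive layers, while the total stays fixed, so the overall memory bound is unchanged.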

4. Theoretical Guarantees and Empirical Results

  • Memory Guarantee: Peak KV cache size is bounded by a constant set by the per-layer budget $M$ and block size $B$, so GPU memory usage is independent of conversation length.
  • Empirical Accuracy Bound: Under compression ratios of 4–6×, EpiCache retains about 95% of full-KV QA accuracy across benchmarks.
  • Latency and Memory Reduction: On LLaMA-3.2-3B, per-turn latency is reduced by up to 2.4× and peak memory by up to 3.5×.
  • Scalability: At long context lengths, EpiCache approaches uncompressed performance, outperforming StreamingLLM, SnapKV, InfiniPot, KeyDiff, and KVzip-adapted baselines, which degrade at scale.

The following table summarizes benchmark results for EpiCache on LLaMA-3.2-3B:

| Benchmarks | Accuracy Gain over Baseline | Compression Ratio | Latency Reduction (×) | Memory Reduction (×) |
|---|---|---|---|---|
| Realtalk, LoCoMo, LongMemEval | +20–40% absolute | 4–6× | up to 2.4 | up to 3.5 |

(Kim et al., 22 Sep 2025)

5. Implementation Details and Practical Considerations

  • Block Size $B$: Larger $B$ yields faster coverage at the expense of higher temporary memory; smaller $B$ minimizes peak memory but slows prefill.
  • Episode Count: A higher episode count can increase recall under tight memory but raises offline cache storage requirements.
  • Embedding Strategies: Accuracy is not highly sensitive to the embedding model; compact encoders suffice for practical segmentation.
  • Retrieval Overhead: Embedding, centroid matching, and KV loading contribute only a small share of per-turn latency, as episode switches are infrequent.
  • Limitations: The episode count is fixed and clustering is performed offline; there is no support for a dynamic episode count or for compressed/quantized KV storage. Performance depends on clustering quality and on how the episode count is chosen.

6. Comparison with Contemporary Approaches

Contemporary baselines such as StreamingLLM, SnapKV, InfiniPot, KeyDiff, KVzip, and context manipulation techniques (e.g., PrefixLM, position-independent caching as in EPIC (Hu et al., 2024)) also target memory-constrained LLM inference. Unlike query-dependent strategies that collapse all context to a single query (often harming multi-turn coherence), EpiCache preserves topic- and episode-relevant context and amortizes cache management across episodes, achieving superior QA accuracy and resource efficiency.

7. Limitations and Future Directions

Current constraints of EpiCache include the static selection of the episode count, reliance on offline clustering, and the lack of quantized/compact KV representations. Proposed future research directions include dynamic adaptation of the episode count, integration of KV-quantization schemes, and improvements in conversational segmentation. These extensions may further reduce memory requirements and enable finer-grained, context-adaptive compression without sacrificing QA fidelity (Kim et al., 22 Sep 2025). A plausible implication is that incorporating quantization or dynamic clustering could push the efficiency-accuracy frontier further in long-context LLM serving.
