EpiCache Framework: Memory-Efficient KV Cache Management
- EpiCache is a training-free key-value cache management framework that combines block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation.
- It caps GPU memory growth in long-context conversational QA while retaining up to 95% of baseline accuracy under 4–6× compression.
- The framework significantly reduces decoding latency and memory usage, outperforming contemporary cache strategies in multi-turn dialogue settings.
EpiCache is a training-free Key-Value (KV) cache management framework designed to enable efficient, long-context conversational question answering (LongConvQA) with LLMs under strict and constant memory budgets. The framework addresses the unbounded, linear growth of KV caches during multi-turn, extended conversations by combining three core innovations: block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation. EpiCache maintains near-full-KV accuracy—retaining up to 95% of baseline accuracy under 4–6× compression—while drastically reducing GPU memory and decoding latency (Kim et al., 22 Sep 2025).
1. Problem Formulation: KV Cache Growth in LongConvQA
Autoregressive LLMs use a KV cache at each transformer layer to store past attention keys and values, such that for a dialogue of $N$ tokens, $L$ layers, and $H$ heads of dimension $d_{\text{head}}$, the cache scales as $O(N \cdot L \cdot H \cdot d_{\text{head}})$. In LongConvQA settings, where the dialogue length $N$ becomes very large (hundreds of turns or multi-day sessions), the memory required for storing KV caches quickly exhausts available GPU RAM. Traditional approaches like post-prefill eviction or query-dependent cache compression either induce unbounded peak memory or collapse relevant context, leading to degraded multi-turn accuracy. The goal is to satisfy, under a strict per-layer token budget $M$, the constraint
$$|\mathrm{KV}_\ell(q)| \le M \quad \text{for every layer } \ell \text{ and query } q$$
throughout prefill and decode, thus bounding peak memory across arbitrarily long conversations (Kim et al., 22 Sep 2025).
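To make the scaling concrete, here is a rough estimate of KV cache memory under fp16 storage; the layer, head, and head-dimension values below are illustrative assumptions, not the paper's configuration:

```python
# Toy KV cache size estimate: 2 tensors (K and V) * N tokens * L layers
# * H heads * d_head dims * 2 bytes (fp16). The model dimensions are
# illustrative assumptions, not the exact configuration from the paper.
def kv_cache_bytes(n_tokens, n_layers=28, n_heads=8, d_head=128, bytes_per=2):
    return 2 * n_tokens * n_layers * n_heads * d_head * bytes_per

full = kv_cache_bytes(100_000)   # unbounded: grows linearly with dialogue length
capped = kv_cache_bytes(4_000)   # a fixed per-layer token budget caps this
print(full / 2**30, capped / 2**30)  # sizes in GiB
```

The linear dependence on `n_tokens` is exactly the growth that the per-layer budget $M$ is meant to cap.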
2. EpiCache Framework Overview
EpiCache constrains KV cache growth using a combination of block-wise prefill eviction, episodic KV compression, and adaptive layer-wise budget allocation:
- Block-wise Prefill Eviction processes the conversation in blocks of size $B$, evicting cache entries at each block boundary to keep the cache size at most $M$.
- Episodic KV Compression segments the conversation into topical episodes via clustering, building separate compressed KV caches for each.
- Adaptive Layer-wise Budget Allocation assigns memory budget per layer according to measured sensitivity to key eviction, preserving representational integrity where most critical.
This ensemble of techniques allows EpiCache to cap GPU memory, preserve topic-relevant context, and satisfy the requirements of multi-turn dialogue systems without retraining (Kim et al., 22 Sep 2025).
3. Key Algorithms and Techniques
3.1 Block-wise Prefill Eviction
Block-wise prefill divides the input into non-overlapping blocks of $B$ tokens. After each block's prefill, token importance scores $s_i$ are computed (method in Section 4), and only the top-$M$ tokens by $s_i$ are retained; the rest are evicted. The process guarantees that the peak cache never exceeds $M + B$ tokens per layer, regardless of the total dialogue history length.
Pseudocode summary:
```
Input: history H, total tokens N, budget M, block size B
KV_cache ← ∅
for t in 1…N step B:
    prefill B tokens x_{t:t+B-1}, update KV_cache
    compute s_i for tokens in KV_cache
    keep top-M tokens by s_i
end for
```
This method ensures a flat GPU memory profile as the dialogue length $N$ increases, essential for deployment on resource-constrained hardware (Kim et al., 22 Sep 2025).
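A minimal runnable sketch of the loop above, with a static per-token score standing in for the attention-derived $s_i$ (an assumption for illustration):

```python
def blockwise_prefill_evict(token_scores, budget_m, block_b):
    """Process tokens in blocks of block_b; at each block boundary,
    evict down to the budget_m highest-scoring cached tokens.
    token_scores is a stand-in for the attention-based importance s_i."""
    cache = []  # list of (token_index, score); peak length <= budget_m + block_b
    n = len(token_scores)
    for start in range(0, n, block_b):
        cache += [(i, token_scores[i]) for i in range(start, min(start + block_b, n))]
        cache.sort(key=lambda t: t[1], reverse=True)
        cache = cache[:budget_m]  # eviction at the block boundary
    return sorted(i for i, _ in cache)

# With monotonically increasing scores, the most recent tokens survive:
kept = blockwise_prefill_evict(list(range(100)), budget_m=10, block_b=20)
print(kept)  # indices of the 10 retained tokens
```

Note that the cache length never exceeds `budget_m + block_b` inside the loop, matching the flat-memory guarantee.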
3.2 Episodic KV Compression
3.2.1 Clustering into Episodes
Dialogue history is partitioned into segments, each spanning a fixed number of utterances. Segment embeddings are computed using a lightweight encoder (e.g., MiniLM, Qwen3). $K$-Means clusters are formed, and each cluster's medoid is selected as the representative prompt $p_k$.
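The segmentation step can be sketched with plain Lloyd's k-means plus medoid selection; the encoder is assumed external, so any $(n, d)$ embedding matrix works here, and the evenly spaced initialization is an illustrative choice for reproducibility:

```python
import numpy as np

def kmeans_medoids(embeddings, k, iters=20):
    """Cluster segment embeddings (n, d) with plain Lloyd's k-means and
    return (labels, medoids), where each medoid is the index of the real
    segment closest to its cluster centroid."""
    init = np.linspace(0, len(embeddings) - 1, k).astype(int)
    centroids = embeddings[init].astype(float).copy()
    for _ in range(iters):
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = embeddings[labels == j].mean(axis=0)
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=-1)
    labels = dists.argmin(axis=1)
    medoids = [int(np.argmin(np.where(labels == j, dists[:, j], np.inf)))
               for j in range(k)]
    return labels, medoids
```

Choosing the medoid (a real segment) rather than the mean centroid gives an actual utterance sequence that can serve as the representative prompt.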
3.2.2 Episode-Specific Eviction
For each episode $k$, block-wise prefill is rerun over the full history, appending the episode's representative prompt $p_k$ after each block to bias attention scoring toward the episode's topic. Token importance $s_i$ is computed from the attention the appended prompt pays to each cached token, ensuring selection of tokens most relevant to the corresponding episode. The top-$M$ tokens are retained to compose the episode's compressed cache.
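A toy sketch of episode-biased scoring, under the assumption that $s_i$ aggregates the softmax attention of the appended prompt over cached tokens (the paper's exact aggregation may differ):

```python
import numpy as np

def episode_scores(cached_keys, prompt_queries):
    """Score each cached token by the mean softmax attention the appended
    episode prompt pays to it (single head; the mean-over-prompt-positions
    aggregation is an assumption for this sketch).
    cached_keys: (n, d); prompt_queries: (p, d). Returns s of shape (n,)."""
    d = cached_keys.shape[1]
    logits = prompt_queries @ cached_keys.T / np.sqrt(d)   # (p, n) scaled dot-product
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                # row-wise softmax
    return attn.mean(axis=0)
```

Tokens whose keys align with the episode prompt's queries receive the largest $s_i$ and therefore survive eviction.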
3.2.3 Online Retrieval and Decoding
At inference, each query is embedded with the same encoder and matched to the closest episode centroid. The associated compressed cache is loaded into GPU memory and used during decoding.
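The retrieval step reduces to nearest-centroid matching; a minimal sketch using cosine similarity:

```python
import numpy as np

def retrieve_episode(query_emb, centroids):
    """Return the index of the episode whose centroid is most similar
    (cosine) to the query embedding; that episode's compressed KV cache
    is then loaded for decoding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(c @ q))
```

In a serving loop this index would select among the precomputed per-episode caches, e.g. `caches[retrieve_episode(embed(query), centroids)]`.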
3.3 Adaptive Layer-wise Budget Allocation
Layer sensitivity to key eviction is determined by measuring representational drift:
$$\delta_\ell = \frac{\lVert \tilde{K}_\ell - K_\ell \rVert_F}{\lVert K_\ell \rVert_F},$$
where $\tilde{K}_\ell$ and $K_\ell$ denote key states with and without block-prefill eviction, respectively. Sharpening sensitivities via an exponent $\gamma$ (e.g., $\gamma = 1.3$), the total KV token budget $M_{\text{total}}$ is allocated as
$$M_\ell = M_{\text{total}} \cdot \frac{\delta_\ell^{\gamma}}{\sum_{\ell'} \delta_{\ell'}^{\gamma}},$$
assigning larger capacities to layers with higher sensitivity and thus maintaining QA performance (Kim et al., 22 Sep 2025).
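A sketch of sensitivity-proportional allocation; the exponent value and the rounding repair are illustrative choices:

```python
import numpy as np

def allocate_budgets(sensitivities, total_budget, gamma=1.3):
    """Distribute total_budget KV tokens across layers proportionally to
    sensitivity**gamma, so drift-sensitive layers get larger caches."""
    w = np.asarray(sensitivities, dtype=float) ** gamma
    shares = w / w.sum()
    budgets = np.floor(shares * total_budget).astype(int)
    budgets[np.argmax(shares)] += total_budget - budgets.sum()  # absorb rounding slack
    return budgets
```

The rounding repair simply hands the leftover tokens to the most sensitive layer so the per-layer budgets always sum exactly to the total.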
4. Theoretical Guarantees and Empirical Results
- Memory Guarantee: Peak KV cache size is bounded by the fixed per-layer budget $M$ (plus one block of $B$ tokens during prefill), ensuring constant GPU memory usage independent of conversation length.
- Empirical Accuracy Bound: Under compression ratios of 4–6×, EpiCache achieves up to 95% of full-KV QA accuracy across benchmarks.
- Latency and Memory Reduction: On LLaMA-3.2-3B, per-turn latency is reduced by up to 2.4× (from $68.9$ ms to $28.1$ ms), and peak memory by up to 3.5× (from $28.4$ GB to $8.2$ GB).
- Scalability: For contexts up to $100$K tokens, EpiCache approaches uncompressed performance, outperforming StreamingLLM, SnapKV, InfiniPot, KeyDiff, and KVzip-adapted baselines, which degrade at scale.
The following table summarizes benchmark results for EpiCache (LLaMA-3.2-3B):
| Benchmark | Accuracy Gain over Baseline | Compression Ratio | Latency Reduction (×) | Memory Reduction (×) |
|---|---|---|---|---|
| Realtalk, LoCoMo, LongMemEval | +20–40% absolute | 4–6× | up to 2.4 | up to 3.5 |
5. Implementation Details and Practical Considerations
- Block Size $B$: Larger $B$ yields faster coverage at the expense of higher temporary memory; smaller $B$ minimizes peak memory but slows prefill.
- Episode Count $K$: Higher $K$ (more episodes) can increase recall under tight memory but raises offline cache storage requirements.
- Embedding Strategies: Accuracy is not highly sensitive to the embedding model; compact encoders suffice for practical segmentation.
- Retrieval Overhead: Embedding, centroid matching, and KV loading contribute only a small fraction of per-turn latency, as episode switches are infrequent.
- Limitations: Fixed episode count $K$ and offline clustering; no support for dynamic or compressed/quantized KV storage. Performance depends on clustering quality and the selection of $K$.
6. Related Work
Contemporary baselines such as StreamingLLM, SnapKV, InfiniPot, KeyDiff, KVzip, and context manipulation techniques (e.g., PrefixLM, position-independent caching as in EPIC (Hu et al., 2024)) also target memory-constrained LLM inference. Unlike query-dependent strategies that collapse all context to a single query (often harming multi-turn coherence), EpiCache preserves topic- and episode-relevant context and amortizes cache management across episodes, achieving superior QA accuracy and resource efficiency.
7. Limitations and Future Directions
Current constraints of EpiCache include the static selection of the episode count $K$, reliance on offline clustering, and the lack of quantized/compact KV representations. Proposed future directions include dynamic adaptation of $K$, integration of KV-quantization schemes, and improvements in conversational segmentation. These extensions may further reduce memory requirements and enable finer-grained, context-adaptive compression without sacrificing QA fidelity (Kim et al., 22 Sep 2025). A plausible implication is that incorporating quantization or dynamic clustering could push the efficiency-accuracy frontier further in long-context LLM serving.