PaFu-KV: Past & Future-Informed Cache Policy
- The paper demonstrates that integrating retrospective attention scores with prospective predictions reduces KV memory overhead in transformer models.
- It leverages methods like Golden Eviction, Expected Attention, and distillation-based salience estimation to optimize cache retention for long-context tasks.
- Empirical results show enhanced cache efficiency and throughput in applications ranging from language reasoning to multi-agent workflows and video generation.
A Past- and Future-Informed KV Cache Policy (PaFu-KV) is a class of cache management strategies for transformer-based sequence models that selectively retains or evicts key–value (KV) memory entries based on both retrospective (past) and prospective (future) indicators of their semantic or functional importance. This policy synthesizes information from attention patterns, temporal reuse predictions, and model-internal salience scoring to minimize memory and bandwidth consumption during inference or generation, especially in tasks demanding long-context reasoning, multi-agent orchestration, or autoregressive video synthesis. PaFu-KV policies are realized via a diverse set of mechanisms—ranging from supervised and reinforcement learning to closed-form attention expectation computation and distillation-based salience estimation—with broad applicability across language, workflow, and video domains.
1. Motivation and Core Challenges
In transformer models, the KV cache linearly accumulates key and value tensors with each generated token or timestep, quickly resulting in substantial memory (e.g., 4.5 GB per instance at 32K tokens in Qwen3-4B) and compute overhead. Naïve retention (no eviction) ensures full context but severely limits throughput and feasible batch size. Uninformed or heuristic eviction (e.g., FIFO, LRU, or position-based thresholds) frequently discards tokens that, while seemingly unimportant immediately, can be essential for maintaining long-term dependencies, semantic coherence, or agentic control (Dong et al., 3 Feb 2026, Pan et al., 10 Jul 2025, Chen et al., 29 Jan 2026). This is especially acute for:
- Long-sequence reasoning and math models: Complex dependencies across reasoning traces and head-specific semantic patterns cannot be captured by local recency/frequency alone.
- Multi-agent workflows: Hierarchical scheduling and data/control dependencies among agent invocations require anticipatory cache management to avoid recomputation (Pan et al., 10 Jul 2025).
- Autoregressive video diffusion: Sparse but persistent long-term spatiotemporal interactions between frames can be obscured by indiscriminate cache trimming, degrading temporal coherence (Chen et al., 29 Jan 2026).
A central challenge is the discrepancy between past and future importance—tokens largely ignored in past attention may be essential later, and purely forward-looking schemes are intractable or impossible in causal inference. PaFu-KV policies address this by integrating both retrospective and prospective signals.
2. Algorithmic Foundations and Principal Methods
2.1 Future-Informed Scoring: Golden Eviction and Expected Attention
Several works derive “future-informed” optimality by direct or estimated inspection of future attention:
- Golden Eviction computes an oracle eviction trace by leveraging future attention matrices across full (prompt + generation) traces. Each candidate KV pair is scored by the maximal pooled attention it will receive at any subsequent step; the policy retains the top-K pairs under a memory budget (Dong et al., 3 Feb 2026).
- Expected Attention (Devoto et al., 1 Oct 2025) resolves the inaccessibility of future query vectors by modeling their distribution as Gaussian, $q \sim \mathcal{N}(\mu, \Sigma)$, then calculating the closed-form expected attention weight using the Gaussian moment-generating function:

$$\mathbb{E}_{q}\!\left[\exp\!\left(\tfrac{q^{\top} k}{\sqrt{d}}\right)\right] = \exp\!\left(\tfrac{\mu^{\top} k}{\sqrt{d}} + \tfrac{k^{\top} \Sigma\, k}{2d}\right)$$

This yields a normalized attention proxy for each key $k$ and enables compression without direct access to future tokens.
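The expected-attention proxy described above can be sketched in a few lines, assuming the query mean $\mu$ and covariance $\Sigma$ are estimated from queries observed so far (this is a minimal illustration, not the paper's implementation):

```python
import numpy as np

def expected_attention(keys, mu, sigma):
    """keys: (n, d) cached key vectors; mu: (d,) mean of the assumed Gaussian
    query distribution; sigma: (d, d) its covariance.
    Returns a normalized proxy for the attention each key is expected to
    receive from future queries, via the Gaussian moment-generating function."""
    d = keys.shape[1]
    # log E[exp(q.k / sqrt(d))] = mu.k / sqrt(d) + k^T Sigma k / (2d)
    logits = keys @ mu / np.sqrt(d) + 0.5 * np.einsum('nd,de,ne->n', keys, sigma, keys) / d
    w = np.exp(logits - logits.max())   # numerically stable normalization
    return w / w.sum()
```

Keys better aligned with the estimated query mean receive higher expected attention and are therefore preferentially retained.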
PaFu-KV generalizes these by combining such future-oriented proxies with past-statistics or by incorporating them in learning-based eviction policies.
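For comparison, the Golden Eviction oracle described above can be sketched under the assumption that the full attention trace over prompt and generation is available offline (as it is only during oracle construction, not at inference):

```python
import numpy as np

def golden_eviction_scores(attn, step):
    """attn: (T, T) causal attention matrix over the full prompt+generation
    trace. Scores each cached KV index j < step by the maximum attention it
    will receive at any subsequent step (the oracle signal)."""
    future = attn[step + 1:, :step]              # rows: future steps, cols: cached keys
    return future.max(axis=0) if future.size else np.zeros(step)

def golden_keep_set(attn, step, budget):
    """Retain the top-K cached KV indices under the memory budget."""
    scores = golden_eviction_scores(attn, step)
    return set(np.argsort(scores)[-budget:])
```

Because this oracle needs the full future attention matrix, it serves as a training target for learned predictors rather than a deployable policy.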
2.2 Past-Informed Components
To capture retention value based on realized attention history, PaFu-KV also maintains:
- Running or aggregated past attention scores: e.g., exponentially moving averages or block maxima of the historical attention paid to each cached key (Devoto et al., 1 Oct 2025).
- Statistics derived from kernel density estimation or frequency/recency predictions: As in DEAP Cache, which models both future and historical cache characteristics in a multi-task LSTM pipeline (Mangal et al., 2020).
The fusion may take the form $s_j = \lambda\, s_j^{\text{past}} + (1-\lambda)\, s_j^{\text{future}}$, with $\lambda$ hyperparameterized or head-adaptive.
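A hedged sketch of one such fusion, combining a decayed average of realized attention with any forward-looking proxy (the decay and mixing weights here are illustrative, not taken from a specific paper):

```python
import numpy as np

def fused_retention_scores(past_attn_history, future_proxy, lam=0.5, decay=0.9):
    """past_attn_history: (t, n) attention each of n cached keys received over
    the last t steps; future_proxy: (n,) forward-looking score (e.g. expected
    attention). Returns s = lam * s_past + (1 - lam) * s_future per key."""
    t = past_attn_history.shape[0]
    weights = decay ** np.arange(t - 1, -1, -1)        # recent steps weigh more
    s_past = weights @ past_attn_history / weights.sum()
    return lam * s_past + (1.0 - lam) * future_proxy
```

Setting `lam` per head would recover the head-adaptive variant mentioned above.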
2.3 Distillation and Salience Estimation
In non-language domains, PaFu-KV can be instantiated by distilling token- or frame-level salience from a bidirectional teacher into a lightweight student module:
- Salience Estimation Head (SEH): Trained to align its predicted salience with block-wise teacher attributions computed from full-sequence self-attention matrices; used to select the top-K tokens for cache retention (Chen et al., 29 Jan 2026).
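A minimal sketch of the distillation objective and top-K selection, assuming the student emits per-token salience logits and the teacher provides block-wise attributions (function names and the cross-entropy form are illustrative assumptions, not the paper's exact loss):

```python
import numpy as np

def seh_distill_loss(student_logits, teacher_attr):
    """Cross-entropy between the teacher's normalized block-wise attributions
    and the student's predicted salience distribution."""
    p = teacher_attr / teacher_attr.sum()
    m = student_logits.max()                                   # stable log-softmax
    log_q = student_logits - m - np.log(np.exp(student_logits - m).sum())
    return -(p * log_q).sum()

def retain_top_k(student_logits, k):
    """Indices of the k tokens/frames the student deems most salient."""
    return np.argsort(student_logits)[-k:]
```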
3. Learning Frameworks: Supervision and Reinforcement
3.1 Supervised Distillation
Policies such as Golden Eviction are distilled into parameterized predictors (e.g., compact MLPs) which learn to estimate contribution scores or eviction eligibility using features extracted from KV tensors and attention-derived statistics (Dong et al., 3 Feb 2026). Training employs pairwise ranking loss enforcing oracle-order equivalence among candidate KV pairs.
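The pairwise ranking objective can be sketched as a logistic loss over oracle-ordered pairs (a minimal form; the paper's exact features and loss may differ):

```python
import numpy as np

def pairwise_ranking_loss(pred, oracle):
    """pred, oracle: (n,) predicted and oracle contribution scores.
    Penalizes every pair where pred orders two KV candidates differently
    from the oracle, via a logistic loss on the score margin."""
    margin = pred[:, None] - pred[None, :]   # margin[i, j] = pred[i] - pred[j]
    mask = oracle[:, None] > oracle[None, :]  # pairs the oracle ranks i above j
    return np.log1p(np.exp(-margin[mask])).mean() if mask.any() else 0.0
```

A predictor matching the oracle's ordering drives all masked margins positive, so the loss approaches zero.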
3.2 Online Policy Optimization
Eviction is also cast as a Markov Decision Process (MDP) to accommodate distributional drift between training (oracle labels) and actual inference dynamics:
- States: Current retained KV indices.
- Actions: Selection of which KV indices to keep under the budget.
- Rewards: Penalties such as mean-squared increase in modeling loss, particularly on low-entropy tokens (bottom 80%) which are most prone to loss spikes if crucial context is discarded.
The Group Relative Policy Optimization (GRPO) algorithm is employed to refine the supervised policy, using clipped surrogate objectives and KL-divergence regularization for stable improvement (Dong et al., 3 Feb 2026). Similar approaches appear in DEAP Cache with regret-minimization between frequency- and recency-based “experts” (Mangal et al., 2020).
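A schematic of the clipped, KL-regularized surrogate with group-relative advantages (the clip range, KL weight, and simple KL estimator here are generic PPO/GRPO-style choices, not values from the paper):

```python
import numpy as np

def grpo_objective(logp_new, logp_old, rewards, beta=0.05, eps=0.2):
    """logp_new/logp_old: (G,) log-probs of G sampled eviction actions under
    the current and behavior policies; rewards: (G,) per-sample rewards.
    Returns the clipped surrogate minus a KL penalty toward the old policy."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative
    ratio = np.exp(logp_new - logp_old)
    surr = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    kl = (logp_old - logp_new).mean()    # crude single-sample KL estimate
    return surr.mean() - beta * kl
```

The reward itself would be task-specific, e.g. the (negated) mean-squared increase in modeling loss on low-entropy tokens described above.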
4. Structural and Computational Considerations
4.1 Data Structures
- Radix-tree (trie) caches: Employed in multi-agent workflows, where node- and subtree-based eviction priorities reflect the minimal steps-to-execution among all agents depending on the prefix (Pan et al., 10 Jul 2025).
- Fine-grained, per-node eviction scores: Calculated from anticipated “distance to reuse” and used to efficiently evict cache branches while maintaining coordination across concurrent workflows.
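A toy illustration of steps-to-execution eviction on trie nodes (class and field names are hypothetical; the actual KVFlow data structures are more involved):

```python
import heapq

class PrefixNode:
    """One node of a prefix trie holding a cached KV segment."""
    def __init__(self, token):
        self.token, self.children = token, {}
        # Minimal steps-to-execution among all agents whose prompts share this
        # prefix; smaller means needed sooner, so it should be evicted last.
        self.steps_to_exec = float('inf')

def evict_order(nodes):
    """Return node tokens in eviction order: farthest-from-reuse first."""
    heap = [(-n.steps_to_exec, i, n.token) for i, n in enumerate(nodes)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

Under memory pressure, branches whose dependent agents are many scheduling steps away are reclaimed first, while prefixes about to be reused stay resident.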
4.2 Complexity and Overhead
- Algorithmic steps, including priority queue updates and status tracking for overlapping CPU→GPU prefetch, are bounded by O(log N) per operation, with negligible metadata overhead (<1% of raw KV usage).
- Scoring operations (e.g., MLP forward passes, expected-attention contraction) dominate the per-compression cost but are optimized to scale linearly or nearly linearly with cache size and model dimension (Devoto et al., 1 Oct 2025, Dong et al., 3 Feb 2026).
5. Applications and Quantitative Results
5.1 Language and Reasoning Models
PaFu-KV (as instantiated by ForesightKV) yields state-of-the-art efficiency-accuracy trade-offs: With only half the cache budget, it preserves ≥90% of full-context Pass@1 rates on math benchmarks (e.g., on Qwen3-4B + AIME2024: 1K PaFu-KV achieves 54.5 vs. 44.8 for 2K R-KV). Throughput improvements of up to 9.8× at 32K context are reported, with eviction overhead below 3% of total inference time (Dong et al., 3 Feb 2026).
5.2 Multi-Agent LLM Workflows
PaFu-KV, as implemented in KVFlow, realizes 1.8–2.9× reductions in prefill latency compared to LRU or HiCache, secures near-perfect cache hit rates for shared prefixes, and supports high concurrency with tunable prefetching aggressiveness (Pan et al., 10 Jul 2025).
5.3 Video Generation
In real-time AR video diffusion, PaFu-KV reduces KV cache size by 30–50%, with only a ~0.2% drop in generation quality, while improving metrics such as subject consistency (90.1% → 93.9%) and halving quality drift on long-horizon sequences (Chen et al., 29 Jan 2026).
5.4 General Cache Compression
Expected-Attention–based PaFu-KV methods, even in a training-free setting, closely track or surpass the uncompressed baseline across a variety of LLM architectures and benchmark suites (e.g., Qwen3-8B, Gemma3-12B), outperforming previous attention-score and value-norm heuristics at aggressive compression ratios (Devoto et al., 1 Oct 2025).
6. Variants, Limitations, and Open Directions
- Scoring model specificity: Supervised and distillation-based policies are generally trained per model; cross-model transfer is an open question (Dong et al., 3 Feb 2026).
- Limited gain under hardware bottlenecks: When attention kernel memory costs are dwarfed by parameter size (e.g., with FlashAttention), further cache compression gives diminishing returns (Chen et al., 29 Jan 2026).
- Dynamic head/layer allocation: Static or globally uniform retention budgets may underutilize potential savings; adaptive, head-specific policy learning is suggested as a future avenue.
- Extensions to multimodal/retrieval-augmented settings: Handling heterogeneity in KV entry origin, modality, and reuse patterns remains a challenge (Dong et al., 3 Feb 2026).
- Combination with speculative decoding and compression: There is no published solution to end-to-end optimization uniting these axes, though synergistic approaches are plausible.
7. Summary Table: Key PaFu-KV Method Characteristics
| Method/Paper | Salience Basis | Learning Paradigm | Primary Domain |
|---|---|---|---|
| ForesightKV (Dong et al., 3 Feb 2026) | Max future attention | Supervised + RL | Reasoning LLMs |
| KVFlow (Pan et al., 10 Jul 2025) | Steps-to-execution | Heuristic/FM | Multi-Agent LLM flows |
| Expected Attention + Past Proxy (Devoto et al., 1 Oct 2025) | Future expectation + Past mean | Training-free | General LLM cache |
| DEAP Cache (Mangal et al., 2020) | Frequency, recency | Supervised + Hedging | General ML caches |
| Salience Estimation Head (Chen et al., 29 Jan 2026) | Teacher-distilled | Distillation | Video diffusion |
In sum, PaFu-KV denotes a family of cache eviction policies—rooted variously in supervised learning, reinforcement learning, closed-form probabilistic estimation, and distillation—that unify forward- and backward-looking signals to optimize cache footprint subject to maintained task performance. Their practical efficacy is established across language, agentic workflow, and video domains, with ongoing research addressing model adaptability, multimodal settings, and theoretical underpinnings.