SkipKV Framework: Efficient KV Cache Compression

Updated 30 December 2025
  • SkipKV framework is a training-free method for compressing key-value caches in large reasoning models, achieving significant memory and throughput gains.
  • It employs a sentence-scoring metric and sentence-level eviction to detect and remove redundant content while preserving semantic coherence.
  • An adaptive activation steering mechanism directs model outputs toward concise, execution-focused reasoning, enhancing efficiency and performance.

SkipKV (Selective Skipping of KV Generation and Storage) is a training-free framework for compressing key-value (KV) caches and reducing redundant reasoning during inference with large reasoning models (LRMs), especially under chain-of-thought (CoT) prompting. SkipKV achieves memory and throughput gains by segmenting model outputs at the sentence level, applying a redundancy-aware eviction policy, and dynamically steering model generation toward concise and execution-focused responses. The framework is compatible with standard decoder architectures, imposes no requirement for additional training, and attains significant improvements in efficiency metrics relative to contemporary KV compression approaches (Tian et al., 8 Dec 2025).

1. Sentence-Scoring Metric for Redundancy Detection

SkipKV identifies semantically redundant content by embedding sentences within the generated text and measuring inter-sentence similarity with cosine distance. For a batch with last-layer activations $H \in \mathbb{R}^{bs \times N \times d}$ (batch size $bs$, sequence length $N$, hidden dimension $d$), the output sequence is tokenized into sentences based on punctuation/newline delimiters (such as ".\n", ")\n\n"). Each sentence $i$ has a token span $[b_i : e_i]$; its mean embedding is computed as:

$$v_i = \operatorname{mean}\left(H[k]_{b_i:e_i}\right), \quad v_i \in \mathbb{R}^d$$

Sentence embeddings are $\ell_2$-normalized. Pairwise Sentence Similarity (PSS) is defined as:

$$\mathrm{PSS}(v_i, v_j) = v_i^\top v_j$$

A sentence $i$ is flagged as redundant relative to a later sentence $j$ if $\mathrm{PSS}(v_i, v_j) \geq \tau$ for a high threshold $\tau$ (typically $0.95$). Empirically, incorrect CoT traces exhibit $1.7\times$ more high-similarity sentences than correct ones. Redundancy removal targets earlier sentences to preserve recent, potentially more accurate reasoning while reducing cache growth (Tian et al., 8 Dec 2025).
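The redundancy check above can be sketched in a few lines of NumPy. This is an illustrative reimplementation from the description, not the authors' code; the shape of `H` and the span representation are assumptions.

```python
import numpy as np

def pairwise_sentence_similarity(H, spans, tau=0.95):
    """Flag redundant sentences via cosine similarity of mean hidden states.

    H     : (N, d) last-layer hidden states for one sequence (assumed shape).
    spans : list of (b_i, e_i) token spans, one per sentence.
    tau   : redundancy threshold (paper default 0.95).
    Returns indices of earlier sentences flagged as redundant w.r.t. a later one.
    """
    # Mean-pool each sentence span, then l2-normalize.
    V = np.stack([H[b:e].mean(axis=0) for b, e in spans])
    V /= np.linalg.norm(V, axis=1, keepdims=True)

    # PSS(v_i, v_j) = v_i^T v_j for all pairs.
    pss = V @ V.T

    redundant = set()
    for j in range(len(spans)):
        for i in range(j):           # earlier sentence i vs. later sentence j
            if pss[i, j] >= tau:
                redundant.add(i)     # evict the earlier copy, keep the recent one
    return sorted(redundant)
```

Flagging the *earlier* member of each high-similarity pair matches the paper's policy of preserving the most recent reasoning.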

2. Sentence-Level KV Cache Eviction Algorithm

SkipKV performs coarse-grained, sentence-level eviction of K/V entries from the cache, executing every $\Delta_t$ decoding steps (default: 128 tokens) to enforce a target cache budget. The algorithm maintains mappings from generated sentences to current cache spans, updates these mappings at each eviction step, and computes final token-wise scores as a weighted combination of attention importance, K-vector similarity, and redundancy ($\lambda$) indicators:

  • For each attention head $h_k$ and token $t$, compute:
    • $I_\alpha^{h_k}(t)$: normalized attention importance (over sliding window $\alpha$)
    • $R^{h_k}(t)$: mean cosine similarity among K-vectors
  • Aggregate score:

$$I_{\text{final}}(t) = \sigma \cdot I_\alpha(t) - (1 - \sigma) \cdot R(t) - [\lambda \text{ if redundant sentence}]$$

with tradeoff hyperparameter $\sigma = 0.1$.

Sentences carrying the redundancy penalty $\lambda \approx 0.95$ are prioritized for removal (dominating token-level scores of $\sim 0.1$), ensuring that their complete token spans are pruned first. This strategy prevents semantic fragmentation, a limitation of token-wise eviction methods, and maintains logical consistency in mathematical or code-related CoT traces.
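A minimal sketch of this aggregate score and the eviction step, per cached token and averaged over heads. The function names and array layout are illustrative assumptions; the paper's exact per-head aggregation is not specified here.

```python
import numpy as np

def final_token_scores(attn_importance, k_redundancy, redundant_mask,
                       sigma=0.1, lam=0.95):
    """Aggregate eviction score per cached token (illustrative sketch).

    attn_importance : (T,) normalized attention importance over a sliding window.
    k_redundancy    : (T,) mean cosine similarity among K-vectors.
    redundant_mask  : (T,) True for tokens in a PSS-flagged redundant sentence.
    Low-scoring tokens are evicted first.
    """
    score = sigma * attn_importance - (1.0 - sigma) * k_redundancy
    # The lambda penalty (~0.95) dominates token-level scores (~0.1), so whole
    # redundant sentences sink to the bottom and are pruned as complete spans.
    return score - lam * redundant_mask.astype(float)

def evict(scores, budget):
    """Keep the `budget` highest-scoring token positions, in cache order."""
    keep = np.argsort(scores)[::-1][:budget]
    return np.sort(keep)
```

Because $\lambda$ is applied uniformly to every token of a flagged sentence, the top-`budget` selection naturally drops those spans as units rather than fragmenting them.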

Table: Default SkipKV Hyperparameters

| Parameter | Symbol | Value/Range |
|---|---|---|
| KV-eviction interval | $\Delta_t$ | 128 tokens |
| Redundancy threshold | $\tau$ | 0.95 |
| Importance/redundancy tradeoff | $\sigma$ | 0.1 |
| Cache budget | $B$ | 20%–50% of FullKV |

Batch grouping by prefill length reduces padding-induced effective budget loss, yielding further accuracy improvements.
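The batch-grouping idea reduces to a sort-then-chunk pass over the prompt set; a minimal sketch (the function name and list-of-token-lists representation are assumptions):

```python
def group_by_prefill_length(samples, batch_size):
    """Sort samples by prompt length, then cut into homogeneous batches.

    Grouping similar-length prompts minimizes padding, so less of the fixed
    KV budget is spent on pad tokens. `samples` is a list of token-id lists.
    """
    ordered = sorted(samples, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]
```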

3. Adaptive Activation Steering Mechanism

SkipKV reduces repetitive, non-execution reasoning (e.g., reflection, validation loops) by injecting a learned “steering vector” into the model’s hidden activations at each decoding step. For 500 annotated CoT traces, tokens are labeled as execution (“E”) or non-execution (“O”). The steering vector is computed as the difference between average hidden states for E and O tokens:

$$V = \mu_E - \mu_O$$

During inference, at each step $t$ and a designated decoder layer $k$, the activation is updated:

$$H_k \leftarrow H_k + \alpha_t V$$

where $\alpha_t = \alpha_0 + \gamma N_o$, with $\alpha_0$ (default: 1.0), $\gamma$ (default: 0.02), and $N_o$ the count of prior non-execution sentences. This mechanism adaptively increases steering strength when excessive non-execution content is detected, biasing the model toward concise, action-oriented reasoning steps. A plausible implication is that steering may improve both computational efficiency and result clarity across diverse CoT tasks (Tian et al., 8 Dec 2025).
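The steering computation is small enough to write out directly; a sketch under the assumption that hidden states are available as plain arrays (in practice this would run inside a forward hook on layer $k$):

```python
import numpy as np

def steering_vector(H_exec, H_nonexec):
    """V = mu_E - mu_O: mean hidden state of execution tokens minus that of
    non-execution tokens, from the annotated CoT traces."""
    return H_exec.mean(axis=0) - H_nonexec.mean(axis=0)

def steer(H_k, V, n_nonexec, alpha0=1.0, gamma=0.02):
    """Apply adaptive steering at one decoder layer: H_k + alpha_t * V, with
    alpha_t = alpha0 + gamma * N_o growing as non-execution sentences
    accumulate (defaults are the paper's reported values)."""
    alpha_t = alpha0 + gamma * n_nonexec
    return H_k + alpha_t * V
```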

4. Memory Footprint, Throughput, and Computational Analysis

SkipKV constrains cache memory scaling: instead of the $O(T \cdot d)$ growth with total generation length $T$ in standard FullKV caching, SkipKV enforces a fixed cache size $O(B \cdot d)$, where $B \ll T$ is the sentence-pruned budget. Empirical results show up to $6.7\times$ cache compression (AIME-24, Qwen-14B) and $2$–$4\times$ compression across other benchmarks without significant drops in accuracy.
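A back-of-envelope calculation makes the scaling concrete. The model configuration below (28 layers, 4 KV heads of dimension 128, fp16) is a hypothetical example, not a configuration reported in the paper:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache size: one K and one V vector per layer,
    per KV head, per cached token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical config: 16K-token FullKV trace vs. a 20% sentence-pruned budget.
full = kv_cache_bytes(16_384, layers=28, kv_heads=4, head_dim=128)
budget = kv_cache_bytes(int(16_384 * 0.20), layers=28, kv_heads=4, head_dim=128)
```

Under these assumptions a FullKV trace holds roughly 0.9 GB per sequence, while the pruned budget caps it at a fifth of that, independent of how long generation runs.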

The reduction in generated token length (by up to $48\%$ vs. FullKV) results in fewer expensive forward passes. Throughput gains are substantial: for GSM8K on Qwen-7B, end-to-end speed improves by up to $9.6\times$ over FullKV and $1.7\times$ over the state-of-the-art token-wise R-KV at an equivalent budget.

Table: Throughput (samples/min) for GSM8K, Qwen-7B (A100-40GB)

| Method | Batch=10 | Batch=50 | Batch=100 |
|---|---|---|---|
| FullKV | 4.1 | 2.6 | |
| R-KV | 5.8 | 13.7 | 20.0 |
| SkipKV | 9.7 | 22.7 | 25.4 |

5. Empirical Evaluation and Ablation Studies

SkipKV has been benchmarked on DeepSeek-R1-Distill-Qwen-7B, Qwen-14B, and Llama-8B across mathematical (MATH-500, AIME-24, GSM8K) and code-generation (LiveCodeBench) datasets, with sequence limits of 8K–16K tokens per batch.

Key findings:

  • Accuracy Retention: SkipKV sustains FullKV-level pass@1 accuracy at only 15–20% of original cache budget. Competing methods (e.g., R-KV, H2O) deteriorate below 50% budget.
  • Token Reduction: Generated sequence lengths are cut by up to 28% (vs. FullKV) and 32–48% vs. R-KV under similar compression budgets.
  • Ablation: Adding sentence scoring and adaptive steering increases accuracy by 13.3 pp and reduces token output by up to 32% at 27% budget. Full SkipKV with batch grouping delivers up to +20 pp accuracy and −30% tokens compared to R-KV at equivalent cache allocation.
  • Batch Grouping: Sorting samples by prefill length into homogeneous batches minimizes padding inefficiency; e.g., on MATH-500 at 26% nominal budget, usable budget rises (21→25%) and accuracy improves by +5.6 pp (77.8→83.4%).
  • Qualitative Analysis: Token-wise eviction (R-KV) risks logically damaging removals (e.g., number fragments inside key steps), whereas SkipKV’s sentence-level scope preserves semantic and logical coherence.

6. Implementation and Usage Considerations

SkipKV is designed as a training-free, inference-time enhancement. It operates directly on last-layer decoder hidden states, without external sentence transformers. The algorithm integrates with contemporary decoder and attention implementations such as FlashAttention-2. Key defaults:

  • Hyperparameters: $\Delta_t = 128$, $\tau = 0.95$, $\sigma = 0.1$, $\alpha_0 = 1.0$, $\gamma = 0.02$
  • Steering layer: 20 (7B/8B models), 35 (14B models)
  • Sentence segmentation, a prerequisite, is punctuation-based and may require domain-specific adjustments; applying it to free-form generative text may reduce segmentation fidelity.
  • KV quantization is not addressed by SkipKV and remains orthogonal.
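The punctuation-based segmentation noted above can be approximated with a short regex pass. The exact delimiter set is an assumption; the paper's rules may differ, which is precisely why domain-specific adjustment is called out:

```python
import re

def split_sentences(text):
    """Punctuation/newline-based segmentation (illustrative delimiters).

    Splits after '.', '?', '!', or ')' when followed by a newline, and on
    blank lines; drops empty fragments."""
    parts = re.split(r'(?<=[.?!)])\n+|\n{2,}', text)
    return [p.strip() for p in parts if p and p.strip()]
```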

Limitations are mostly configuration-dependent: thresholds and steering strength may need retuning for out-of-domain or multimodal tasks. SkipKV requires no modification to model weights.

7. Relation to SwiftKV and Broader KV Compression Approaches

While SkipKV focuses on inference-time, training-free sentence-level cache manipulation and generation steering, related work includes SwiftKV (Qiao et al., 2024). SwiftKV targets the prefill (prompt) phase in Transformer-based LLMs, offering a model-level transformation (SingleInputKV) to eliminate forward computation in higher layers for prompt tokens and a corresponding micro-distillation step to preserve generation quality. SwiftKV can be combined with grouped KV caching and quantization for up to $4\times$ memory savings, as well as significant FLOP reductions.

Contrasts include SkipKV’s unique leveraging of dynamic redundancy scoring, sentence-scope evictions, and adaptive activation modulation, versus SwiftKV’s prompt-focused, knowledge-distilled computational bypass of later layers. Both approaches contribute to the rapidly evolving landscape of efficient large-model inference, with SkipKV particularly suited for scenarios with verbose, reasoning-centric chains of thought.


SkipKV establishes a robust paradigm for inference-time LLM efficiency by combining coarse-grained, content-aware eviction, dynamic response steering, and batch-preprocessing, delivering state-of-the-art compression and speedups while maintaining or improving accuracy across complex reasoning tasks (Tian et al., 8 Dec 2025).
