
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Published 3 Dec 2025 in cs.LG and cs.AI | (2512.03324v1)

Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.

Summary

  • The paper presents TRIM-KV, a novel method that dynamically computes token retention scores at creation and evicts low-importance tokens under memory constraints.
  • It employs retention-gated attention, distillation, and capacity loss to align with LLM outputs while reducing computational overhead.
  • Experimental results demonstrate that TRIM-KV outperforms traditional KV eviction strategies across diverse benchmarks, enhancing both efficiency and interpretability.

Token Retention for Memory-Bounded KV Cache in LLMs

Introduction

The paper "Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs" introduces TRIM-KV, an innovative method designed to address the challenges of memory and computation constraints in long-horizon LLM inference. The intrinsic quadratic cost of self-attention and the growing demand for key-value (KV) cache storage are identified as bottlenecks in efficient model deployment. Traditional methods like quantization and heuristic KV eviction either impose high orchestration costs or rely on unreliable attention-based proxies of token importance.

TRIM-KV proposes a novel approach where each token's importance is evaluated at the time of its creation via a lightweight retention gate. The retention gate assigns a scalar retention score that reflects the long-term utility of a token for specific layers and heads. The score decays over time, leading to the eviction of low-importance tokens when memory limits are reached. This method ensures the KV cache maintains only the most crucial tokens, optimizing memory utilization and increasing efficiency, especially in memory-constrained environments.

Methodology

Retention-Gated Attention: The paper introduces a mechanism called retention-gated attention, which computes a retention score β ∈ [0, 1] for each token that decays exponentially over time. This decay mimics human memory by gradually reducing the weight of older tokens. A high β indicates that a token is critical and should be retained longer, whereas a low β means the token will fade quickly and be discarded.
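
One way to realize this decay, sketched below under the assumption of a simple power-law parameterization (the paper's exact per-layer, per-head formula may differ), is to let each token's effective score shrink with its age, so that a token with a high gate output fades slowly:

```python
import numpy as np

def retention_scores(betas: np.ndarray, t: int) -> np.ndarray:
    """Effective retention of each cached token at decoding step t.

    betas[i] in (0, 1) is the gate output for the token created at step i.
    A hypothetical exponential-decay parameterization: the token's score
    shrinks as betas[i] ** (t - i), so high-beta tokens fade slowly and
    low-beta tokens vanish almost immediately.
    """
    ages = t - np.arange(len(betas))  # age of each cached token, in steps
    return betas ** ages

# Token 0 has a strong gate (0.99), token 1 a weak one (0.5).
scores = retention_scores(np.array([0.99, 0.5, 0.9]), t=3)
```

With these values, the oldest token still outscores the younger but weakly gated one, which is exactly the behavior that lets "sink"-like tokens survive long horizons.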

Training Framework: The retention gates are trained using distillation from a frozen LLM combined with a capacity loss. This minimizes the inference overhead and allows the model to optimally manage the KV cache. By incorporating a distillation loss and next-token prediction loss, TRIM-KV ensures alignment with the original LLM's output distribution while exploring sparsity patterns.
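
A minimal sketch of such an objective is below; the combination of a KL distillation term with a budget-violation penalty follows the description above, but the exact form of the capacity term and the weighting `lam` are assumptions:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def trim_kv_loss(student_logits, teacher_logits, betas, budget, lam=0.1):
    """Hypothetical gate-training objective.

    - distill: KL(teacher || student) on next-token distributions,
      with the teacher (the frozen original LLM) held fixed.
    - capacity: penalizes the expected number of retained tokens
      (the sum of gate scores) only when it exceeds the budget.
    """
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    distill = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1))
    expected_kept = betas.sum(axis=-1)          # soft cache size per sequence
    capacity = np.mean(np.maximum(expected_kept - budget, 0.0))
    return distill + lam * capacity
```

When the gated model matches the teacher and the expected cache size stays under budget, both terms vanish, so gradients only push the gates toward sparser retention when the budget is actually violated.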

Eviction Policy: The scoring system drives the eviction policy: tokens with the minimal score β are evicted when the cache exceeds the preset memory budget. Eviction reduces to a simple score comparison, keeping the inference process straightforward and efficient (Figure 1).

Figure 1: Pareto frontiers of competing algorithms with different budgets on math benchmarks.
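
The eviction rule described above can be sketched as follows; this is a minimal single-head illustration, not the paper's per-layer, per-head implementation:

```python
def evict_if_over_budget(cache: dict, budget: int) -> dict:
    """Drop the lowest-retention entries until the cache fits its budget.

    `cache` maps token position -> current (decayed) retention score.
    Eviction is a plain min-score comparison, so it adds negligible
    overhead to the decoding loop.
    """
    while len(cache) > budget:
        victim = min(cache, key=cache.get)  # position with the faintest score
        del cache[victim]
    return cache

# Position 1 has the lowest score, so it is the first to go.
cache = {0: 0.97, 1: 0.25, 2: 0.90, 3: 0.60}
evict_if_over_budget(cache, budget=3)
```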

Experimental Results

The experimental results demonstrate significant improvements over competing KV eviction strategies in both controlled and less constrained memory settings. TRIM-KV exhibits superior performance across various tasks including mathematical reasoning datasets like GSM8K, MATH-500, and AIME24, as well as procedural generation and conversational benchmarks. Remarkably, it often surpasses full-cache models, revealing that selective retention can suppress noise from uninformative tokens.

TRIM-KV naturally recovers several heuristic strategies, such as sink tokens, sliding windows, and gist compression, without explicit design, indicating that the learned retention scores align closely with human intuition.

Qualitative Analysis

TRIM-KV's qualitative analysis shows that token retention scores serve as an insightful probe into LLM interpretability, reflecting layer- and head-specific dynamics. The model tends to assign high retention scores to task-critical tokens, while punctuation and filler tokens generally receive lower scores (Figure 2).

Figure 2: a) Average retention scores across all layers and heads of Qwen3-4B on tokens of an AIME24 example. b) Top 10 tokens with the highest (left table) and lowest (right table) average retention. c) The layer- and head-wise sparsity level estimated by token retentions.

Implications for AI Development

The approach presented by TRIM-KV suggests practical and theoretical advancements in optimizing LLM memory use, especially for applications requiring prolonged inference under strict memory constraints. The notion of treating retention scores as proxies for token importance shifts the focus from attention-based dynamics to intrinsic attributes of token utility over long contexts.

These advancements could influence future developments in AI, offering pathways to more efficient architectures that seamlessly balance performance and computational resource use. Additionally, the insights into head-specific dynamics may foster enhancements in model interpretability and diagnostics.

Conclusion

The paper showcases TRIM-KV as a compelling strategy for memory-bounded LLM operations. Its core contribution lies in leveraging retention scores for adaptive token caching, ensuring models remain efficient without retraining from scratch. This method potentially paves the way for future explorations of intrinsic token importance and memory-efficient model design tailored for complex reasoning and long-context scenarios.


Explain it Like I'm 14

Overview

This paper introduces a new way to help LLMs handle very long conversations or documents without running out of memory. The idea, called TRIM-KV, teaches the model to decide which past words or tokens are truly important to keep and which can be safely forgotten. This lets the model work well even when it has a strict memory limit.

What questions does the paper ask?

  • How can an LLM keep working well when it can only store a limited number of past tokens (memory slots)?
  • Can the model learn, in advance, which tokens will be useful later, instead of only looking at what it paid attention to recently?
  • Will this learned “keep-or-drop” strategy be fast, simple to use, and actually improve results on tough tasks?

How does the method work?

The problem: limited memory in LLMs

When an LLM reads or writes text, it stores summaries of past tokens (called a “KV cache”) so it can reuse them. But the more it reads, the more memory this cache needs. If the cache gets too big, the model slows down or runs out of space.

Think of the KV cache like a backpack with limited slots. As you keep walking (generating text), you add new items (tokens). If the backpack is full, you must decide which old item to remove to make room.

The idea: a “retention gate” and fading importance

TRIM-KV adds a tiny “retention gate” to each layer and head inside the model. When a token is created, this gate gives it a score between 0 and 1 that says how important it is to keep. Over time, this score slowly fades—like ink that gets lighter as time passes. Important tokens fade slowly; unimportant ones fade quickly.

When the backpack (KV cache) is full, the system removes the token with the lowest current score. This way, the cache always keeps the most useful tokens for the long run, with a natural preference for newer important tokens.

This fading score is inspired by how human memory works: we forget some things over time unless they’re strong or important.
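
Here is a toy "fading ink" version of the backpack story, purely illustrative (the fade factors and capacity are made up):

```python
def backpack_step(items: dict, fade: dict, new_item: str, capacity: int) -> dict:
    """Toy fading-memory backpack.

    Every step, each item's score shrinks by its personal fade factor.
    If the backpack is full, the faintest item is dropped before the
    new one goes in with a fresh score of 1.0.
    """
    for name in items:
        items[name] *= fade[name]           # older items fade each step
    if len(items) >= capacity:
        faintest = min(items, key=items.get)
        del items[faintest]                 # forget the least important item
    items[new_item] = 1.0
    return items

items = {"topic": 1.0, "filler": 1.0}
fade = {"topic": 0.99, "filler": 0.5, "answer": 0.9}
backpack_step(items, fade, "answer", capacity=2)  # "filler" fades fastest
```

The "topic" item survives because it fades slowly, just like a sink token with a high retention score.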

Training the gates: learning from a teacher and a budget

To teach the retention gates how to score tokens well, the authors:

  • Use a frozen, original LLM as a “teacher” and train the gates so the new model’s outputs look like the teacher’s. This is called distillation.
  • Add a “capacity loss” that acts like a strict budget reminder. If the model tries to keep too many tokens, it gets penalized. This encourages smart pruning.

Importantly, only the small gate networks are trained; the main LLM stays unchanged. This keeps training fast and simple.
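
In practice, gate-only training amounts to partitioning the model's parameters by name and marking only the gates as trainable. The sketch below assumes a hypothetical naming convention in which gate parameters contain `retention_gate`:

```python
def split_params(named_params):
    """Partition parameter names into trainable gates and frozen weights.

    Only parameters whose name mentions the (hypothetical)
    "retention_gate" module are trained; the base LLM stays frozen.
    """
    trainable, frozen = [], []
    for name, param in named_params:
        (trainable if "retention_gate" in name else frozen).append(name)
    return trainable, frozen

names = [("layers.0.attn.q_proj", None),
         ("layers.0.attn.retention_gate.w", None),
         ("lm_head", None)]
trainable, frozen = split_params(names)
```

In a framework like PyTorch, the frozen group would additionally have gradients disabled so the optimizer only ever sees the gate parameters.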

Using it during inference: simple, fast eviction

At run time, the model:

  • Assigns a retention score to each new token.
  • If the cache exceeds the memory limit, it evicts the token with the smallest current score.
  • Attention computation proceeds normally, with minimal extra cost.

This is much simpler than methods that move data between CPU and GPU or search through big caches.

What did they find?

The authors tested TRIM-KV on math problems, long procedural tasks, and very long chat contexts. They report that:

  • It beats popular “keep recent things” heuristics (like SnapKV, H2O, StreamingLLM) across many memory budgets, often by a large margin—especially when memory is tight.
  • It even outperforms a strong learned retrieval baseline that offloads memory to the CPU, without the overhead of moving data around.
  • Surprisingly, in some cases, it does better than keeping the full cache. That means dropping unhelpful tokens can act like a helpful “regularizer” that lowers noise.
  • The learned scores match human intuition. The model naturally:
    • Keeps “sink” tokens (important early tokens that set the topic).
    • Uses sliding windows when that helps.
    • Compresses the “gist” of text when appropriate.
  • The scores reveal different roles for layers and heads (parts of the model). Some prefer recent tokens; others keep numbers, variables, or sentence boundaries (like periods), which may work as mini-summaries.

Why is this important? It shows LLMs can use memory smarter, not just more. With the same or even less memory, the model can think better and faster.

Why it matters

  • Efficiency: TRIM-KV lets LLMs handle long contexts on smaller GPUs or less powerful machines, reducing cost.
  • Reliability: It avoids risky heuristics based only on recent attention, focusing on a token’s deeper, long-term usefulness.
  • Better results: In some tasks, selective memory beats using all memory by filtering out uninformative tokens.
  • Understanding models: The retention scores help us peek into how different parts of the model treat different kinds of tokens, which can improve interpretability.

Final thoughts and impact

TRIM-KV is a practical, lightweight upgrade for LLMs that helps them remember what matters and forget what doesn’t, all under a fixed memory limit. It can make long conversations, long documents, and step-by-step reasoning more efficient and sometimes even more accurate. In the future, this approach could be extended to multimodal inputs (like text plus images), tool use, or models trained from scratch with memory limits built in.
