
Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

Published 10 May 2026 in cs.CL, cs.AR, and cs.LG | (2605.09490v1)

Abstract: Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.

Authors (3)

Summary

  • The paper presents a four-tier memory hierarchy that decouples high-bandwidth memory constraints from inference accuracy through cumulative attention scoring.
  • The methodology partitions the KV-cache into HBM, DDR, compressed, and evicted tiers, achieving up to 91.5% GSM8K accuracy retention with reduced GPU memory usage.
  • Empirical results show that the approach outperforms standard eviction, reducing HBM occupancy by up to 50% while incurring only a 5–7% transfer overhead.

Semantics-Aware Memory Hierarchy for LLM Reasoning: Decoupling HBM Constraints from Accuracy

Problem Statement and Motivation

Chain-of-thought reasoning in LLMs places heavy demands on GPU high-bandwidth memory (HBM), because each inference request appends thousands of key-value (KV) pairs to the cache. Existing ways of relieving HBM pressure, such as permanent eviction or precision compression, cause catastrophic losses in reasoning accuracy: evicting 50% of tokens drops accuracy to near zero on benchmarks such as GSM8K, and compression is limited by how sensitive reasoning tokens are to reduced precision. The central question is whether all tokens must reside in HBM, or whether selective placement across heterogeneous memory types can preserve accuracy without strict HBM residency.

Semantics-Aware Four-Tier Memory Hierarchy

The proposed method establishes a four-tier, semantics-aware memory hierarchy for KV-cache management, classifying tokens during generation based on cumulative attention scoring:

  1. T0 (HBM): Highest-importance tokens and protected regions (prompt, sinks, recent window) reside in GPU HBM.
  2. T1 (DDR): Medium-importance tokens are offloaded to CPU DDR memory at full precision, asynchronously prefetched back to GPU before each attention operation.
  3. T2 (Compressed): Low-importance tokens are stored in CPU memory at reduced precision, implemented via quantization (currently suboptimal due to fidelity loss).
  4. T3 (Evicted): Lowest-importance tokens are permanently discarded.

Tier assignments are recomputed every 64 steps to amortize the sorting cost, with token importance measured by cumulative attention history. Crucially, tokens offloaded to T1 are recalled at full precision, so their exact contribution to the attention computation is preserved; this is what the zero-approximation-error guarantee formalizes. T2 tokens are stored at reduced precision and are therefore not covered by that guarantee.
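The tier assignment described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names, the ratio parameters, and the rule that protected tokens always sort ahead of everything else are assumptions made for the example. Only the four tiers, the cumulative-attention ranking, and the 64-step recompute interval come from the text.

```python
T0_HBM, T1_DDR, T2_COMPRESSED, T3_EVICTED = 0, 1, 2, 3
RECOMPUTE_INTERVAL = 64  # steps between tier recomputations, as stated in the text


def assign_tiers(cum_attn, protected, hbm_ratio, ddr_ratio, compressed_ratio):
    """Rank tokens by cumulative attention and cut the ranking into tiers.

    cum_attn:  per-token cumulative attention scores (list of floats).
    protected: per-token bools for prompt, sink, and recent-window tokens.
    The three ratios are fractions of the cache; the remainder is evicted.
    (All parameter names are hypothetical.)
    """
    n = len(cum_attn)
    # Protected tokens first, then descending score; index breaks ties.
    order = sorted(range(n), key=lambda i: (not protected[i], -cum_attn[i], i))
    # HBM must be at least large enough to hold every protected token.
    n_hbm = max(round(hbm_ratio * n), sum(protected))
    n_ddr = round(ddr_ratio * n)
    n_cmp = round(compressed_ratio * n)

    tiers = [T3_EVICTED] * n
    for rank, i in enumerate(order):
        if rank < n_hbm:
            tiers[i] = T0_HBM
        elif rank < n_hbm + n_ddr:
            tiers[i] = T1_DDR
        elif rank < n_hbm + n_ddr + n_cmp:
            tiers[i] = T2_COMPRESSED
    return tiers


def maybe_reassign(step, tiers, cum_attn, protected, ratios):
    """Recompute tiers only every RECOMPUTE_INTERVAL steps to amortize sorting."""
    if step % RECOMPUTE_INTERVAL == 0:
        return assign_tiers(cum_attn, protected, *ratios)
    return tiers


# Toy cache of 100 tokens: score grows with position, 4 sink + 4 recent protected.
scores = [float(i) for i in range(100)]
protected = [i < 4 or i >= 96 for i in range(100)]
tiers = assign_tiers(scores, protected, hbm_ratio=0.50, ddr_ratio=0.47,
                     compressed_ratio=0.0)
evicted = [i for i, t in enumerate(tiers) if t == T3_EVICTED]
```

With a 50% HBM budget and 3% eviction, the toy run keeps the low-scoring sink tokens in HBM because they are protected and evicts only the three lowest-scoring unprotected tokens.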

Empirical Findings: Offloading vs. Eviction

Controlled experiments across three model scales (7B, 14B, 32B), four benchmarks (GSM8K, MATH-500, MATH Level-5, ARC-Challenge), and a 3x3 grid of HBM/eviction ratios demonstrate:

  • Accuracy depends solely on the eviction ratio, not the HBM ratio. At a fixed eviction ratio (e.g., 3%), accuracy varies by only 4–6 percentage points across HBM budgets of 30–70%, a difference reported as statistically insignificant.
  • Hierarchy outperforms pure eviction: with only 3% eviction, the hierarchy retains 91.5% of baseline accuracy on GSM8K and 71% on MATH-500, while standard eviction strategies collapse to single-digit accuracy retention.
  • Scale robustness: Hierarchy preserves accuracy across scales, matching or exceeding full-cache baseline accuracy with the 14B model (90% vs. 86%, halving HBM occupancy).

Strong numerical results include 65–91.5% retention of baseline accuracy at 3% eviction (GSM8K, n=200) and 2–48 GB HBM savings at production batch sizes, with real GPU-CPU movement confirming transfer overhead is limited to 5–7% of inference time.

Theoretical Analysis: Zero-Approximation-Error Offloading

The hierarchy's correctness rests on two results:

  • Offloaded tokens (T1) recalled at full precision contribute precisely the same terms in attention computation as if they had remained in HBM.
  • Only permanent eviction (T3) introduces approximation error, and that error is bounded by the total cumulative attention weight of the evicted tokens, which the scoring is designed to keep small.

This decouples memory residency from inference accuracy—a fundamentally different operating principle from previous eviction-based or compression-based approaches.
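Both results can be checked numerically on a toy single-query attention. The sketch below is an illustration only: the "DDR store" is simulated with an ordinary Python copy, and the eviction bound it checks (output error at most twice the evicted attention mass times the largest value-vector norm) is a standard consequence of softmax renormalization, used here as a stand-in for the paper's bound, whose exact form is not given in this summary.

```python
import math
import random


def attention(q, keys, values):
    """Single-query softmax attention; returns (output vector, attention probs)."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    probs = [w / z for w in weights]
    dim = len(values[0])
    out = [sum(p * v[j] for p, v in zip(probs, values)) for j in range(dim)]
    return out, probs


def norm(v):
    return math.sqrt(sum(x * x for x in v))


rng = random.Random(0)
d, n = 8, 32
q = [rng.gauss(0, 1) for _ in range(d)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

full_out, probs = attention(q, K, V)

# Offloading: tokens 8..19 leave "HBM" as full-precision copies and are
# prefetched back into their original positions before attention.
ddr_ids = set(range(8, 20))
ddr_store = {i: (list(K[i]), list(V[i])) for i in ddr_ids}
hbm_store = {i: (K[i], V[i]) for i in range(n) if i not in ddr_ids}
merged = {**hbm_store, **ddr_store}
offload_out, _ = attention(q, [merged[i][0] for i in range(n)],
                           [merged[i][1] for i in range(n)])

# Eviction: tokens 8..10 are dropped for good, so softmax renormalizes.
evicted = {8, 9, 10}
keep = [i for i in range(n) if i not in evicted]
evict_out, _ = attention(q, [K[i] for i in keep], [V[i] for i in keep])

evicted_mass = sum(probs[i] for i in evicted)
err = norm([a - b for a, b in zip(full_out, evict_out)])
bound = 2 * evicted_mass * max(norm(v) for v in V)
```

Here `offload_out` equals `full_out` exactly, because the recalled keys and values are bit-identical copies placed in the same order, while `err` is nonzero but stays below `bound`. The bound shrinks with the evicted tokens' attention mass, which is why ranking by attention before evicting limits the damage.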

Hardware and System Considerations

PCIe transfer latency and bandwidth measurements indicate that offload and prefetch time scales linearly with token count, with bandwidth saturating at ~22 GB/s (GPU→CPU) and ~15 GB/s (CPU→GPU). Transfer overhead remains modest because prefetching is batched and differential. In practical deployments the KV cache can dominate GPU memory allocation (up to 70% for 70B models with int4 quantization), so the hierarchy's aggressive offloading to DDR reduces HBM requirements and permits larger batch sizes or longer context windows without accuracy loss.
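The linear scaling can be made concrete with back-of-the-envelope arithmetic. In the sketch below, only the two saturation bandwidths come from the text; the model dimensions (28 layers, 4 KV heads, head dimension 128, fp16) are illustrative assumptions rather than values from the paper, and 1 GB is taken as 10^9 bytes.

```python
GPU_TO_CPU_GBPS = 22.0  # measured saturation bandwidth quoted in the text
CPU_TO_GPU_GBPS = 15.0  # measured saturation bandwidth quoted in the text


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys plus values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def transfer_ms(num_tokens, bytes_per_token, bandwidth_gbps):
    """Idealized transfer time in milliseconds at a fixed bandwidth (no latency term)."""
    return num_tokens * bytes_per_token / (bandwidth_gbps * 1e9) * 1e3


# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)

offload_ms = transfer_ms(1000, per_token, GPU_TO_CPU_GBPS)   # GPU -> CPU
prefetch_ms = transfer_ms(1000, per_token, CPU_TO_GPU_GBPS)  # CPU -> GPU
```

Under these assumptions each token occupies 57,344 bytes, and prefetching is the slower direction because CPU→GPU bandwidth is lower. The model ignores fixed per-transfer latency, which batching is meant to amortize.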

Comparison with Existing Approaches

Most prior KV-cache management (eviction, compression, offloading) employs binary keep/discard policies or operates solely within HBM. Contemporary reasoning-aware scoring methods (R-KV, TriAttention, ThinKV) achieve competitive accuracy but discard non-retained tokens, incurring irrecoverable information loss. Head-to-head experiments show that even state-of-the-art scoring cannot recover accuracy at comparable cache budgets: R-KV achieves only 32% accuracy at a 1024-token budget, compared to 56–62% for the hierarchy at equivalent eviction ratios.

Cumulative attention scoring outperforms alternatives (value norm and redundancy-based signals), with ablations indicating that value norm and redundancy introduce noise for reasoning tokens due to non-stationary referencing patterns.

Practical and Theoretical Implications

The hierarchy addresses the information density gradient across reasoning tasks—harder tasks and longer chains pack more referentially important tokens, demanding preservation rather than eviction. Practically, this enables single-GPU deployment for models previously requiring multi-GPU setups; theoretically, it reframes KV-cache management as a placement problem, not a compression or eviction problem. The hierarchy is orthogonal to scoring advances; it provides the infrastructure for future plug-in scoring algorithms, compression techniques, and cross-architecture generalization.

Future Directions

Further research is warranted on reasoning-aware compression for the T2 tier, cross-model and cross-architecture validation (e.g., Llama, Mistral), and optimization of PCIe bandwidth utilization in production systems. Pursuing segment-differentiated precision and leveraging improved importance signals may yield additional memory and compute gains. As LLMs scale in both model size and context length, memory hierarchies will become increasingly consequential.

Conclusion

This work demonstrates that for chain-of-thought LLM reasoning, accuracy is determined by the number of permanently discarded tokens, not by where the retained tokens sit in the memory hierarchy. By deploying a four-tier, semantics-aware memory hierarchy with real-time importance scoring, HBM occupancy can be reduced by up to half while preserving baseline accuracy. System prototype results confirm the approach's scalability and modest transfer overhead. The memory hierarchy paradigm rearchitects inference-time resource allocation, supporting sustained growth in context length and batch size without compromising reasoning quality (2605.09490).
