KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
Abstract: Test-time scaling is a powerful approach to obtain better reasoning in LLMs, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state-of-the-art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
Explain it Like I'm 14
KVarN: Making AI "Memory" Smaller without Messing Up Its Thinking
Overview (What this paper is about)
This paper is about a way to shrink the "memory" that LLMs use while they are thinking and writing long answers. That memory is called the KV-cache. The new method, called KVarN, compresses this memory very aggressively (down to about 1/7 the usual size) but keeps the model's reasoning quality almost the same. It especially helps on long, step-by-step problems like math, coding, and following detailed instructions.
Key Questions the Paper Tries to Answer
- Can we compress the KV-cache to save lots of memory during long generations, without making the model's answers worse?
- Why do current compression methods break down when the model generates text step by step (instead of reading a long prompt all at once)?
- What kinds of compression mistakes hurt the most, and how can we prevent them?
How They Did It (Methods, in simple terms)
First, a quick idea of terms:
- KV-cache: A model's short-term memory while it generates text. It stores "keys" (K) and "values" (V) that help the model pay attention to earlier words.
- Quantization: Squeezing numbers into fewer bits (like turning a full-color photo into a small file with fewer colors). It saves memory but can add small errors.
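To make "fewer bits" concrete, here is a minimal sketch of 2-bit round-to-nearest quantization with a scale and an offset. The function names `quantize_rtn` and `dequantize` are made up for this illustration, and this is plain min/max quantization, not the KVarN method itself.

```python
import numpy as np

def quantize_rtn(x, bits=2):
    """Asymmetric round-to-nearest quantization of a 1-D array to `bits` bits."""
    levels = 2 ** bits - 1                      # 3 steps -> 4 representable values at 2 bits
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Map each value to the nearest integer code in [0, levels].
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo                     # `lo` plays the role of the offset (zeropoint)

def dequantize(codes, scale, zero):
    """Map integer codes back to approximate real values."""
    return codes.astype(np.float32) * scale + zero

x = np.array([-1.0, -0.2, 0.1, 0.9], dtype=np.float32)
codes, scale, zero = quantize_rtn(x)
x_hat = dequantize(codes, scale, zero)          # close to x, but not identical
```

Each number is now stored as one of four codes, and the rounding error per value is at most half of `scale`. That small error is the price of the memory savings.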
What's the problem?
- Many methods are tested only when the model reads a long prompt once ("prefill"), not when it generates text step by step.
- During generation, errors can pile up over time. Think of photocopying a photocopy again and again: small mistakes accumulate.
What kind of error is worst?
- Each token (word piece) has a "scale" or "loudness" (its overall size). The paper shows that the biggest mistakes come from getting this scale wrong for some tokens, not from tiny twists in direction. These "scale" mistakes are rare but very harmful.
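One way to picture the difference between a "scale" mistake and a "direction" mistake is to measure them separately. The sketch below is an illustration only; `split_error` is a hypothetical helper, and the paper's exact error metrics may differ.

```python
import numpy as np

def split_error(x, x_hat):
    """Return (relative magnitude error, angle in radians) between x and its reconstruction."""
    nx, nh = np.linalg.norm(x), np.linalg.norm(x_hat)
    magnitude_err = abs(nh - nx) / nx
    cos = np.clip(np.dot(x, x_hat) / (nx * nh), -1.0, 1.0)
    return float(magnitude_err), float(np.arccos(cos))

x = np.array([3.0, 4.0])
scaled = 1.5 * x                   # same direction, wrong "loudness"
rotated = np.array([4.0, 3.0])     # same "loudness", different direction

scale_mistake = split_error(x, scaled)       # large magnitude error, zero angle
direction_mistake = split_error(x, rotated)  # zero magnitude error, nonzero angle
```

A key that is 1.5 times too long makes every attention score computed against it 1.5 times too large, which is one intuition for why scale mistakes can hurt so much.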
What's KVarN's trick? KVarN uses two simple moves that work together:
- Hadamard rotation: Imagine shuffling and spreading information evenly across channels so no single channel is too "spiky." This makes the data easier to compress.
- Variance normalization in two directions (dual-scaling): Imagine turning volume knobs so every row (per token) and every column (per channel) has a balanced volume. This keeps each token's "loudness" right, which prevents those harmful scale mistakes.
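The two moves can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not the authors' implementation: it uses root-mean-square as a stand-in for variance, a fixed iteration count, and the hypothetical names `hadamard` and `dual_scale_normalize`.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_scale_normalize(X, iters=8, eps=1e-8):
    """Alternately rescale rows (tokens) and columns (channels) toward unit RMS.

    Returns X_norm and the accumulated scales, with
    X == row_scale[:, None] * X_norm * col_scale[None, :].
    """
    X_norm = X.copy()
    row_scale = np.ones(X.shape[0])
    col_scale = np.ones(X.shape[1])
    for _ in range(iters):
        r = np.sqrt(np.mean(X_norm ** 2, axis=1)) + eps   # per-token "volume"
        X_norm /= r[:, None]
        row_scale *= r
        c = np.sqrt(np.mean(X_norm ** 2, axis=0)) + eps   # per-channel "volume"
        X_norm /= c[None, :]
        col_scale *= c
    return X_norm, row_scale, col_scale

rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))     # synthetic stand-in for a block of keys
K[:, 3] *= 20.0                    # one "spiky" channel
K[7, :] *= 10.0                    # one unusually "loud" token

H = hadamard(64)
K_rot = K @ H                                      # spread the spiky channel across all channels
K_norm, row_scale, col_scale = dual_scale_normalize(K_rot)
K_back = (row_scale[:, None] * K_norm * col_scale[None, :]) @ H.T   # undo both steps
```

In a real quantizer, `K_norm` would be rounded to low precision and the two scale vectors stored alongside it. Here nothing is rounded, so `K_back` simply recovers `K`, which shows that both steps are reversible.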
How do they test it?
- Besides standard tests, they introduce a "pseudo-decode" setup that mimics real generation: after every small block of tokens (e.g., 128), they compress the KV-cache and then keep going. This shows how errors actually accumulate over time during long reasoning.
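The shape of such a block-wise loop can be sketched with a toy stand-in for the model. Everything here is hypothetical (the `pseudo_decode` name, the random "model", the coarse rounding); it only shows the protocol of "generate a block, compress it, keep going", not the paper's evaluation code or results.

```python
import numpy as np

def pseudo_decode(step_fn, quantize_fn, n_tokens, block=128):
    """Produce n_tokens cache entries, compressing the cache after every `block` new tokens.

    step_fn(history) -> new entry computed from the (partly compressed) history.
    quantize_fn(block_array) -> lossy reconstruction of a (block, dim) array.
    Returns the full-precision entries as they were produced at each step.
    """
    cache, pending, produced = [], [], []
    for _ in range(n_tokens):
        entry = step_fn(cache + pending)
        produced.append(entry)
        pending.append(entry)
        if len(pending) == block:
            cache.extend(quantize_fn(np.stack(pending)))  # compress, then keep decoding
            pending = []
    return np.stack(produced)

# Toy stand-ins: a fixed random map as the "model", coarse rounding as the quantizer.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) / 4.0
x0 = rng.normal(size=16)

def toy_step(history):
    context = np.mean(history, axis=0) if history else x0
    return np.tanh(W @ context + x0)

def coarse_round(block_array):
    return np.round(block_array * 2.0) / 2.0   # snap to a grid with step 0.5

reference = pseudo_decode(toy_step, lambda b: b, n_tokens=64, block=8)
quantized = pseudo_decode(toy_step, coarse_round, n_tokens=64, block=8)
drift = np.linalg.norm(reference - quantized, axis=1)   # per-step deviation from the reference
```

The first block matches the reference exactly because nothing has been compressed yet. Every later step is computed from a history that already contains rounding errors, which is the effect a prefill-only test cannot see.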
Main Findings (What they discovered and why it matters)
- Scale mistakes dominate the worst errors: The most damaging errors come from getting a token's overall size wrong. Fixing these outliers helps more than reducing lots of small, harmless errors.
- KVarN reduces error accumulation: In the realistic, step-by-step setting, KVarN keeps errors from piling up as much as before.
- Strong results at very low precision (about 2.3 bits per value): On tough generative benchmarks (MATH500, AIME24, HumanEval, IFEval), KVarN matches or beats previous methods while using roughly 1/7 the memory of standard 16-bit storage.
- Tiny speed cost: The extra work (the normalization step and using two scales) adds roughly 1% overhead or less in practice, which is negligible compared to generation time.
Here are the kinds of tasks where KVarN shines:
- MATH500 and AIME24: Competition-style math that needs long chains of thought.
- HumanEval: Writing correct Python code from problem descriptions.
- IFEval: Following strict instructions (like exact formats and length limits).
Why This Matters (Implications)
- Longer, better thinking within the same memory: With KVarN, models can handle longer reasoning without running out of memory or slowing down too much.
- Cheaper deployment: Using less memory means running strong models on smaller or fewer GPUs, making advanced reasoning more accessible.
- More reliable long answers: By preventing error build-up, KVarN helps models stay accurate over long, step-by-step solutions (like solving multi-step math or writing multi-function code).
- Plays well with real systems: The method fits into popular serving frameworks (they even provide a vLLM implementation), making it practical to adopt.
In one sentence
KVarN keeps the model's per-token "volume" correct while compressing, which stops small mistakes from snowballing, letting LLMs think longer and better using much less memory.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains uncertain or unexplored in the paper, phrased to guide concrete future research.
- Generalization to larger and diverse architectures:
  - Does KVarN's advantage persist on substantially larger dense models (e.g., 34B-70B+) and on MoE models, grouped/multi-query attention, or sliding-window/local attention variants?
  - How does KVarN interact with architectural features such as ALiBi, different RoPE variants, cross-attention (for multi-modal models), or architectures without a KV cache (SSMs), beyond the brief limitation note?
- Robustness across tasks and data distributions:
  - Performance under long-form generation beyond reasoning/coding (e.g., summarization, long conversational dialogs, RAG-heavy workflows, multilingual settings) is not evaluated.
  - Sensitivity to domain shifts and adversarial/atypical token statistics (e.g., code-mixed inputs, highly repetitive or extremely sparse contexts) remains unknown.
- Pseudo-decode vs real decoding:
  - The proposed "pseudo-decode" evaluation is a proxy; its quantitative correlation with true online decoding outcomes (under stochastic sampling, branching, and beam search) is not established.
  - Impact under test-time scaling with many parallel branches (self-consistency / tree-of-thought) on both quality and throughput is unmeasured.
- Rate-distortion and precision trade-offs:
  - No full rate-distortion curves across bitwidths (1-4 bits) or mixed-precision schedules; results focus mainly on 2-bit. How does KVarN compare at other operating points?
  - Comparative fairness and robustness across baselines with matched memory budgets (e.g., TurboQuant config that doesn't leave layers unquantized) is not exhaustively studied.
- Sensitivity to implementation hyperparameters:
  - Effect of group size (G), sink tokens (S), and trailing unquantized tokens (R) on quality/latency/memory is not systematically explored (e.g., G ∈ {32, 64, 256}, S minimal viable value).
  - Convergence behavior of the iterative variance normalization: minimum iterations needed, adaptive stopping criteria, and numerical stability across models/layers.
  - Influence of scale/zeropoint precision (e.g., FP8 format variants E4M3 vs E5M2, FP16 vs INT8) and quantization of auxiliary parameters on end quality.
- Theoretical understanding and guarantees:
  - Formal analysis linking per-token magnitude errors to perturbations of attention logits/softmax and to downstream task degradation is largely empirical; theoretical bounds on error accumulation over layers/time are absent.
  - Conditions under which magnitude errors dominate directional errors (model- and data-dependent) are not characterized.
- V-cache behavior and per-head heterogeneity:
  - The paper emphasizes K-magnitude errors; a deeper analysis of V quantization errors, their accumulation, and head-wise/layer-wise heterogeneity in sensitivity is missing.
  - Potential for per-head or per-layer adaptive scaling/bit allocation to target especially sensitive heads is unexplored.
- Alternative transforms and normalizations:
  - Comparison with learned/structured rotations (e.g., learned orthogonal, butterfly) or token-axis transforms beyond Hadamard is not provided.
  - Exploration of alternative normalization objectives (e.g., Lp normalization, whitening, robust/Huber scaling, kurtosis control) and their impact on outliers is absent.
  - Online re-normalization of previously stored cache segments (periodic re-scaling of older tokens) as a means to control drift/accumulation is not investigated.
- Interaction with other compression strategies:
  - Empirical combinations with eviction/token-merging methods (H2O, SnapKV, PyramidKV, KVZip, CaM, D2O) are not reported, despite orthogonality; best-practice recipes to co-tune parameters across methods remain open.
  - Compatibility with train-time KV compression methods (e.g., MLA) and their quantization dynamics is unclear.
- Practical deployment and systems aspects:
  - End-to-end serving throughput/latency under realistic loads (multi-request batching, paged attention, heterogeneous sequence lengths) with 2-bit KV caches is not measured due to framework support gaps.
  - Memory-compute trade-offs at scale (e.g., GPU concurrency limits, cache eviction policies, NUMA/CPU-offload scenarios) and the impact of extra dequantization scale on kernels are not benchmarked across hardware.
- Decoding behavior and calibration:
  - Effects on probability calibration, log-likelihood/perplexity, and decoding stability (e.g., repetition, exposure bias) are not reported.
  - Influence of stochastic vs deterministic rounding schemes and of temperature/top-p settings on error accumulation and quality is not analyzed.
- Reproducibility and comparability details:
  - Strong baselines may rely on community implementations (e.g., TurboQuant with unquantized layers), complicating apples-to-apples comparisons; standardized evaluation with controlled budgets is needed.
  - Public datasets used for error diagnostics (e.g., wikitext-2 subsets) may not capture distributions encountered in long-horizon reasoning; broader diagnostics could improve external validity.
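As a starting point for the missing rate-distortion curves, a sweep over bitwidths can be sketched as below. This is an assumption-laden illustration: it uses plain per-row min/max round-to-nearest quantization on synthetic Gaussian data, with a hypothetical helper `rtn_mse`, so it says nothing about how KVarN itself behaves at 1, 3, or 4 bits.

```python
import numpy as np

def rtn_mse(x, bits):
    """Mean squared error of per-row min/max round-to-nearest quantization at `bits` bits."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return float(np.mean((codes * scale + lo - x) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 128))    # synthetic stand-in, not real KV-cache data
curve = {bits: rtn_mse(x, bits) for bits in (1, 2, 3, 4)}   # distortion at each rate
```

Replacing `rtn_mse` with the quantizer under study, and the synthetic matrix with captured K or V blocks, would give one point per bitwidth on a rate-distortion curve.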
Practical Applications
Immediate Applications
The following applications can be deployed now by integrating KVarN's variance-normalized KV-cache quantization (Hadamard rotation + dual-axis scaling) into existing transformer-based LLM inference pipelines. These leverage demonstrated near-FP16 quality at ~2.3 bits/element and negligible runtime overhead (~0.18% for normalization; ~1% for dual-scale dequantization).
- Cloud LLM serving and APIs (software, cloud)
  - Use case: Increase context length, number of parallel chains-of-thought, or request concurrency at the same GPU memory budget by storing KV caches at ~2.3 bits/elem with minimal quality loss (AIME24, MATH500, HumanEval, IFEval).
  - Tools/workflows: vLLM implementation of KVarN; memory planners that set sink/body/trailing-token partitions (e.g., S=128, G=128, R=128); attention kernels that apply per-head Hadamard rotation and dual scaling; "pseudo-decode" evaluation to qualify models for production.
  - Assumptions/dependencies: Transformer models with standard KV caches; ability to add Hadamard transforms (some absorbed into weights) and dual scales per group; serving stack with INT2 storage support (custom or patched kernels); performance overhead remains negligible on target hardware.
- Test-time scaling for reasoning agents (software, robotics, automation)
  - Use case: Run more branches/rollouts (self-consistency, tree/graph/forest-of-thought) and longer chains without OOM, reducing error accumulation across timesteps in long-horizon decoding.
  - Tools/workflows: Agent orchestration that budgets memory based on quantized KV; schedule KVarN per layer/head; combine with parallel decoding or speculative strategies; pseudo-decode regression tests.
  - Assumptions/dependencies: Reasoning workloads produce long decode phases; agent stack tolerates the small dequant overhead; quality is monitored with accumulation-aware metrics.
- RAG and long-context retrieval/chat (enterprise, education, healthcare, finance)
  - Use case: Fit more retrieved passages, longer documents, or compliance/policy contexts into prompts without increasing GPU count; improve instruction adherence (IFEval) and long-context retrieval accuracy.
  - Tools/workflows: Retriever re-tuning to exploit larger effective context; context packaging policies that assume 2-bit KV cache; service-level objectives updated for longer prompts.
  - Assumptions/dependencies: Data and governance teams validate that accuracy is near-FP16 for target domains; serving framework supports modified attention kernels; privacy/security controls unchanged.
- IDE/code assistants and CI bots (software engineering)
  - Use case: Maintain near-FP16 HumanEval performance while expanding in-file and repository context windows (e.g., larger diffs, more files, longer code histories).
  - Tools/workflows: IDE plugins or CI runners backed by a KVarN-enabled model endpoint; memory budgeting that scales with file count.
  - Assumptions/dependencies: Endpoint integrates KVarN; code tasks benefit from longer decode and context rather than only prefill.
- On-device and edge assistants (mobile, embedded, robotics)
  - Use case: Enable longer conversations and task contexts on memory-limited devices by compressing KV cache to 2-bit; reduce DRAM traffic for better latency/energy in long interactions.
  - Tools/workflows: Mobile/edge inference stacks (e.g., custom runtimes) with KVarN kernels; per-chunk (e.g., 128 tokens) online variance normalization.
  - Assumptions/dependencies: Device NPUs/GPUs support fast Hadamard and scaled dequant; storage at 2-bit and FP8/FP16 scales is feasible; models are transformer-based with KV caches.
- Academic and benchmarking practice (academia)
  - Use case: Adopt "pseudo-decode" evaluation to measure KV error accumulation; use KVarN as a state-of-the-art baseline for KV-cache quantization studies; analyze outlier/magnitude-driven error contributions.
  - Tools/workflows: Open-source vLLM implementation; ablation harnesses that track magnitude vs. directional error, and quantile contributions to end-to-end metrics (e.g., KL divergence).
  - Assumptions/dependencies: Community benchmarks incorporate decode-accumulation settings; reproducibility via public code/models.
- Cost and energy optimization initiatives (industry policy/ops)
  - Use case: Reduce GPU memory footprint and associated memory-bandwidth energy during long decode; lower cost per token for reasoning-heavy workloads.
  - Tools/workflows: Capacity planning that assumes ~7× KV compression (16b → ~2.3b); energy reporting that attributes savings to reduced KV traffic.
  - Assumptions/dependencies: Realized energy/cost savings depend on workload mix and memory boundness; organizational acceptance of low-bit caches for production.
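A capacity-planning estimate of the kind mentioned above could look like the following sketch. The function `kv_cache_bytes` and the model shape are hypothetical; the sketch assumes the first S=128 and last R=128 tokens stay in FP16 and that the ~2.3 bits/element figure applies only to the remaining body tokens, which the source does not spell out.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   body_bits=2.3, sink=128, trailing=128, fp_bits=16):
    """Estimate the KV-cache size in bytes for one sequence.

    The first `sink` and last `trailing` tokens stay at `fp_bits`; the remaining
    body tokens are stored at `body_bits` per element (scales included).
    """
    full_precision_tokens = min(seq_len, sink + trailing)
    body_tokens = seq_len - full_precision_tokens
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim   # factor 2: keys and values
    total_bits = elems_per_token * (full_precision_tokens * fp_bits + body_tokens * body_bits)
    return total_bits / 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
fp16 = kv_cache_bytes(seq_len=32768, body_bits=16, **shape)   # 4 GiB at full FP16
low_bit = kv_cache_bytes(seq_len=32768, **shape)
ratio = fp16 / low_bit
```

Under these assumptions the ratio comes out slightly below 16/2.3 ≈ 7, because the full-precision sink and trailing tokens take a fixed share of the budget. The gap shrinks as the sequence gets longer.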
Long-Term Applications
These opportunities require broader ecosystem support, further research, or integration work (e.g., kernel/driver support, training-time co-design, and cross-architecture generalization).
- Native 2-bit KV-cache support across serving stacks and hardware (software, hardware)
  - Use case: Standardize INT2 storage formats/kernels in CUDA/ROCm, TensorRT-LLM, TGI, and other frameworks; hardware-friendly Hadamard ops and dual-scale dequantization.
  - Potential products: Turnkey "low-bit KV" mode in major serving frameworks; accelerator libraries with fused Hadamard+dequant kernels.
  - Assumptions/dependencies: Vendor buy-in; kernel/compiler optimizations; broad testing across GPUs/NPUs.
- Precision scheduling and hybrid compression (software)
  - Use case: Dynamically combine KVarN with eviction/token-merging (SnapKV, PyramidKV, H2O, KVZip, CaM, D2O) for >10× effective context extension under tight memory.
  - Potential workflows: Layer-wise or time-varying precision schedules; mixed K/V precision tuned to attention patterns and error accumulation.
  - Assumptions/dependencies: Robust controllers to avoid quality cliffs; telemetry to detect outlier errors in-flight.
- Training-time co-design (academia, foundation model providers)
  - Use case: Quantization-aware finetuning or pretraining that internalizes Hadamard/dual-scaling structure; learned rotations; layers regularized for magnitude stability to suppress outliers.
  - Potential products: "KV-quant-ready" checkpoints; adapters that minimize magnitude-driven outliers.
  - Assumptions/dependencies: Access to training pipelines and data; compute budget; evidence that train-time changes further reduce accumulation.
- Extension to newer attention variants and architectures (software, robotics, multimodal)
  - Use case: Adapt KVarN-like variance normalization to KV caches in vision transformers, video-LLMs, and agentic/robotic planners with long-horizon memory; explore compatibility with MLA and other train-time KV compression schemes.
  - Potential tools: Cross-modal cache compressors; robot stacks with long-memory budgets.
  - Assumptions/dependencies: Similar cache structures are present; interactions with MLA or alternative cache designs are favorable.
- Ultra-long-context pipelines and "elastic context" services (cloud)
  - Use case: Offer context-on-demand (e.g., 1-10M tokens) by layering KVarN with eviction/merging and segment-wise rolling caches; intelligent paging across GPUs/CPUs.
  - Potential products: "Elastic context servers" that trade precision vs. length at runtime; SLAs tied to decode-accumulation metrics.
  - Assumptions/dependencies: Stable algorithms for very long horizons; robust paging and failure handling.
- Standardized evaluation and compliance frameworks (policy, industry consortia)
  - Use case: Create decode-accumulation-aware benchmarks and certification for low-bit KV caches (akin to MLPerf segments) to ensure reliability in regulated sectors (healthcare, finance, gov).
  - Potential workflows: Compliance reports that include outlier-error analyses; standardized pseudo-decode protocols.
  - Assumptions/dependencies: Community consensus on metrics and task suites; third-party evaluation infrastructure.
- Broader activation compression (software, hardware co-design)
  - Use case: Extend dual-axis variance normalization to other activations beyond KV (e.g., intermediate attention outputs) where accumulation matters, with low overhead.
  - Potential products: General "variance-normalized activation compression" libraries; fused kernels in accelerators.
  - Assumptions/dependencies: Careful QoS validation to avoid compounding errors; workload-specific tuning.
- Privacy-preserving on-prem and offline reasoning (enterprise, public sector)
  - Use case: Maintain FP16-like reliability for confidential, long-context tasks (e.g., legal discovery, medical case summaries) on fewer or smaller GPUs by compressing KV caches.
  - Potential products: Secure, memory-optimized on-prem solutions; offline laptop/edge deployments for long-form analytics.
  - Assumptions/dependencies: Validation on domain data; governance sign-off; hardware/kernel readiness for 2-bit KV at scale.
Glossary
- AIME24: A competition-level mathematics benchmark used to evaluate long-form reasoning in LLMs. "including MATH500, AIME24 and HumanEval, at 2-bit precision."
- attention logits: The pre-softmax scores produced by attention mechanisms, determining how much each token attends to others. "impacts the attention logits."
- attention sink: Special initial tokens retained at high precision to stabilize attention during decoding. "S sink tokens in FP16 (preserving attention sink behavior)"
- autoregressive decoding: Token-by-token generation where each new token conditions on previously generated tokens. "errors behave differently under autoregressive decoding."
- Bernoulli-based masking: A stochastic masking strategy where elements are dropped/kept according to Bernoulli trials. "applies a Bernoulli-based masking process to merge value states within the KV cache during long-sequence generation."
- calibration-free: A quantization approach that does not require calibration data or activation samples. "We introduce KVarN, a calibration-free KV-cache quantizer"
- codebook-based methods: Quantization methods that represent data using indices into a learned set of prototype vectors (a codebook). "codebook-based methods like TurboQuant [33]"
- dual-scaling: Applying separate scaling along two tensor axes (e.g., token and channel) to normalize variances. "We identify dual-scaling with Sinkhorn-based variance-normalization as an effective approach to further mitigate token scaling errors."
- Hadamard rotation: Rotating data using a Hadamard matrix to distribute energy and reduce outliers before quantization. "applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices."
- Hadamard transform: An orthogonal, fast transform with O(N log N) complexity used to decorrelate and Gaussianize data. "The Hadamard transform is fast enough for online application (with O(N log N) complexity)"
- incoherence processing: Preprocessing (e.g., rotations) that makes signals more uniformly distributed to improve quantization. "This is often referred to as incoherence processing."
- Johnson-Lindenstrauss transform: A random projection that approximately preserves inner products/distances in lower dimensions. "a 1-bit quantized Johnson-Lindenstrauss transform to better preserve inner products."
- KL-divergence: A statistical measure of divergence between probability distributions used to assess output degradation. "improves end-to-end KL-divergence more than fixing the other 95%, even though more MSE lies there (see Fig. 9)."
- Kitty: A 2-bit KV-cache quantization method with dynamic channel-wise precision boosting. "Kitty [31] proposes a 2-bit quantization scheme augmented with channel-wise importance selection"
- KIVI: An asymmetric 2-bit KV-cache quantization scheme that quantizes keys per channel and values per token. "KIVI [19] has shown that with round-to-nearest quantization, it is best to quantize the V matrix per token and the K matrix per channel."
- KV-Cache: The storage of past key and value tensors used by attention during long-context inference. "the efficient handling of the KV-Cache is becoming more and more relevant to achieve optimal reasoning per time trade-offs."
- KV-Cache quantization: Compressing the KV-cache to low-bit representations to reduce memory while preserving performance. "One avenue of attack on this memory bottleneck is KV-Cache quantization."
- KVQuant: A KV-cache quantization approach using non-uniform quantization and outlier-aware retention. "KVQuant [9] introduces non-uniform quantization combined with outlier-aware handling"
- kurtosis: A measure of tail heaviness of a distribution; high kurtosis indicates more extreme outliers. "increase the per-channel kurtosis."
- line-retrieval: A synthetic long-context task where a model must retrieve a code from a specified line in a long list. "We give comprehensive results on line-retrieval for various baselines and models, see Tab. 4."
- MATH500: A benchmark of 500 math problems requiring multi-step derivations and exact solutions. "MATH500 is a mathematical reasoning benchmark requiring models to formulate complex derivations and output mathematical solutions."
- outliers: Rare, extreme-value entries that disproportionately affect quantization error and end-to-end performance. "In this sense we can say that fixing outliers is disproportionally important."
- pseudo-decode: An offline evaluation protocol that quantizes the KV-cache in blocks to simulate error accumulation during decoding. "We call this the 'pseudo-decode' setting."
- RoPE (Rotary Positional Embedding): A positional encoding method that rotates token representations to encode positions. "after the RoPE-embedding."
- round-to-nearest (RTN): A scalar quantization rule mapping values to the nearest discrete level. "Finally, it is quantized with round-to-nearest (RTN)."
- Sinkhorn-Knopp normalization: An iterative matrix scaling method used here to equalize variances across rows and columns. "variance-targeted Sinkhorn-Knopp-style normalization."
- test-time scaling: Improving model performance by allocating more compute or generating longer sequences during inference. "Test-time scaling is a powerful approach to obtain better reasoning in LLMs"
- token merging: Compression by combining multiple tokens into fewer, more informative representations. "token merging reduces the number of tokens by combining them into a smaller set of informative representation."
- TurboQuant: An online vector quantization method using random rotations and residual corrections for KV-cache compression. "TurboQuant [33] frames KV cache compression as a near-optimal vector quantization problem."
- Uniform Precision (UP): A setting where all elements of a tensor share the same numerical precision, as opposed to mixed precision. "UP (Uniform Precision) indicates whether all elements within a given K or V tensor share the same precision (V) or use mixed precision (x)."
- variance normalization: Scaling to equalize variance along specified axes to reduce magnitude-related quantization errors. "Variance normalization prevents the rounding process from scaling the norm of worst-case tokens;"
- vector quantization: Representing vectors by nearest entries from a finite codebook to reduce storage. "near-optimal vector quantization problem."
- zeropoint: The offset parameter used with a scale to map integers to real values in affine quantization. "an offset (or zeropoint)"
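To illustrate the O(N log N) claim in the Hadamard transform entry, here is a minimal sketch of the standard fast Walsh-Hadamard butterfly. The function name `fwht` is chosen for this example; it is a textbook version in NumPy, not the kernel used in the KVarN implementation.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k vector in O(N log N)."""
    y = np.asarray(x, dtype=np.float64).copy()
    n = y.shape[0]
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:                       # log2(n) passes over the data
        y = y.reshape(-1, 2, h)        # pair up neighbouring half-blocks of size h
        a, b = y[:, 0, :].copy(), y[:, 1, :].copy()
        y[:, 0, :], y[:, 1, :] = a + b, a - b   # butterfly: sums and differences
        y = y.reshape(n)
        h *= 2
    return y / np.sqrt(n)              # scaling makes the transform orthonormal

spike = np.zeros(8)
spike[0] = 8.0                         # all energy in one channel
spread = fwht(spike)                   # same energy, equal share in every channel
restored = fwht(spread)                # the orthonormal transform is its own inverse
```

The single spike becomes eight equal entries with the same total energy, which is the "spreading" effect that makes outlier channels easier to quantize.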