KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Published 2 Jun 2026 in cs.LG | (2606.03458v1)

Abstract: Test-time scaling is a powerful approach to obtain better reasoning in LLMs, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state-of-the-art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN

Summary

  • The paper introduces a calibration-free KV-cache quantization method using channel-wise Hadamard rotation and dual-dimension variance normalization.
  • It demonstrates state-of-the-art performance on reasoning and code synthesis benchmarks with significant improvements over previous techniques.
  • The approach mitigates error accumulation in autoregressive decoding with minimal compute and memory overhead, enabling efficient long-context inference.

Introduction and Problem Setting

The expansion of long-context inference and test-time scaling in contemporary LLMs has amplified the need for efficient KV-cache management, with memory bottlenecks limiting practical sequence length and throughput. While prior work in KV-cache quantization (e.g., KIVI, TurboQuant, Kitty) has demonstrated competitive compression at low bit-widths, these solutions are primarily evaluated under static, prefill-oriented regimes, neglecting the autoregressive dynamics of real decoding. In such dynamic settings, quantization errors accumulate across decoding steps, resulting in pronounced performance degradation, especially for reasoning tasks with deep computational chains. The paper argues that standard quantization schemes inadequately preserve per-token scaling, leading to magnitude outlier errors that disproportionately impact downstream model quality.

Methodology

KVarN addresses the error accumulation phenomenon through a calibration-free quantization pipeline centered on two core operations:

  1. Channel-wise Hadamard Rotation: By applying a Hadamard transform along the channel dimension, the technique achieves incoherence processing, inducing Gaussianity and attenuating channel space outlier effects. This operation is computationally efficient (O(N log N)) and amenable to online inference scenarios.
  2. Dual-Dimension Variance Normalization: After rotation, an iterative Sinkhorn-style normalization scheme is applied to both channel and token dimensions, balancing the variance across rows and columns. This process explicitly corrects per-token scaling errors, identified as the principal driver of end-to-end degradation in quantized KV-caches.
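
The O(N log N) cost quoted for the rotation comes from the butterfly structure of the fast Walsh-Hadamard transform. The following is a minimal NumPy sketch of that transform, assuming the channel dimension is a power of two and using orthonormal scaling; the function name `fwht` is hypothetical and this is not taken from the KVarN code base.

```python
import numpy as np


def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (orthonormal scaling).

    Illustrative sketch only: assumes the last dimension is a power of two.
    """
    y = np.array(x, dtype=np.float64)
    n = y.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "last dimension must be a power of two"
    lead = y.shape[:-1]
    h = 1
    while h < n:
        # Split every block of size 2h into two halves and apply the butterfly.
        y = y.reshape(-1, n // (2 * h), 2, h)
        a = y[:, :, 0, :] + y[:, :, 1, :]
        b = y[:, :, 0, :] - y[:, :, 1, :]
        y = np.stack([a, b], axis=2)
        h *= 2
    return y.reshape(*lead, n) / np.sqrt(n)


# A single "spiky" channel is spread evenly over all channels.
spike = np.zeros(64)
spike[3] = 8.0
rotated = fwht(spike)
```

Because the orthonormal transform is its own inverse, applying `fwht` twice returns the input, and vector norms are unchanged. That is what allows a rotation to be undone (or absorbed elsewhere) without changing attention scores.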

These transformations precede a round-to-nearest quantization step, and the representation retains two scaling vectors (per channel and per token) and a zero-point for accurate dequantization. KVarN's design ensures minimal extra compute: normalization overhead (<0.2% per 128-token chunk) and a negligible increase in dequantization cost (<1.4% over single-scale baselines with scale fusion).
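
The pipeline described above (rotate, normalize along both axes, round to nearest, keep two scale vectors and a zero-point) can be sketched end to end in NumPy. This is an illustrative reading, not the authors' implementation: the function names, the use of standard deviation as the balanced statistic, the iteration count, the fixed 4-level grid with unit step, and the scalar zero-point are all assumptions made for the example.

```python
import numpy as np


def hadamard_matrix(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def dual_variance_normalize(x, iters=8, eps=1e-8):
    """Alternately rescale rows (tokens) and columns (channels) toward unit std.

    Returns y, row_scale, col_scale such that x == y * row_scale * col_scale.
    The iteration count is an assumption, not a value from the paper.
    """
    y = np.array(x, dtype=np.float64)
    row_scale = np.ones((y.shape[0], 1))
    col_scale = np.ones((1, y.shape[1]))
    for _ in range(iters):
        r = y.std(axis=1, keepdims=True) + eps
        y /= r
        row_scale *= r
        c = y.std(axis=0, keepdims=True) + eps
        y /= c
        col_scale *= c
    return y, row_scale, col_scale


def rtn_quantize(y, bits=2, step=1.0):
    """Round-to-nearest onto a uniform grid centred on the group mean.

    The scalar zero-point and the fixed step are illustrative assumptions.
    """
    levels = 2 ** bits
    zero = float(y.mean())
    half = (levels - 1) / 2.0
    codes = np.clip(np.round((y - zero) / step + half), 0, levels - 1)
    return codes.astype(np.uint8), zero


def rtn_dequantize(codes, zero, bits=2, step=1.0):
    """Map integer codes back to grid values."""
    half = (2 ** bits - 1) / 2.0
    return (codes.astype(np.float64) - half) * step + zero


def kvarn_like_roundtrip(x, bits=2):
    """Quantize and dequantize one (tokens x channels) group."""
    h = hadamard_matrix(x.shape[1])
    y, row_scale, col_scale = dual_variance_normalize(x @ h)
    codes, zero = rtn_quantize(y, bits)
    x_rot_hat = rtn_dequantize(codes, zero, bits) * row_scale * col_scale
    return x_rot_hat @ h.T  # undo the rotation
```

Calling `kvarn_like_roundtrip` on a 128-token group in which one token is much larger than the rest shows the point of the per-token scale: the large token gets its own scale rather than forcing a coarse grid onto every other token in the group.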

Key Empirical Findings

Error Accumulation Analysis

KVarN departs from the conventional static evaluation by simulating the pseudo-decode regime, where fresh KV-cache outputs are quantized at each block during generation, mimicking true inference mode and capturing cumulative error effects. Rigorous decomposition of quantization error into magnitude and directional components reveals that magnitude errors, primarily from mis-scaled tokens, dominate the outlier regimes most detrimental to performance. Experimental interventions replacing only the worst 5% of quantized vectors with high-precision alternatives yield major improvements in KL divergence, even though those vectors account for a minority of the overall MSE, which illustrates the non-uniform importance of quantization errors in sequence models.
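
One way to make the magnitude/direction split concrete is to compare, per token, the relative change in vector norm against the cosine distance between the original and quantized vectors. The sketch below assumes these two definitions; the paper's exact formulas are not reproduced in this summary, and the function names are hypothetical. The second helper mimics the "fix the worst 5%" intervention by restoring the highest-scoring tokens to full precision.

```python
import numpy as np


def magnitude_direction_error(x, x_hat, eps=1e-12):
    """Per-token relative norm error and cosine-distance error.

    x, x_hat: arrays of shape (tokens, channels).
    """
    nx = np.linalg.norm(x, axis=1)
    nh = np.linalg.norm(x_hat, axis=1)
    magnitude = np.abs(nh - nx) / (nx + eps)
    cosine = np.sum(x * x_hat, axis=1) / (nx * nh + eps)
    return magnitude, 1.0 - cosine


def restore_worst(x, x_hat, scores, fraction=0.05):
    """Replace the worst-scoring fraction of tokens in x_hat with rows of x."""
    k = max(1, int(np.ceil(fraction * scores.size)))
    worst = np.argsort(scores)[-k:]
    fixed = x_hat.copy()
    fixed[worst] = x[worst]
    return fixed


x = np.array([[3.0, 4.0], [1.0, 0.0]])
x_hat = np.array([[6.0, 8.0], [0.0, 1.0]])
mag, direc = magnitude_direction_error(x, x_hat)
# Token 0 has a pure scale error (norm doubled, direction intact);
# token 1 has a pure direction error (norm intact, rotated by 90 degrees).
```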

Strong Numerical Performance

On a comprehensive suite of benchmarks (MATH500 and AIME24 for math reasoning, HumanEval for code synthesis, IF-Eval for instruction following, and line retrieval), KVarN consistently matches or outperforms all prior 2-4 bit quantization baselines at 2.3 average bits per element, often with near-lossless accuracy relative to FP16:

  • MATH500 (Phi-4-14B): 84.8% KVarN vs. 77.0% TurboQuant and 74.4% KIVI at 2.3/4.5/2.3 bits respectively.
  • AIME24 (Phi-4-14B): 61.7% KVarN vs. 60.0% PolarQuant and 57.8% KIVI.
  • HumanEval (Qwen3-4B): 88.4% KVarN vs. 86.2% TurboQuant and 86.4% KIVI.
  • Instruction-Following (Qwen3-4B, Strict): 80.4% KVarN vs. 80.3% KIVI and 79.2% TurboQuant.

Crucially, in tasks sensitive to error accumulation over hundreds to thousands of generated tokens, KVarN exhibits a notably lower performance drop relative to unquantized baselines, establishing its robustness for deep-reasoning and long-sequence settings.

Theoretical and Implementation Implications

KVarN's experimental evidence supports the hypothesis that targeting quantization outliers, rather than optimizing for mean-squared error uniformly, yields substantially improved autoregressive performance. The combination of Hadamard rotation (to distribute energy and curb channel outliers) and dual-variance normalization (to counteract row/column scale drift) achieves a synergistic effect, containing magnitude errors even in the distribution tails. This sharply contrasts with prior methods, such as single-dimension scaling, codebook quantization, or channel-importance approaches, that focus on structural or directional distortions while neglecting token-wise norm preservation.

From an implementation perspective, the dual-scaling mechanism incurs minimal extra memory (due to scale/offset storage amortized over per-group quantization) and virtually no latency penalty when fusing scaling into the main kernel. The approach is orthogonal to and compatible with token-merging and cache-eviction strategies.
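
A plausible reading of "fusing scaling into the main kernel" is that the per-channel scale can be folded into the query once, and the per-token scale applied to the resulting logits, so the full-precision K matrix never needs to be materialized. The sketch below checks that algebraic identity numerically; it is an assumption about how fusion could work, not a description of the KVarN kernel, and both function names are hypothetical.

```python
import numpy as np


def logits_reference(q, k_codes, tok_scale, ch_scale):
    """Dequantize K fully, then take dot products with the query."""
    k = k_codes * tok_scale[:, None] * ch_scale[None, :]
    return k @ q


def logits_fused(q, k_codes, tok_scale, ch_scale):
    """Fold the channel scale into q once; apply the token scale to the logits."""
    return (k_codes @ (q * ch_scale)) * tok_scale


rng = np.random.default_rng(0)
tokens, channels = 6, 8
q = rng.standard_normal(channels)
# Grid values of a 2-bit code (hypothetical levels, already zero-point corrected).
k_codes = rng.integers(0, 4, size=(tokens, channels)).astype(np.float64) - 1.5
tok_scale = rng.uniform(0.5, 2.0, size=tokens)
ch_scale = rng.uniform(0.5, 2.0, size=channels)
same = np.allclose(
    logits_reference(q, k_codes, tok_scale, ch_scale),
    logits_fused(q, k_codes, tok_scale, ch_scale),
)
```

In the fused form the extra cost is one elementwise multiply on the query and one on the logits per group, which is consistent with the small dequantization overhead reported above.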

Broader Impact and Future Directions

KVarN's direct mitigation of error accumulation opens the door for practical, aggressive KV-cache quantization in production LLM deployment, enabling roughly 7× reductions in KV-cache memory footprint (16 bits down to about 2.3 bits per element) without significant loss in accuracy or reliability for complex reasoning and instruction-following. This unlocks higher-throughput and lower-latency inference, especially in memory-constrained environments or edge applications.

Theoretically, the decomposition of quantization error and empirical finding that outliers drive end-task degradation challenge prevailing approaches focused on average or uniform error minimization, suggesting new criteria for model-centric compression.

Future avenues include:

  • Extending dual scaling and variance normalization to structurally different architectures (SSMs, models without standard KV-caches), though the paper already notes limitations for non-transformer models,
  • Investigating end-to-end joint train-time quantization-aware fine-tuning with KVarN,
  • Combining KVarN with adaptive precision or hybrid eviction strategies,
  • Analyzing token- or task-specific scaling schedules to further minimize error accumulation under diverse generative workloads.

Conclusion

KVarN introduces a variance-normalized, outlier-aware quantization scheme for transformer KV-caches that effectively limits error accumulation in long-context, autoregressive decoding. By explicitly targeting token scale deviations, using a blend of incoherence processing and dual-axis normalization, KVarN achieves state-of-the-art compression/accuracy tradeoffs, particularly in deep-reasoning and long-sequence benchmarks, and does so with negligible compute and memory overhead. The technique sets new baselines for practical efficient LLM deployment and suggests new priorities for future model compression research.

Explain it Like I'm 14

KVarN: Making AI "Memory" Smaller without Messing Up Its Thinking

Overview (What this paper is about)

This paper is about a way to shrink the "memory" that LLMs use while they are thinking and writing long answers. That memory is called the KV-cache. The new method, called KVarN, compresses this memory very aggressively (down to about 1/7 the usual size) but keeps the model's reasoning quality almost the same. It especially helps on long, step-by-step problems like math, coding, and following detailed instructions.

Key Questions the Paper Tries to Answer

  • Can we compress the KV-cache to save lots of memory during long generations, without making the model's answers worse?
  • Why do current compression methods break down when the model generates text step by step (instead of reading a long prompt all at once)?
  • What kinds of compression mistakes hurt the most, and how can we prevent them?

How They Did It (Methods, in simple terms)

First, a quick idea of terms:

  • KV-cache: A model's short-term memory while it generates text. It stores "keys" (K) and "values" (V) that help the model pay attention to earlier words.
  • Quantization: Squeezing numbers into fewer bits (like turning a full-color photo into a small file with fewer colors). It saves memory but can add small errors.

What's the problem?

  • Many methods are tested only when the model reads a long prompt once ("prefill"), not when it generates text step by step.
  • During generation, errors can pile up over time. Think of photocopying a photocopy again and again: small mistakes accumulate.

What kind of error is worst?

  • Each token (word piece) has a "scale" or "loudness" (its overall size). The paper shows that the biggest mistakes come from getting this scale wrong for some tokens, not from tiny twists in direction. These "scale" mistakes are rare but very harmful.

What's KVarN's trick? KVarN uses two simple moves that work together:

  1. Hadamard rotation: Imagine shuffling and spreading information evenly across channels so no single channel is too "spiky." This makes the data easier to compress.
  2. Variance normalization in two directions (dual-scaling): Imagine turning volume knobs so every row (per token) and every column (per channel) has a balanced volume. This keeps each token's "loudness" right, which prevents those harmful scale mistakes.

How do they test it?

  • Besides standard tests, they introduce a "pseudo-decode" setup that mimics real generation: after every small block of tokens (e.g., 128), they compress the KV-cache and then keep going. This shows how errors actually accumulate over time during long reasoning.
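
The block-by-block idea can be sketched as a small loop: each block is processed while looking at the already-compressed history, and its own fresh cache entries are compressed before the next block sees them. The code below is a toy illustration under stated assumptions. `toy_forward` is a hypothetical stand-in for a real model step, and rounding to integers stands in for a real quantizer; neither comes from the paper.

```python
import numpy as np


def pseudo_decode(blocks, forward_block, quantize_dequantize):
    """Process blocks in order, quantizing each block's fresh KV before reuse."""
    cache, outputs = [], []
    for block in blocks:
        out, kv = forward_block(block, cache)  # sees the quantized history
        outputs.append(out)
        cache.append(quantize_dequantize(kv))  # errors enter the cache here
    return outputs, cache


def toy_forward(block, cache):
    """Hypothetical model step: new KV depends on the mean of the cached KV."""
    if cache:
        context = np.concatenate(cache).mean(axis=0)
    else:
        context = np.zeros(block.shape[1])
    kv = block + 0.5 * context
    return float(kv.sum()), kv


blocks = [np.full((2, 3), 0.4), np.full((2, 3), 0.4)]
exact_out, _ = pseudo_decode(blocks, toy_forward, lambda kv: kv)
quant_out, _ = pseudo_decode(blocks, toy_forward, np.round)
# The first block's output is identical in both runs; the second differs,
# because it was computed from a cache that already contained rounding error.
```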

Main Findings (What they discovered and why it matters)

  • Scale mistakes dominate the worst errors: The most damaging errors come from getting a token's overall size wrong. Fixing these outliers helps more than reducing lots of small, harmless errors.
  • KVarN reduces error accumulation: In the realistic, step-by-step setting, KVarN keeps errors from piling up as much as before.
  • Strong results at very low precision (about 2.3 bits per value): On tough generative benchmarks (MATH500, AIME24, HumanEval, IFEval), KVarN matches or beats previous methods while using roughly 1/7 the memory of standard 16-bit storage.
  • Tiny speed cost: The normalization step adds under 0.2% overhead per 128-token chunk, and using two scales adds roughly 1% to the dequantization cost, both negligible compared to generation time.

Here are the kinds of tasks where KVarN shines:

  • MATH500 and AIME24: Competition-style math that needs long chains of thought.
  • HumanEval: Writing correct Python code from problem descriptions.
  • IFEval: Following strict instructions (like exact formats and length limits).

Why This Matters (Implications)

  • Longer, better thinking within the same memory: With KVarN, models can handle longer reasoning without running out of memory or slowing down too much.
  • Cheaper deployment: Using less memory means running strong models on smaller or fewer GPUs, making advanced reasoning more accessible.
  • More reliable long answers: By preventing error build-up, KVarN helps models stay accurate over long, step-by-step solutions (like solving multi-step math or writing multi-function code).
  • Plays well with real systems: The method fits into popular serving frameworks (they even provide a vLLM implementation), making it practical to adopt.

In one sentence

KVarN keeps the model's per-token "volume" correct while compressing, which stops small mistakes from snowballing, letting LLMs think longer and better using much less memory.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains uncertain or unexplored in the paper, phrased to guide concrete future research.

  • Generalization to larger and diverse architectures:
    • Does KVarN's advantage persist on substantially larger dense models (e.g., 34B–70B+) and on MoE models, grouped/multi-query attention, or sliding-window/local attention variants?
    • How does KVarN interact with architectural features such as ALiBi, different RoPE variants, cross-attention (for multi-modal models), or architectures without a KV cache (SSMs), beyond the brief limitation note?
  • Robustness across tasks and data distributions:
    • Performance under long-form generation beyond reasoning/coding (e.g., summarization, long conversational dialogs, RAG-heavy workflows, multilingual settings) is not evaluated.
    • Sensitivity to domain shifts and adversarial/atypical token statistics (e.g., code-mixed inputs, highly repetitive or extremely sparse contexts) remains unknown.
  • Pseudo-decode vs real decoding:
    • The proposed "pseudo-decode" evaluation is a proxy; its quantitative correlation with true online decoding outcomes (under stochastic sampling, branching, and beam search) is not established.
    • Impact under test-time scaling with many parallel branches (self-consistency / tree-of-thought) on both quality and throughput is unmeasured.
  • Rate–distortion and precision trade-offs:
    • No full rate–distortion curves across bitwidths (1–4 bits) or mixed-precision schedules; results focus mainly on 2-bit. How does KVarN compare at other operating points?
    • Comparative fairness and robustness across baselines with matched memory budgets (e.g., TurboQuant config that doesn't leave layers unquantized) is not exhaustively studied.
  • Sensitivity to implementation hyperparameters:
    • Effect of group size (G), sink tokens (S), and trailing unquantized tokens (R) on quality/latency/memory is not systematically explored (e.g., G ∈ {32, 64, 256}, S minimal viable value).
    • Convergence behavior of the iterative variance normalization: minimum iterations needed, adaptive stopping criteria, and numerical stability across models/layers.
    • Influence of scale/zeropoint precision (e.g., FP8 format variants E4M3 vs E5M2, FP16 vs INT8) and quantization of auxiliary parameters on end quality.
  • Theoretical understanding and guarantees:
    • Formal analysis linking per-token magnitude errors to perturbations of attention logits/softmax and to downstream task degradation is largely empirical; theoretical bounds on error accumulation over layers/time are absent.
    • Conditions under which magnitude errors dominate directional errors (model- and data-dependent) are not characterized.
  • V-cache behavior and per-head heterogeneity:
    • The paper emphasizes K-magnitude errors; a deeper analysis of V quantization errors, their accumulation, and head-wise/layer-wise heterogeneity in sensitivity is missing.
    • Potential for per-head or per-layer adaptive scaling/bit allocation to target especially sensitive heads is unexplored.
  • Alternative transforms and normalizations:
    • Comparison with learned/structured rotations (e.g., learned orthogonal, butterfly) or token-axis transforms beyond Hadamard is not provided.
    • Exploration of alternative normalization objectives (e.g., Lp normalization, whitening, robust/Huber scaling, kurtosis control) and their impact on outliers is absent.
    • Online re-normalization of previously stored cache segments (periodic re-scaling of older tokens) as a means to control drift/accumulation is not investigated.
  • Interaction with other compression strategies:
    • Empirical combinations with eviction/token-merging methods (H2O, SnapKV, PyramidKV, KVZip, CaM, D2O) are not reported, despite orthogonality; best-practice recipes to co-tune parameters across methods remain open.
    • Compatibility with train-time KV compression methods (e.g., MLA) and their quantization dynamics is unclear.
  • Practical deployment and systems aspects:
    • End-to-end serving throughput/latency under realistic loads (multi-request batching, paged attention, heterogeneous sequence lengths) with 2-bit KV caches is not measured due to framework support gaps.
    • Memory–compute trade-offs at scale (e.g., GPU concurrency limits, cache eviction policies, NUMA/CPU-offload scenarios) and the impact of extra dequantization scale on kernels are not benchmarked across hardware.
  • Decoding behavior and calibration:
    • Effects on probability calibration, log-likelihood/perplexity, and decoding stability (e.g., repetition, exposure bias) are not reported.
    • Influence of stochastic vs deterministic rounding schemes and of temperature/top-p settings on error accumulation and quality is not analyzed.
  • Reproducibility and comparability details:
    • Strong baselines may rely on community implementations (e.g., TurboQuant with unquantized layers), complicating apples-to-apples comparisons; standardized evaluation with controlled budgets is needed.
    • Public datasets used for error diagnostics (e.g., wikitext-2 subsets) may not capture distributions encountered in long-horizon reasoning; broader diagnostics could improve external validity.

Practical Applications

Immediate Applications

The following applications can be deployed now by integrating KVarN's variance-normalized KV-cache quantization (Hadamard rotation + dual-axis scaling) into existing transformer-based LLM inference pipelines. These leverage demonstrated near-FP16 quality at ~2.3 bits/element and negligible runtime overhead (~0.18% for normalization; ~1% for dual-scale dequantization).

  • Cloud LLM serving and APIs (software, cloud)
    • Use case: Increase context length, number of parallel chains-of-thought, or request concurrency at the same GPU memory budget by storing KV caches at ~2.3 bits/elem with minimal quality loss (AIME24, MATH500, HumanEval, IF-Eval).
    • Tools/workflows: vLLM implementation of KVarN; memory planners that set sink/body/trailing-token partitions (e.g., S=128, G=128, R=128); attention kernels that apply per-head Hadamard rotation and dual scaling; "pseudo-decode" evaluation to qualify models for production.
    • Assumptions/dependencies: Transformer models with standard KV caches; ability to add Hadamard transforms (some absorbed into weights) and dual scales per group; serving stack with INT2 storage support (custom or patched kernels); performance overhead remains negligible on target hardware.
  • Test-time scaling for reasoning agents (software, robotics, automation)
    • Use case: Run more branches/rollouts (self-consistency, tree/graph/forest-of-thought) and longer chains without OOM, reducing error accumulation across timesteps in long-horizon decoding.
    • Tools/workflows: Agent orchestration that budgets memory based on quantized KV; schedule KVarN per layer/head; combine with parallel decoding or speculative strategies; pseudo-decode regression tests.
    • Assumptions/dependencies: Reasoning workloads produce long decode phases; agent stack tolerates the small dequant overhead; quality is monitored with accumulate-aware metrics.
  • RAG and long-context retrieval/chat (enterprise, education, healthcare, finance)
    • Use case: Fit more retrieved passages, longer documents, or compliance/policy contexts into prompts without increasing GPU count; improve instruction adherence (IF-Eval) and long-context retrieval accuracy.
    • Tools/workflows: Retriever re-tuning to exploit larger effective context; context packaging policies that assume 2-bit KV cache; service-level objectives updated for longer prompts.
    • Assumptions/dependencies: Data and governance teams validate that accuracy is near-FP16 for target domains; serving framework supports modified attention kernels; privacy/security controls unchanged.
  • IDE/code assistants and CI bots (software engineering)
    • Use case: Maintain near-FP16 HumanEval performance while expanding in-file and repository context windows (e.g., larger diffs, more files, longer code histories).
    • Tools/workflows: IDE plugins or CI runners backed by a KVarN-enabled model endpoint; memory budgeting that scales with file count.
    • Assumptions/dependencies: Endpoint integrates KVarN; code tasks benefit from longer decode and context rather than only prefill.
  • On-device and edge assistants (mobile, embedded, robotics)
    • Use case: Enable longer conversations and task contexts on memory-limited devices by compressing KV cache to 2-bit; reduce DRAM traffic for better latency/energy in long interactions.
    • Tools/workflows: Mobile/edge inference stacks (e.g., custom runtimes) with KVarN kernels; per-chunk (e.g., 128 tokens) online variance normalization.
    • Assumptions/dependencies: Device NPUs/GPUs support fast Hadamard and scaled dequant; storage at 2-bit and FP8/FP16 scales is feasible; models are transformer-based with KV caches.
  • Academic and benchmarking practice (academia)
    • Use case: Adopt "pseudo-decode" evaluation to measure KV error accumulation; use KVarN as a state-of-the-art baseline for KV-cache quantization studies; analyze outlier/magnitude-driven error contributions.
    • Tools/workflows: Open-source vLLM implementation; ablation harnesses that track magnitude vs. directional error, and quantile contributions to end-to-end metrics (e.g., KL divergence).
    • Assumptions/dependencies: Community benchmarks incorporate decode-accumulation settings; reproducibility via public code/models.
  • Cost and energy optimization initiatives (industry policy/ops)
    • Use case: Reduce GPU memory footprint and associated memory-bandwidth energy during long decode; lower cost per token for reasoning-heavy workloads.
    • Tools/workflows: Capacity planning that assumes ~7× KV compression (16b→~2.3b); energy reporting that attributes savings to reduced KV traffic.
    • Assumptions/dependencies: Realized energy/cost savings depend on workload mix and memory boundness; organizational acceptance of low-bit caches for production.
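
The ~7× figure used for capacity planning follows directly from the bit widths (16 / 2.3 ≈ 6.96). A small calculator makes the arithmetic explicit. The model shape below (32 layers, 8 KV heads, head dimension 128, 32,768 tokens) is a hypothetical example chosen for round numbers, not a configuration from the paper, and the 2.3-bit average is treated as already including scale and zero-point storage.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_element):
    """Bytes needed to store K and V for one sequence."""
    elements = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # K and V
    return elements * bits_per_element / 8


# Hypothetical model shape, for illustration only.
fp16_bytes = kv_cache_bytes(32, 8, 128, 32768, 16.0)
low_bit_bytes = kv_cache_bytes(32, 8, 128, 32768, 2.3)
ratio = fp16_bytes / low_bit_bytes

print(f"FP16 cache:    {fp16_bytes / 2**30:.2f} GiB")
print(f"2.3-bit cache: {low_bit_bytes / 2**30:.2f} GiB")
print(f"compression:   {ratio:.2f}x")
```

The same budget can be read the other way: at a fixed memory allowance, the ratio is roughly the factor by which context length or the number of concurrent sequences could grow, subject to the workload caveats listed above.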

Long-Term Applications

These opportunities require broader ecosystem support, further research, or integration work (e.g., kernel/driver support, training-time co-design, and cross-architecture generalization).

  • Native 2-bit KV-cache support across serving stacks and hardware (software, hardware)
    • Use case: Standardize INT2 storage formats/kernels in CUDA/ROCm, TensorRT-LLM, TGI, and other frameworks; hardware-friendly Hadamard ops and dual-scale dequantization.
    • Potential products: Turnkey "low-bit KV" mode in major serving frameworks; accelerator libraries with fused Hadamard+dequant kernels.
    • Assumptions/dependencies: Vendor buy-in; kernel/compiler optimizations; broad testing across GPUs/NPUs.
  • Precision scheduling and hybrid compression (software)
    • Use case: Dynamically combine KVarN with eviction/token-merging (SnapKV, PyramidKV, H2O, KVZip, CaM, D2O) for >10× effective context extension under tight memory.
    • Potential workflows: Layer-wise or time-varying precision schedules; mixed K/V precision tuned to attention patterns and error accumulation.
    • Assumptions/dependencies: Robust controllers to avoid quality cliffs; telemetry to detect outlier errors in-flight.
  • Training-time co-design (academia, foundation model providers)
    • Use case: Quantization-aware finetuning or pretraining that internalizes Hadamard/dual-scaling structure; learned rotations; layers regularized for magnitude stability to suppress outliers.
    • Potential products: "KV-quant-ready" checkpoints; adapters that minimize magnitude-driven outliers.
    • Assumptions/dependencies: Access to training pipelines and data; compute budget; evidence that train-time changes further reduce accumulation.
  • Extension to newer attention variants and architectures (software, robotics, multimodal)
    • Use case: Adapt KVarN-like variance normalization to KV caches in vision transformers, video-LLMs, and agentic/robotic planners with long-horizon memory; explore compatibility with MLA and other train-time KV compression schemes.
    • Potential tools: Cross-modal cache compressors; robot stacks with long-memory budgets.
    • Assumptions/dependencies: Similar cache structures are present; interactions with MLA or alternative cache designs are favorable.
  • Ultra-long-context pipelines and "elastic context" services (cloud)
    • Use case: Offer context-on-demand (e.g., 1–10M tokens) by layering KVarN with eviction/merging and segment-wise rolling caches; intelligent paging across GPUs/CPUs.
    • Potential products: "Elastic context servers" that trade precision vs. length at runtime; SLAs tied to decode-accumulation metrics.
    • Assumptions/dependencies: Stable algorithms for very long horizons; robust paging and failure handling.
  • Standardized evaluation and compliance frameworks (policy, industry consortia)
    • Use case: Create decode-accumulation-aware benchmarks and certification for low-bit KV caches (akin to MLPerf segments) to ensure reliability in regulated sectors (healthcare, finance, gov).
    • Potential workflows: Compliance reports that include outlier-error analyses; standardized pseudo-decode protocols.
    • Assumptions/dependencies: Community consensus on metrics and task suites; third-party evaluation infrastructure.
  • Broader activation compression (software, hardware co-design)
    • Use case: Extend dual-axis variance normalization to other activations beyond KV (e.g., intermediate attention outputs) where accumulation matters, with low overhead.
    • Potential products: General "variance-normalized activation compression" libraries; fused kernels in accelerators.
    • Assumptions/dependencies: Careful QoS validation to avoid compounding errors; workload-specific tuning.
  • Privacy-preserving on-prem and offline reasoning (enterprise, public sector)
    • Use case: Maintain FP16-like reliability for confidential, long-context tasks (e.g., legal discovery, medical case summaries) on fewer or smaller GPUs by compressing KV caches.
    • Potential products: Secure, memory-optimized on-prem solutions; offline laptop/edge deployments for long-form analytics.
    • Assumptions/dependencies: Validation on domain data; governance sign-off; hardware/kernel readiness for 2-bit KV at scale.

Glossary

  • AIME24: A competition-level mathematics benchmark used to evaluate long-form reasoning in LLMs. "including MATH500, AIME24 and HumanEval, at 2-bit precision."
  • attention logits: The pre-softmax scores produced by attention mechanisms, determining how much each token attends to others. "impacts the attention logits."
  • attention sink: Special initial tokens retained at high precision to stabilize attention during decoding. "S sink tokens in FP16 (preserving attention sink behavior)"
  • autoregressive decoding: Token-by-token generation where each new token conditions on previously generated tokens. "errors behave differently under autoregressive decoding."
  • Bernoulli-based masking: A stochastic masking strategy where elements are dropped/kept according to Bernoulli trials. "applies a Bernoulli-based masking process to merge value states within the KV cache during long-sequence generation."
  • calibration-free: A quantization approach that does not require calibration data or activation samples. "We introduce KVarN, a calibration-free KV-cache quantizer"
  • codebook-based methods: Quantization methods that represent data using indices into a learned set of prototype vectors (a codebook). "codebook-based methods like TurboQuant [33]"
  • dual-scaling: Applying separate scaling along two tensor axes (e.g., token and channel) to normalize variances. "We identify dual-scaling with Sinkhorn-based variance-normalization as an effective approach to further mitigate token scaling errors."
  • Hadamard rotation: Rotating data using a Hadamard matrix to distribute energy and reduce outliers before quantization. "applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices."
  • Hadamard transform: An orthogonal, fast transform with O(N log N) complexity used to decorrelate and Gaussianize data. "The Hadamard transform is fast enough for online application (with O(N log N) complexity)"
  • incoherence processing: Preprocessing (e.g., rotations) that makes signals more uniformly distributed to improve quantization. "This is often referred to as incoherence processing."
  • Johnson-Lindenstrauss transform: A random projection that approximately preserves inner products/distances in lower dimensions. "a 1-bit quantized Johnson-Lindenstrauss transform to better preserve inner products."
  • KL-divergence: A statistical measure of divergence between probability distributions used to assess output degradation. "improves end-to-end KL-divergence more than fixing the other 95%, even though more MSE lies there (see Fig. 9)."
  • Kitty: A 2-bit KV-cache quantization method with dynamic channel-wise precision boosting. "Kitty [31] proposes a 2-bit quantization scheme augmented with channel-wise importance selection"
  • KIVI: An asymmetric 2-bit KV-cache quantization scheme that quantizes keys per channel and values per token. "KIVI [19] has shown that with round-to-nearest quantization, it is best to quantize the V matrix per token and the K matrix per channel."
  • KV-Cache: The storage of past key and value tensors used by attention during long-context inference. "the efficient handling of the KV-Cache is becoming more and more relevant to achieve optimal reasoning per time trade-offs."
  • KV-Cache quantization: Compressing the KV-cache to low-bit representations to reduce memory while preserving performance. "One avenue of attack on this memory bottleneck is KV-Cache quantization."
  • KVQuant: A KV-cache quantization approach using non-uniform quantization and outlier-aware retention. "KVQuant [9] introduces non-uniform quantization combined with outlier-aware handling"
  • kurtosis: A measure of tail heaviness of a distribution; high kurtosis indicates more extreme outliers. "increase the per-channel kurtosis."
  • line-retrieval: A synthetic long-context task where a model must retrieve a code from a specified line in a long list. "We give comprehensive results on line-retrieval for various baselines and models, see Tab. 4."
  • MATH500: A benchmark of 500 math problems requiring multi-step derivations and exact solutions. "MATH500 is a mathematical reasoning benchmark requiring models to formulate complex derivations and output mathematical solutions."
  • outliers: Rare, extreme-value entries that disproportionately affect quantization error and end-to-end performance. "In this sense we can say that fixing outliers is disproportionally important."
  • pseudo-decode: An offline evaluation protocol that quantizes the KV-cache in blocks to simulate error accumulation during decoding. "We call this the 'pseudo-decode' setting."
  • RoPE (Rotary Positional Embedding): A positional encoding method that rotates token representations to encode positions. "after the RoPE-embedding."
  • round-to-nearest (RTN): A scalar quantization rule mapping values to the nearest discrete level. "Finally, it is quantized with round-to-nearest (RTN)."
  • Sinkhorn-Knopp normalization: An iterative matrix scaling method used here to equalize variances across rows and columns. "variance-targeted Sinkhorn-Knopp-style normalization."
  • test-time scaling: Improving model performance by allocating more compute or generating longer sequences during inference. "Test-time scaling is a powerful approach to obtain better reasoning in LLMs"
  • token merging: Compression by combining multiple tokens into fewer, more informative representations. "token merging reduces the number of tokens by combining them into a smaller set of informative representation."
  • TurboQuant: An online vector quantization method using random rotations and residual corrections for KV-cache compression. "TurboQuant [33] frames KV cache compression as a near-optimal vector quantization problem."
  • Uniform Precision (UP): A setting where all elements of a tensor share the same numerical precision, as opposed to mixed precision. "UP (Uniform Precision) indicates whether all elements within a given K or V tensor share the same precision (V) or use mixed precision (x)."
  • variance normalization: Scaling to equalize variance along specified axes to reduce magnitude-related quantization errors. "Variance normalization prevents the rounding process from scaling the norm of worst-case tokens;"
  • vector quantization: Representing vectors by nearest entries from a finite codebook to reduce storage. "near-optimal vector quantization problem."
  • zeropoint: The offset parameter used with a scale to map integers to real values in affine quantization. "an offset (or zeropoint)"
