Create a Video View Paper

KVarN: Taming Quantization Errors in Long-Context Reasoning

This lightning talk explores KVarN, a novel approach to KV-cache quantization that addresses a critical but overlooked problem: error accumulation during autoregressive decoding. By combining channel-wise Hadamard rotation with dual-dimension variance normalization, KVarN achieves aggressive compression to 2.3 bits per element while maintaining near-lossless accuracy on complex reasoning tasks, outperforming all prior methods where it matters most—when errors compound across hundreds or thousands of generated tokens.

Script

Compress a language model's memory cache too aggressively, and something insidious happens: errors don't just exist, they multiply. Each token the model generates builds on the last, and quantization mistakes compound across hundreds of steps, silently destroying performance on the reasoning tasks that matter most.

The authors built KVarN around a sharp insight: standard quantization fails because it can't preserve per-token scaling, allowing magnitude outliers to wreak havoc. Their solution applies two core operations. First, a Hadamard rotation along channels spreads energy and tames outliers. Second, a dual-dimension variance normalization balances scale across both tokens and channels, explicitly correcting the per-token drift that causes downstream collapse.

Traditional quantization methods are evaluated in static, prefill-only scenarios that mask the real problem. KVarN was tested under pseudo-decode conditions, where cache outputs are quantized afresh at every block during generation, mimicking true inference and exposing cumulative error. The results are striking: replacing just the worst 5 percent of quantized vectors with high-precision alternatives yields massive improvements in output quality, proving that outlier errors dominate, not average distortion.

On MATH500, AIME24, and HumanEval benchmarks, KVarN at 2.3 bits per element matches or beats every prior 2 to 4 bit method. Phi-4 on MATH500 hits 84.8 percent accuracy with KVarN, versus 77 percent for TurboQuant and 74.4 percent for KIVI. The gap widens precisely where error accumulation matters: tasks requiring hundreds of tokens and deep reasoning chains, where prior methods collapse and KVarN holds steady.

The dual-scaling mechanism adds virtually no overhead: normalization costs under 0.2 percent per 128-token chunk, and dequantization latency increases by less than 1.4 percent when scale fusion is applied. This makes KVarN practical for production deployment, unlocking order-of-magnitude memory reductions without sacrificing throughput or reliability in memory-constrained and edge environments.

KVarN reframes quantization as an outlier control problem, not just an average error minimization task, and the evidence is clear: magnitude errors in the tail matter far more than uniform distortion. This opens new directions for compression research, from extending dual normalization to state-space models, to joint quantization-aware fine-tuning, to task-adaptive scaling schedules. If you want to dive deeper into KVarN or create your own research video breakdowns, head over to EmergentMind.com and explore what's possible.