
SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention

Published 21 Feb 2025 in cs.LG, cs.AI, and cs.CL | arXiv:2502.15304v1

Abstract: For the efficient inference of LLMs, the effective compression of key-value (KV) cache is essential. Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD) - based mixed precision quantization method for K cache. Initially, K cache is transformed into latent channels using SVD basis representations. Since the values in latent channels decay rapidly and become negligible after only a few latent channels, our method then incorporates importance-aware quantization and compression for latent channels. This enables the effective allocation of higher precision to more significant channels. Theoretically, we prove that SVDq results in quantization errors (0.1x or even lower) that are much lower than those of per-channel key quantization in the original space. Our findings based on RULER and LongBench benchmarks demonstrate that SVDq can achieve an equivalent key cache precision as low as 1.25-bit. When combined with key sparsity, it can reach a key compression ratio of up to 410x for attention computation, all while maintaining comparable model performance. Notably, our method is nearly lossless for LongBench datasets. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for KV cache compression in LLMs.

Summary

  • The paper introduces SVD-based mixed-precision quantization for LLM KV cache compression, achieving effective 1.25-bit precision and up to 410x compression.
  • It employs singular value decomposition to transform the K cache into latent channels, allowing targeted precision allocation based on channel significance.
  • Experimental results on RULER and LongBench benchmarks show maintained model accuracy and compatibility with sparsity techniques for efficient inference.

Introduction and Background

Efficient inference of LLMs heavily depends on the compression of the key-value (KV) cache, a critical component responsible for encoding past information during attention computations. Existing approaches to KV cache compression fall into three categories: sparsity, channel compression, and quantization. Each of these methods targets different aspects of the KV cache to improve efficiency and reduce memory footprint.

SVDq introduces a novel compression method based on Singular Value Decomposition (SVD) combined with mixed precision quantization specifically for the K cache. This approach leverages SVD to transform the original K cache into latent channels, exploiting the rapid decay of values in these channels for efficient quantization.

Given the prominent role of LLMs in various AI applications, the demand for efficient inference mechanisms is growing. Specifically, as the size of the KV cache expands with longer sequences or larger batch sizes, traditional inference approaches face substantial bottlenecks in terms of memory consumption and latency. SVDq aims to alleviate these issues, allowing LLMs to be deployed on memory-constrained devices while maintaining high performance.

Figure 1: Original K.

SVD-Based Quantization Technique

In SVDq, the K cache is decomposed using SVD into orthonormal basis representations, consistent with the Eckart–Young–Mirsky theorem. These basis representations allow the transformation of the K cache into a smaller set of latent channels whose magnitudes decay rapidly. The decay in singular values suggests that many latent channels contribute negligibly to overall information retention, providing a basis for selectively applying precision-aware quantization.

Figure 2: Diagram of the SVDq method (green path inside the box) versus direct per-channel quantization (dashed violet path inside the box).
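The latent-channel transform described above can be sketched with NumPy. The shapes and the synthetic key matrix below are illustrative assumptions, not the paper's setup; the point is that projecting onto the right-singular basis yields channels whose energy decays as fast as the singular values:

```python
import numpy as np

# Synthetic key cache with correlated channels so the singular-value
# spectrum decays, mimicking a real K cache. Shapes (n tokens, d channels)
# are illustrative assumptions.
n, d = 4096, 128
rng = np.random.default_rng(0)
mixing = np.exp(-np.arange(d) / 10.0)[:, None] * rng.standard_normal((d, d))
K = rng.standard_normal((n, d)) @ mixing

# SVD of the key matrix: K = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(K, full_matrices=False)

# Project K onto the right-singular basis to obtain latent channels.
# Column j equals S[j] * U[:, j], so its mean square is S[j]**2 / n:
# channel energy decays exactly as fast as the squared singular values.
K_latent = K @ Vt.T

energy = np.cumsum(S**2) / np.sum(S**2)
n99 = int(np.searchsorted(energy, 0.99)) + 1
print(f"latent channels holding 99% of the energy: {n99} of {d}")
```

Because the transform is orthogonal, `K_latent @ Vt` recovers `K` exactly; only the quantization step that follows is lossy.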

SVDq employs a mixed-precision quantization scheme that initially assigns higher bit widths to more significant latent channels and progressively decreases the precision for less significant channels. This method avoids the pitfalls of uniform quantization in the original channel dimension, where variance is more uniformly distributed, often resulting in significant quantization errors.
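The importance-aware allocation can be sketched as follows. The bit schedule here is an assumption chosen for illustration, not the paper's published configuration; it simply assigns high bit widths to the leading latent channels and drops the tail:

```python
import numpy as np

def quantize_channel(x, bits):
    """Uniform min-max quantization of one latent channel to `bits` bits."""
    if bits == 0:
        return np.zeros_like(x)          # tail channel dropped entirely
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    step = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / step) * step + lo

# Illustrative schedule: (bit width, channel count), ordered from most to
# least significant latent channel. This exact allocation is an assumption.
schedule = [(8, 4), (4, 16), (2, 44), (0, 64)]
bit_plan = [b for b, count in schedule for _ in range(count)]

rng = np.random.default_rng(1)
# Latent channels with rapidly decaying scale, as produced by the SVD step.
K_latent = rng.standard_normal((2048, 128)) * np.exp(-np.arange(128) / 8.0)

K_quant = np.stack(
    [quantize_channel(K_latent[:, j], bit_plan[j]) for j in range(128)], axis=1
)
avg_bits = sum(bit_plan) / len(bit_plan)   # effective bits per stored value
```

With this toy schedule the average cost is about 1.44 bits per value; the paper reports an equivalent precision as low as 1.25-bit with its own allocation.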

Theoretical Insights

The paper provides a theoretical foundation for SVDq, showing that $\mathbb{E}[(\mathcal{P}_V(K))^2]$, the variance of the projected latent channels, is proportional to the square of the corresponding singular values. With this insight, SVDq's quantization scheme effectively mitigates quantization errors by allocating precision based on the significance of singular values. High precision is applied where singular values—and hence variances—are high, minimizing information loss during compression.

The theoretical analysis illustrates that SVDq achieves quantization errors substantially lower than per-channel quantization in the original key space. This results in retaining nearly lossless model performance even at lower bit widths, such as an equivalent precision of 1.25-bit when combined with sparsity techniques.
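This error reduction can be checked numerically by quantizing the same matrix two ways under an equal bit budget: uniform 2-bit per-channel quantization in the original space, versus concentrating the same total budget on leading latent channels. The bit allocation and synthetic data below are assumptions for the sketch:

```python
import numpy as np

def quantize(x, bits):
    """Per-channel uniform min-max quantization; `bits` is a per-channel array."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    levels = 2.0 ** bits - 1
    keep = levels > 0
    safe = np.where(keep, levels, 1.0)
    q = np.round((x - lo) / span * safe) / safe * span + lo
    return np.where(keep, q, 0.0)        # 0-bit channels are dropped

rng = np.random.default_rng(2)
n, d = 4096, 64
# Keys whose original channels have comparable variance, but whose
# singular-value spectrum decays quickly.
mixing = np.exp(-np.arange(d) / 6.0)[:, None] * rng.standard_normal((d, d))
K = rng.standard_normal((n, d)) @ mixing

U, S, Vt = np.linalg.svd(K, full_matrices=False)
K_latent = K @ Vt.T

# Equal budget: 2 bits per channel uniformly, vs 8*8 + 16*4 = 128 = 64*2
# bits concentrated on the leading latent channels.
mse_base = np.mean((quantize(K, np.full(d, 2.0)) - K) ** 2)
bits = np.zeros(d)
bits[:8], bits[8:24] = 8, 4
mse_svdq = np.mean((quantize(K_latent, bits) @ Vt - K) ** 2)
print(f"per-channel MSE: {mse_base:.4f}, SVDq-style MSE: {mse_svdq:.4f}")
```

Because latent-channel variance scales with the squared singular values, spending the bit budget where those values are large drives the error well below the uniform baseline, in line with the paper's claimed 0.1x-or-lower error.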

Experimental Results

Experiments on the RULER and LongBench benchmarks demonstrate the efficacy of SVDq, achieving compression ratios of up to 410x while maintaining comparable model performance. Notably, SVDq outperforms existing approaches such as ThinK and direct per-channel quantization in preserving model accuracy during inference.

SVDq proves compatible with sparsity techniques, such as those used in ShadowKV, where it achieves even higher compression ratios (up to 410x) with negligible performance degradation. This compatibility showcases its versatility in adapting to various compression strategies within the KV cache compression landscape.
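The headline numbers are consistent with simple arithmetic. The sparsity fraction below is an illustrative assumption chosen to reproduce the reported 410x figure; the paper's actual sparsity mechanism follows ShadowKV:

```python
baseline_bits = 16.0    # FP16 key cache
effective_bits = 1.25   # equivalent precision reported for SVDq
quant_ratio = baseline_bits / effective_bits   # 12.8x from quantization alone

sparsity_keep = 1 / 32  # assumed fraction of key positions attended
total_ratio = quant_ratio / sparsity_keep      # 12.8 * 32 = 409.6, i.e. ~410x
print(f"quantization {quant_ratio:.1f}x, combined {total_ratio:.1f}x")
```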

Practical Implications and Future Directions

The implications of SVDq span both practical and theoretical realms. Practically, it provides a robust solution to KV cache compression, enabling deployment of performant LLMs with minimized memory footprint. Theoretically, it casts light on the profound impact of singular value distributions in designing quantization strategies.

Potential future developments include exploring SVDq's integration with other attention mechanisms beyond transformers, investigating its potential in real-time applications, and optimizing its computational aspects to reduce inference latency further.

Conclusion

SVDq represents a significant advancement in KV cache compression for LLMs by integrating SVD with mixed precision quantization. Through its innovative approach, it achieves high compression rates, particularly when combined with sparsity, while ensuring sustained model accuracy. By addressing KV cache memory constraints, SVDq facilitates efficient inference, paving the way for broader application in AI-driven technologies.
