
PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

Published 7 Oct 2024 in cs.LG and cs.CL | (2410.05265v2)

Abstract: Existing weight-activation quantization methods for LLMs primarily address channel-wise outliers but often neglect token-wise outliers, which limits the accuracy of quantized models. In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers. First, PrefixQuant eliminates token-wise outliers by prefixing outlier tokens in the KV cache, a process that is training-free and highly efficient (e.g., 1 minute for Llama-3-70B). Second, PrefixQuant introduces new trainable parameters for block-wise training to compensate for quantization error. Our experiments show that PrefixQuant significantly outperforms existing dynamic quantization methods, even under coarser static quantization settings. For instance, PrefixQuant achieves an average accuracy improvement of +3.08 and +2.85 points over SpinQuant (dynamic quantization) on five zero-shot reasoning tasks under dynamic and static quantization settings, respectively, on W4A4KV4 Llama-3-8B. Additionally, we demonstrate up to 2.74x prefilling speedup and 2.16x decoding speedup for LLMs using W4A4 PrefixQuant. Our code is available at https://github.com/ChenMnZ/PrefixQuant.

Summary

  • The paper introduces a static quantization method that isolates high-frequency token outliers offline, eliminating the need for costly per-token dynamic quantization.
  • It achieves strong performance with a WikiText2 perplexity of 7.43 and a 5.98 point accuracy improvement over previous methods.
  • The approach accelerates inference by up to 2.81x compared to FP16 models, offering a plug-and-play solution for efficient LLM deployment.

An Overview of PrefixQuant: Static Quantization in LLMs

The paper "PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs" presents a novel approach to quantization in LLMs by proposing a technique known as PrefixQuant. Quantization is a critical process for reducing the memory usage and enhancing the inference speed of LLMs, which are characterized by large parameters and computational demands. Existing quantization techniques often focus on channel-wise outliers, disregarding token-wise outliers, which has resulted in a reliance on costly per-token dynamic quantization methods.

Key Contributions

PrefixQuant introduces a static quantization approach that identifies high-frequency outlier tokens offline and isolates them by prefixing them in the KV cache, so that these outliers no longer appear among the activations quantized at inference time; this isolation step requires no training. With token-wise outliers removed, the method can rely on per-tensor static quantization, which is computationally far cheaper than per-token dynamic quantization. A sketch of the idea follows.
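The snippet below sketches the prefixing idea against a HuggingFace-style causal LM interface: the identified outlier tokens are run through the model once, offline, and their key/value states are kept as a reusable prefix cache. The function names, the choice of outlier token ids, and the cache handling are assumptions made for illustration; the authors' actual implementation lives in the PrefixQuant repository.

```python
import torch

@torch.no_grad()
def build_prefix_cache(model, outlier_token_ids, device="cuda"):
    # Process the high-frequency outlier tokens (identified offline) exactly once
    # and keep their key/value states -- the "prefixed tokens in the KV cache" idea.
    prefix_ids = torch.tensor([outlier_token_ids], device=device)
    out = model(prefix_ids, use_cache=True)
    return out.past_key_values

@torch.no_grad()
def prefill_with_prefix(model, input_ids, prefix_cache):
    # Subsequent tokens attend to the cached prefix, so the outlier tokens never
    # re-enter the quantized activation path, and per-tensor static scales no
    # longer have to stretch to cover their extreme magnitudes.
    return model(input_ids, past_key_values=prefix_cache, use_cache=True)
```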

Numerical Results

The paper presents strong numerical results demonstrating the efficacy of PrefixQuant. Under the W4A4KV4 setting on Llama-3-8B, PrefixQuant reaches a WikiText2 perplexity of 7.43 and an average accuracy of 71.08% across five common-sense reasoning tasks, surpassing previous methods such as QuaRot by 0.98 points of perplexity and 5.98 points of accuracy. PrefixQuant also delivers substantial inference speedups, running 1.60x to 2.81x faster than FP16 models and 1.2x to 1.3x faster than QuaRot models.

Implications and Future Directions

The introduction of PrefixQuant has significant implications for the deployment of LLMs. By improving static quantization techniques, this approach enhances the inference efficiency and reduces computational overhead, particularly beneficial for real-time applications. The method's capability to outperform dynamic quantization methods without additional training underscores its potential for broader applicability in LLM compression and optimization research.

Additionally, PrefixQuant improves training stability by minimizing the influence of large outliers during Mean Squared Error (MSE) loss calculations, positioning itself as a plug-and-play component that can strengthen existing optimization-based methods.
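A rough sketch of that block-wise reconstruction idea is shown below: the trainable parameters of a quantized transformer block are fit to the full-precision block's outputs under an MSE objective. The loop structure, hyperparameters, and the assumption that a block maps a tensor to a tensor are simplifications of the paper's actual recipe; the point is that with outlier tokens already isolated by the prefix, the MSE target is not dominated by a few extreme activations.

```python
import torch
import torch.nn.functional as F

def blockwise_reconstruct(fp_block, quant_block, calib_inputs, steps=200, lr=1e-4):
    # Fit only the quantized block's trainable parameters (e.g., newly introduced
    # scales/offsets) to reproduce the full-precision block's outputs.
    params = [p for p in quant_block.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for step in range(steps):
        x = calib_inputs[step % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)          # full-precision reference output
        loss = F.mse_loss(quant_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return quant_block
```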

Future research avenues may explore further refinements to PrefixQuant's token isolation strategies and investigate its integration with other model compression techniques to maximize efficiency across diverse LLM architectures. The technique also opens possibilities for applying static quantization gains in other domains beyond LLMs.

In summary, PrefixQuant addresses a critical bottleneck in the deployment of LLMs by intelligently managing outlier tokens, thus paving the way for more efficient static quantization processes. This contribution represents a meaningful step towards optimizing the performance of LLMs in resource-constrained environments.
