
FlatQuant: Flatness Matters for LLM Quantization

Published 12 Oct 2024 in cs.CL and cs.LG (arXiv:2410.09426v3)

Abstract: Recently, quantization has been widely used for the compression and acceleration of LLMs. Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still exhibit steep and dispersed distributions. In this paper, we propose FlatQuant (Fast and Learnable Affine Transformation), a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead of affine transformation, we apply Kronecker product with two lightweight matrices, and fuse all operations in FlatQuant into a single kernel. Extensive experiments demonstrate that FlatQuant establishes a new state-of-the-art benchmark for quantization. For example, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. Additionally, it provides up to 2.3x prefill speedup and 1.7x decoding speedup compared to the FP16 model. Code is available at: https://github.com/ruikangliu/FlatQuant.

Summary

  • The paper introduces FlatQuant, a post-training quantization (PTQ) framework for LLMs that uses learned affine transformations to flatten weight and activation distributions, effectively reducing quantization error.
  • FlatQuant employs Kronecker decomposition to efficiently represent transformations and uses fused kernels for operations, minimizing computational overhead and achieving significant inference speedups.
  • FlatQuant achieves state-of-the-art W4A4 quantization with less than a 1% accuracy drop on LLaMA-3-70B, delivering up to 2.3x prefill and 1.7x decoding speedups.

Overview

The paper "FlatQuant: Flatness Matters for LLM Quantization" (2410.09426) introduces a novel post-training quantization (PTQ) framework designed to address the persistent issues caused by outlier distributions in both weights and activations in LLMs. By emphasizing the importance of flattening these distributions, the proposed method seeks to reduce quantization error when using equally spaced quantization levels. Rather than relying solely on pre-quantization transformations, such as per-channel scaling or Hadamard transforms, FlatQuant operates as a post-training approach that learns optimal, layer-specific affine transformations. This methodology results in significantly improved quantization accuracy and lower inference latency compared to state-of-the-art approaches.

Methodology

Learned Affine Transformations

A core innovation in FlatQuant is the optimization of an invertible affine transformation for each linear layer. For a given linear operation, expressed as:

Y = XW^T,

the approach seeks an optimal invertible transformation matrix P such that the quantized operation minimizes the quantization error:

P^* = \arg\min_P \|Y - Q(XP) \cdot Q(P^{-1}W^T)\|_F^2,

where Q(\cdot) denotes the quantization function. This formulation counteracts the steep, outlier-driven distributions by learning a transformation that promotes “flatness” in both the weight and activation distributions.
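To make the calibration objective concrete, here is a minimal NumPy sketch of the loss above. The symmetric per-tensor uniform quantizer and the function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def quantize(x, bits=4):
    # Illustrative stand-in for Q(.): symmetric per-tensor uniform quantizer
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def calibration_loss(X, W, P, bits=4):
    # || Y - Q(X P) Q(P^{-1} W^T) ||_F^2  with  Y = X W^T
    Y = X @ W.T
    Xq = quantize(X @ P, bits)
    Wq = quantize(np.linalg.inv(P) @ W.T, bits)
    return np.sum((Y - Xq @ Wq) ** 2)
```

Calibration would then minimize this loss over P (and the other learnable parameters) by gradient descent on a small calibration set, with a straight-through estimator for the rounding step.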

Kronecker Decomposition

To address the computational and memory overhead of storing a full transformation matrix for each layer, FlatQuant utilizes Kronecker decomposition. The transformation matrix P is decomposed as:

P = P_1 \otimes P_2,

with P_1 and P_2 being smaller invertible matrices: a d × d transform factors into a d_1 × d_1 and a d_2 × d_2 block with d = d_1 d_2. This decomposition not only reduces the number of learnable parameters but also lessens the computational burden during both calibration and inference, while still allowing quantization errors to be back-propagated through the two factors.
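The efficiency gain rests on a standard Kronecker identity: multiplying a length-d vector by P_1 ⊗ P_2 is equivalent to two small matrix multiplications on the reshaped vector. A sketch checking this numerically (the factor sizes d1 and d2 are illustrative, not the paper's choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 8, 16                      # e.g. hidden size d = d1 * d2 = 128
P1 = rng.standard_normal((d1, d1))  # small invertible factors
P2 = rng.standard_normal((d2, d2))
x = rng.standard_normal(d1 * d2)    # one activation vector

# Naive: materialize the full d x d transform (what the decomposition avoids)
y_full = x @ np.kron(P1, P2)

# Kronecker trick: two small matmuls on the row-major-reshaped vector
y_fast = (P1.T @ x.reshape(d1, d2) @ P2).reshape(-1)

assert np.allclose(y_full, y_fast)
```

The naive product costs O(d^2) multiplies per vector, while the factored form costs O(d1^2 * d2 + d1 * d2^2), and only the two small factors need to be stored.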

Per-Channel Scaling and Clipping Thresholds

FlatQuant further incorporates learnable per-channel scaling vectors to harmonize the variance between weights and activations. This is critical in managing the impact of outliers prior to the affine transformation. Additionally, learnable clipping thresholds (α_w for weights and α_a for activations) are applied to ensure that extreme values, even after applying the affine transformations, do not adversely affect the quantization process. These parameters, calibrated with a modest set of calibration data, help in maintaining a tight distribution that is resilient to quantization-induced accuracy degradation.
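A minimal sketch of both ingredients, assuming a symmetric uniform fake-quantizer with a clipping ratio and one hypothetical choice of per-channel scales (FlatQuant learns these parameters jointly; the concrete functions here are illustrative):

```python
import numpy as np

def fake_quant(x, alpha, bits=4):
    # Symmetric uniform fake-quantizer with a clipping ratio alpha in (0, 1];
    # shrinking the range tames residual outliers at the cost of clipping them.
    qmax = 2 ** (bits - 1) - 1
    clip = alpha * np.abs(x).max()
    scale = clip / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # activations
W = rng.standard_normal((4, 16))   # weights

# Per-channel scaling: divide activations by c and fold c into the weights,
# so X W^T == (X / c) (W * c)^T holds exactly in full precision.
c = np.abs(X).mean(axis=0).clip(min=1e-5)   # one hypothetical scale choice
assert np.allclose(X @ W.T, (X / c) @ (W * c).T)
```

Because the rescaling is exact before quantization, it changes only how dynamic range is split between X and W, which is what makes the scales safe to learn alongside α_w and α_a.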

Efficient Kernel Fusion

To mitigate the typical latency overhead introduced by pre-quantization transformations, the authors fuse the affine transformation, quantization, and Kronecker product operations into a single custom kernel. Implemented using OpenAI Triton, this fused operator loads the transformation matrices into SRAM, performs the requisite matrix operations entirely in memory, and subsequently writes back the results. This design choice minimizes memory access latency, facilitating significant speed improvements during both the prefill and decoding phases.

Experimental Evaluation

Accuracy and Performance Benchmarks

The experimental results presented in the paper are compelling on both quantization accuracy and inference speed:

  • Quantization Accuracy: When applying W4A4 quantization on the LLaMA-3-70B model, FlatQuant achieves an accuracy drop of less than 1%, which is particularly noteworthy given the high sensitivity of LLMs to quantization errors. This performance exceeds that of comparable methods such as SpinQuant by a margin of 7.5%.
  • Zero-Shot QA: The method also shows strong performance on zero-shot tasks across various QA benchmarks (ARC-Challenge, LAMBADA, etc.), reducing the gap between quantized models and FP16 baselines.

Inference Latency Improvements

  • Prefill and Decoding Speed: By fusing operations into a unified kernel, FlatQuant drastically reduces the latency overhead often incurred by pre-quantization transformations. Specifically, it reduces the additional runtime from 0.26x (as noted for QuaRot) to just 0.07x, resulting in up to a 2.3× speedup in prefill and a 1.7× speedup in decoding.
  • Memory Efficiency: The use of Kronecker decomposition plays a significant role in lowering both computational and memory requirements, making the method viable for deployment in resource-constrained environments.

Discussion and Implications

The FlatQuant approach underlines the importance of “flatness” in quantization strategies. By directly targeting and reducing the steepness of weight and activation distributions, the method facilitates more effective quantization even in low-bit regimes (e.g., W4A4). The framework’s reliance on learnable affine transformations, efficient matrix decompositions, and fused kernel operations renders it not only effective in terms of accuracy preservation but also highly efficient for practical deployment.

Strong numerical results reinforce the practicality of the approach—particularly the sub-1% accuracy drop in aggressive quantization scenarios, combined with notable speedups in inference—making it well-suited for real-world applications where both performance and latency are critical trade-offs.

Furthermore, the methodology is versatile enough to be extended to other quantization settings (e.g., weight-only quantization and KV cache quantization) with minimal performance degradation. For practitioners, these characteristics could lead to significant improvements in deploying LLMs on limited hardware without sacrificing model responsiveness or accuracy.

Conclusion

FlatQuant presents a sophisticated and highly practical framework for LLM quantization that directly tackles the challenge of outlier-induced quantization errors by enforcing flat distributions through learned affine transformations. Its incorporation of Kronecker decomposition minimizes overhead, while the use of fused kernels ensures negligible latency impact. The method sets a new benchmark in low-bit quantization for LLMs, making it an attractive option for both academic research and real-world deployment scenarios where inference speed and model accuracy are paramount.