ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Published 11 Sep 2025 in cs.LG, cs.AI, and cs.CL | (2509.09679v1)

Abstract: LLMs require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{W}\mathbf{x} = (\mathbf{W}\mathbf{Q}^{T})(\mathbf{Q}\mathbf{x})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.


Summary

The paper introduces ButterflyQuant, a method that leverages learnable orthogonal butterfly transforms to perform ultra-low-bit quantization of large language models.
It employs a hierarchical composition of continuous Givens rotations to address layer-specific heterogeneity, reducing 2-bit perplexity on the LLaMA-2-7B model from 22.1 (QuaRot) to 15.4.
Empirical results show the method retains about 88% of full-precision model accuracy on reasoning tasks, facilitating LLM deployment on consumer-grade hardware.


Introduction

In "ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms," the authors address the challenge of reducing the memory footprint of LLMs to facilitate their deployment on consumer-grade hardware. The study introduces ButterflyQuant, a method leveraging learnable orthogonal butterfly transforms to perform ultra-low-bit quantization while mitigating performance degradation commonly associated with extreme quantization levels.

Challenges in LLM Quantization

LLMs typically demand significant memory resources, limiting their deployment on consumer hardware. Standard quantization techniques that reduce numerical precision to 2-4 bits suffer severe performance loss, attributed primarily to outliers in activations. These outliers stretch the dynamic range, leaving too few quantization levels for the remaining values. To counter this, rotation-based quantization methods apply an orthogonal transform $\mathbf{Q}$ to the activations before quantization and fold $\mathbf{Q}^{T}$ into the weights, which leaves the layer output unchanged because $(\mathbf{W}\mathbf{Q}^{T})(\mathbf{Q}\mathbf{x}) = \mathbf{W}\mathbf{x}$.
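The invariance that rotation-based methods rely on can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the sizes, the synthetic outlier, the helper names `hadamard` and `peak_to_rms`, and the use of a Sylvester-constructed Hadamard matrix as the fixed rotation are all illustrative assumptions.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester's construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def peak_to_rms(v):
    """Ratio of the largest magnitude to the RMS value; large when outliers dominate."""
    return np.max(np.abs(v)) / np.sqrt(np.mean(v ** 2))

n = 8
rng = np.random.default_rng(0)
W = rng.standard_normal((4, n))   # toy weight matrix (assumption)
x = np.linspace(-1.0, 1.0, n)     # toy activation vector (assumption)
x[3] = 50.0                       # inject a single outlier channel

Q = hadamard(n)                   # fixed orthogonal rotation

# Computational invariance: rotating activations and counter-rotating weights
# leaves the layer output unchanged.
y_ref = W @ x
y_rot = (W @ Q.T) @ (Q @ x)
invariance_ok = np.allclose(y_ref, y_rot)

# The rotation spreads the outlier's energy across all channels.
ratio_before = peak_to_rms(x)
ratio_after = peak_to_rms(Q @ x)
print(invariance_ok, ratio_before, ratio_after)
```

The peak-to-RMS ratio drops after rotation because the outlier's energy is shared among all channels, which is the property that makes the rotated vector easier to quantize at low bit widths.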

Figure 1: Layer heterogeneity motivates learnable transforms for LLM quantization.

The paper identifies the heterogeneity across transformer layers—each exhibiting distinct outlier characteristics—as a critical issue. This layer-specific variability suggests that a one-size-fits-all transformation approach is suboptimal.

ButterflyQuant Approach

ButterflyQuant replaces fixed Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles, which makes them amenable to gradient-based optimization. Unlike Hadamard's discrete $\{+1, -1\}$ entries, the continuous parameterization lets each transform adapt to the outlier pattern of its layer while remaining orthogonal by construction, a crucial property for maintaining theoretical guarantees in outlier suppression.

Implementation Details

The butterfly transform is parameterized so that orthogonality holds by construction while the transform remains expressive and costs only $O(n \log n)$ to apply. This is achieved through a composition of Givens rotations with $\frac{n \log n}{2}$ learnable angles in total, creating a sparse, hierarchical structure well suited to adaptation through gradient descent.
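A butterfly built from Givens rotations can be sketched in a few lines. The version below assumes the standard FFT-style pairing (stage $s$ pairs index $i$ with $i + 2^s$) and one angle per pair; the paper's exact stage ordering and pairing are not given in this summary, so the function name `butterfly_apply` and the layout of `angles` are illustrative assumptions.

```python
import numpy as np

def butterfly_apply(x, angles):
    """Apply log2(n) stages of Givens rotations to a vector of length n = 2**k.

    angles has shape (log2(n), n // 2): one rotation angle per pair per stage.
    """
    n = x.shape[0]
    y = x.astype(float)                    # astype returns a fresh copy
    idx = np.arange(n)
    for s in range(angles.shape[0]):
        stride = 1 << s
        lo = idx[(idx // stride) % 2 == 0]  # first element of each pair
        hi = lo + stride                    # its partner, `stride` positions away
        c, sn = np.cos(angles[s]), np.sin(angles[s])
        a, b = y[lo], y[hi]                 # fancy indexing yields copies
        y[lo] = c * a - sn * b              # 2x2 Givens rotation on each pair
        y[hi] = sn * a + c * b
    return y

n = 8
num_stages = int(np.log2(n))
rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, size=(num_stages, n // 2))

# Materialize the dense matrix only to inspect it; in practice one applies
# the stages directly at O(n log n) cost.
B = np.stack([butterfly_apply(e, angles) for e in np.eye(n)], axis=1)
x = rng.standard_normal(n)

print(np.allclose(B.T @ B, np.eye(n)))  # orthogonal for any choice of angles
print(angles.size)                      # n * log2(n) / 2 parameters
```

Each stage touches every element once, so applying all $\log_2 n$ stages costs $O(n \log n)$, and because every 2x2 block is a rotation, the product is orthogonal regardless of the angle values.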

Rotation-based quantization methods such as QuaRot and QuIP apply predetermined orthogonal transforms to redistribute outliers across channels. In contrast, butterfly transforms are distinct in their ability to learn layer-specific adaptations while retaining computational efficiency and orthogonality.
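To see how a fixed rotation relates to the learnable family, consider the special case in which every Givens rotation in a stage shares one angle. The dense butterfly then reduces to a Kronecker product of 2x2 rotations. The sketch below is an illustration under that simplifying assumption (the helper names are hypothetical, and nothing here is claimed about how the paper initializes its angles): with all angles at $\pi/4$ every entry has magnitude $1/\sqrt{n}$, matching the Hadamard coherence, while all angles at $0$ give the identity.

```python
import numpy as np
from functools import reduce

def givens(theta):
    """2x2 Givens rotation by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def uniform_butterfly(thetas):
    """Dense butterfly with one shared angle per stage (Kronecker product of 2x2 blocks)."""
    return reduce(np.kron, [givens(t) for t in thetas])

def coherence(Q):
    """Largest entry magnitude; 1/sqrt(n) is the minimum for an orthogonal matrix."""
    return np.max(np.abs(Q))

n = 8
stages = int(np.log2(n))

Q_hadamard_like = uniform_butterfly([np.pi / 4] * stages)  # entries are +-1/sqrt(n)
Q_identity = uniform_butterfly([0.0] * stages)             # no mixing at all
Q_between = uniform_butterfly([np.pi / 8] * stages)        # an intermediate rotation

for name, Q in [("pi/4", Q_hadamard_like), ("0", Q_identity), ("pi/8", Q_between)]:
    print(name, coherence(Q), np.allclose(Q @ Q.T, np.eye(n)))
```

Varying the angles moves continuously between these extremes without ever leaving the set of orthogonal matrices, which is what allows a per-layer rotation to be tuned by gradient descent rather than fixed in advance.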

Performance Evaluation

Empirically, ButterflyQuant quantizes models to 2 bits while achieving lower perplexity than competing rotation-based methods. For example, on the LLaMA-2-7B model, it reaches a perplexity of 15.4 compared with 22.1 for QuaRot. Additionally, it maintains approximately 88% of the full-precision model's accuracy on various reasoning tasks.
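For a rough sense of what 2-bit weights mean for memory, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate, not a figure from the paper: it assumes a nominal 7 billion parameters, a 16-bit baseline, and ignores quantization scales, any layers kept at higher precision, activations, and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7e9  # assumption: nominal parameter count for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit baseline
w2_gb = weight_memory_gb(NUM_PARAMS, 2)     # 2-bit quantized weights

print(f"16-bit: {fp16_gb:.2f} GB, 2-bit: {w2_gb:.2f} GB, ratio: {fp16_gb / w2_gb:.0f}x")
```

Under these assumptions the weights shrink by a factor of eight, which is the kind of reduction that brings a 7B-parameter model within reach of consumer-grade memory budgets.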

Figure 2: Impact of initialization strategy on final perplexity.

The adaptability provided by learnable transforms allows ButterflyQuant to achieve superior performance, highlighting the importance of continuous parameterization in addressing layer heterogeneity.

Conclusion

ButterflyQuant advances the quantization of LLMs by introducing learnable orthogonal butterfly transforms, which address the limitations of fixed rotation strategies and balance theoretical guarantees with practical efficiency. Its ability to compress models to 2 bits while retaining roughly 88% of full-precision accuracy on reasoning tasks opens pathways for more accessible deployment of complex models across a broader range of hardware. Future research may extend the technique to more diverse architectures and to other areas of model compression.