
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

Published 25 Apr 2025 in cs.CL and cs.LG | (2504.18415v2)

Abstract: Efficient deployment of 1-bit LLMs is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.

Summary

  • The paper presents a novel Hadamard transformation module that reshapes non-Gaussian intermediate activations for effective 4-bit quantization in 1-bit LLMs.
  • It employs a two-stage training strategy, starting with 8-bit activations and continuing with native 4-bit activations to maintain performance while enhancing efficiency.
  • Experimental results show that the 4-bit variant achieves comparable perplexity and downstream task performance, outperforming other post-training quantization methods.

This paper, "BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs" (2504.18415), introduces a novel framework to address the challenge of quantizing activations in 1-bit LLMs to low bit-widths, specifically 4 bits, for improved inference efficiency on emerging hardware.

The core problem tackled is that while 1.58-bit weights (ternary: -1, 0, 1), as used in BitNet b1.58, significantly reduce memory bandwidth, the models often still rely on 8-bit activations. This limits the full utilization of hardware designed for 4-bit computations, shifting the bottleneck from memory bandwidth to computation. Aggressively quantizing activations to 4 bits is difficult because intermediate states within LLMs (outputs of attention output projection and FFN down projection) often have non-Gaussian distributions with significant outliers, which are challenging for low-bit fixed-point representations.
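To see why a few outliers make low-bit quantization hard, consider a toy per-token absmax quantizer at 4 bits (an illustrative numeric sketch, not code from the paper; the function name and constants are hypothetical):

```python
import numpy as np

def absmax_quant(x, bits=4):
    """Symmetric per-token absmax quantization to `bits` bits (simulated)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    scale = np.max(np.abs(x)) + 1e-8
    q = np.clip(np.round(x / scale * qmax), -qmax - 1, qmax)
    return q * scale / qmax                     # dequantized values

rng = np.random.default_rng(0)
x = rng.normal(size=1024)                       # Gaussian-like activations
x_out = x.copy()
x_out[0] = 50.0                                 # a single large outlier

err_gauss = np.mean((x - absmax_quant(x)) ** 2)
err_outlier = np.mean((x_out - absmax_quant(x_out)) ** 2)
print(err_gauss, err_outlier)                   # the outlier inflates the scale,
                                                # so most values round to zero
```

One extreme value stretches the quantization range, so the remaining values fall into only a few levels, which is exactly the failure mode the Hadamard transformation is meant to avoid.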

BitNet v2 proposes to enable native 4-bit activations across the entire model (except potentially the input/output embeddings). The key innovation is a new module, H-BitLinear, which replaces the standard linear layers for the attention output projection ($\mathbf{W}_\text{o}$) and the FFN down projection ($\mathbf{W}_\text{down}$). The layer applies an online Hadamard transformation to the activations before they are quantized.

The Hadamard transformation $\mathbf{H}_m$ is a $2^m \times 2^m$ orthogonal matrix constructible via a recursive formula. For an input vector $\mathbf{X}$ of size $n = 2^m$, the transformation is $\text{Hadamard}(\mathbf{X}) = \mathbf{H}_m \mathbf{X}$. The paper utilizes a fast Hadamard transform implementation with $\mathcal{O}(n \log n)$ complexity. The purpose of this transformation is to strategically reshape the distribution of the intermediate states.
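The recursive (Sylvester) construction and its fast $\mathcal{O}(n \log n)$ evaluation can be sketched as follows (an illustrative NumPy version, not the paper's implementation):

```python
import numpy as np

def hadamard_matrix(m):
    """Recursive Sylvester construction of the 2^m x 2^m Hadamard matrix."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def fwht(x):
    """Fast Walsh-Hadamard transform of a 1-D vector, O(n log n) for n = 2^m."""
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly step
        h *= 2
    return x

x = np.random.default_rng(1).normal(size=8)
assert np.allclose(fwht(x), hadamard_matrix(3) @ x)  # both compute H_m X
```

Because the (unnormalized) matrix satisfies $\mathbf{H}_m \mathbf{H}_m = n\mathbf{I}$, applying the transform twice and dividing by $n$ recovers the input, which is what makes the cheap backward pass described below possible.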
While the inputs to attention and FFN layers are often naturally Gaussian-like, the intermediate states are characterized by sharp distributions and numerous outliers. The Hadamard transformation smooths these sharp distributions, making them more amenable to low-bit quantization, as illustrated by the activation distribution plots in the paper.

The quantization scheme for BitNet v2 involves:

  • Weights: 1.58-bit ternary quantization to $\{-1, 0, 1\}$ using a per-tensor absolute-mean scaling factor:

    $\text{Q}_{w}(\mathbf{W}) = \alpha\,\text{RoundClip}\left(\frac{\mathbf{W}}{\alpha+\epsilon}, -1, 1\right),\quad \alpha = \text{mean}(|\mathbf{W}|)$

  • Activations:
      • 8-bit activations (for initial training and comparison) use per-token absmax quantization:

        $\text{Q}_{\text{INT8}}(\mathbf{X}) = \frac{\gamma}{127}\,\text{RoundClip}\left(\frac{127}{\gamma+\epsilon}\mathbf{X}, -128, 127\right),\quad \gamma = \max(|\mathbf{X}|)$

      • 4-bit activations (for efficient inference) use per-token absmean quantization:

        $\text{Q}_\text{INT4}(\mathbf{X}) = \frac{\beta}{\sqrt{7}}\,\text{RoundClip}\left(\frac{\sqrt{7}}{\beta+\epsilon}\mathbf{X}, -8, 7\right),\quad \beta = \text{mean}(|\mathbf{X}|)$

The computation within these layers (specifically $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$) is formulated as $\mathbf{Y} = \text{Q}_{w}(\mathbf{W}) \cdot \text{Q}_{\text{INT8/4}}(\mathbf{X}_r)$,
where $\mathbf{X}_r = \text{Hadamard}(\text{LN}(\mathbf{X}))$ and LN is Layer Normalization. For the other linear layers (e.g., $\mathbf{W}_\text{qkv}$, $\mathbf{W}_\text{up,gate}$), the Hadamard transformation is not applied, as their inputs already exhibit better distributions.

BitNet v2 employs a two-stage training strategy. Initially, the model is trained from scratch with 1.58-bit weights and 8-bit activations for most of the training tokens (e.g., 95B). Then it undergoes a continue-training phase with native 4-bit activations for all linear layers (except embeddings) for a much smaller number of tokens (e.g., 5B), reusing the optimizer states. Training utilizes the Straight-Through Estimator (STE) for gradient approximation and mixed-precision updates for the full-precision latent weights. The backward pass through the Hadamard transformation leverages its orthogonality by applying the same transformation to the gradients:

$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \text{Hadamard}\left(\frac{\partial \mathcal{L}}{\partial\,\text{Hadamard}(\mathbf{X})}\right)$
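Putting the pieces together, the H-BitLinear forward pass can be sketched as below. This is a simplified simulated-quantization version in NumPy, assuming the formulas above; the actual implementation would use a fast Hadamard transform and low-bit integer kernels rather than float emulation:

```python
import numpy as np

def round_clip(x, lo, hi):
    return np.clip(np.round(x), lo, hi)

def quant_weights_ternary(W, eps=1e-8):
    """1.58-bit absmean weight quantization: alpha times values in {-1, 0, 1}."""
    alpha = np.mean(np.abs(W))
    return alpha * round_clip(W / (alpha + eps), -1, 1)

def quant_act_int4(X, eps=1e-8):
    """Per-token absmean 4-bit activation quantization (simulated)."""
    beta = np.mean(np.abs(X), axis=-1, keepdims=True)
    s = np.sqrt(7)
    return (beta / s) * round_clip((s / (beta + eps)) * X, -8, 7)

def layer_norm(X, eps=1e-5):
    mu = X.mean(-1, keepdims=True)
    var = X.var(-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def hadamard(X):
    """Hadamard transform of each row via the explicit matrix (fine for a sketch)."""
    n = X.shape[-1]
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return X @ H  # H is symmetric, so this applies H to each row

def h_bitlinear(X, W):
    """Y = Q_w(W) . Q_INT4(Hadamard(LN(X))) -- simulated quantization."""
    Xr = hadamard(layer_norm(X))
    return quant_act_int4(Xr) @ quant_weights_ternary(W).T

X = np.random.default_rng(2).normal(size=(4, 16))   # 4 tokens, hidden size 16
W = np.random.default_rng(3).normal(size=(16, 16))
Y = h_bitlinear(X, W)
print(Y.shape)  # (4, 16)
```

Only $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$ would use this path; the other projections quantize their (already well-behaved) inputs directly, without the Hadamard step.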

Experimental results demonstrate the effectiveness of BitNet v2:

  • BitNet v2 trained with 8-bit activations (BitNet v2 (a8)) achieves performance comparable to or slightly better than BitNet b1.58 (W1.58A8), indicating that the insertion of the Hadamard transformation is not detrimental.
  • The 4-bit activation variant (BitNet v2 (a4)), obtained after continue-training, shows minimal performance degradation compared to its 8-bit counterpart and performs comparably to BitNet a4.8 (W1.58A4/A8 hybrid with sparsification) on perplexity and downstream tasks. Crucially, BitNet v2 (a4) uses dense 4-bit computations, making it more efficient for batched inference on hardware supporting native 4-bit operations compared to methods involving sparsification.
  • BitNet v2 (a4) significantly outperforms post-training quantization methods like SpinQuant and QuaRot when applied to BitNet b1.58 to achieve W1.58A4.
  • Ablation studies confirm that the Hadamard transformation is necessary for stable training with low-bit activations for intermediate states and that applying it only to activations is sufficient.
  • The QKV cache states can also be quantized to 3-bit or 4-bit in BitNet v2 with marginal performance impact.

In summary, BitNet v2 provides a practical architecture and training methodology to realize the efficiency benefits of 4-bit activations in 1-bit LLMs by incorporating a Hadamard transformation layer to condition intermediate state distributions for low-bit quantization. This enables significant memory and computational savings, particularly for batched inference on modern hardware.

