- The paper presents a novel Hadamard transformation module that reshapes non-Gaussian intermediate activations for effective 4-bit quantization in 1-bit LLMs.
- It employs a two-stage training strategy, starting with 8-bit activations and continuing with native 4-bit activations to maintain performance while enhancing efficiency.
- Experimental results show that the 4-bit variant achieves perplexity and downstream task performance comparable to its 8-bit counterpart, while outperforming other post-training quantization methods.
This paper, "BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs" (arXiv:2504.18415), introduces a novel framework to address the challenge of quantizing activations in 1-bit LLMs to low bit-widths, specifically 4 bits, for improved inference efficiency on emerging hardware.
The core problem tackled is that while 1.58-bit (ternary: -1, 0, 1) weights, as used in BitNet b1.58, significantly reduce memory bandwidth, the models often still rely on 8-bit activations. This prevents full utilization of hardware designed for native 4-bit computation, leaving computation as the bottleneck once memory bandwidth is no longer limiting. Aggressively quantizing activations to 4 bits is difficult because the intermediate states within LLMs (the inputs to the attention output projection and the FFN down projection) often have non-Gaussian distributions with significant outliers, which are poorly suited to low-bit fixed-point representation.
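To make the outlier problem concrete, here is a minimal NumPy sketch (illustrative only; the function name and setup are my own, not from the paper) of a symmetric absmax 4-bit quantizer. A single large outlier inflates the scale, collapsing most other values toward zero and blowing up the reconstruction error:

```python
import numpy as np

def quantize_absmax(x, bits=4):
    # Symmetric absmax quantization: scale so the largest |value| maps
    # to the edge of the signed integer range, then round and clip.
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -(qmax + 1), qmax)
    return q * scale                      # dequantized values

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 4096)

# Well-behaved Gaussian activations: modest quantization error.
err_gaussian = np.mean((x - quantize_absmax(x)) ** 2)

# Inject a single large outlier, as seen in intermediate states.
x_outlier = x.copy()
x_outlier[0] = 50.0
err_outlier = np.mean((x_outlier - quantize_absmax(x_outlier)) ** 2)

print(err_gaussian, err_outlier)
```

With the outlier present, the scale becomes 50/7 ≈ 7.1, so nearly every ordinary value rounds to zero; this is exactly the failure mode the Hadamard transformation is meant to prevent.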
BitNet v2 proposes to enable native 4-bit activations across the entire model (except the input/output embeddings). The key innovation is a new module, H-BitLinear, which replaces the standard linear layers for the attention output projection ($\mathbf{W}_\text{o}$) and the FFN down projection ($\mathbf{W}_\text{down}$). The layer applies an online Hadamard transformation to the activations *before* they are quantized.

The Hadamard transformation $\mathbf{H}_m$ is a $2^m \times 2^m$ orthogonal matrix constructible via a recursive formula. For an input vector $\mathbf{X}$ of size $n = 2^m$, the transformation is $\text{Hadamard}(\mathbf{X}) = \mathbf{H}_m \mathbf{X}$. The paper utilizes a fast Hadamard transform implementation with $\mathcal{O}(n \log n)$ complexity. The purpose of this transformation is to strategically reshape the distribution of the intermediate states: while the inputs to the attention and FFN layers are often naturally Gaussian-like, the intermediate states are characterized by sharp distributions and numerous outliers. The Hadamard transformation smooths these sharp distributions, making them more amenable to low-bit quantization, as illustrated by the activation distribution plots in the paper.

The quantization scheme for BitNet v2 involves:

- **Weights:** 1.58-bit ternary quantization to $\{-1, 0, 1\}$ using a per-tensor absolute-mean scaling factor:

  $$\text{Q}_w(\mathbf{W}) = \alpha\,\text{RoundClip}\!\left(\frac{\mathbf{W}}{\alpha+\epsilon}, -1, 1\right), \quad \alpha = \text{mean}(|\mathbf{W}|)$$

- **Activations:**
  - 8-bit activations (for the initial training stage and for comparison) use per-token absmax quantization:

    $$\text{Q}_\text{INT8}(\mathbf{X}) = \frac{\gamma}{127}\,\text{RoundClip}\!\left(\frac{127}{\gamma+\epsilon}\mathbf{X}, -128, 127\right), \quad \gamma = \max(|\mathbf{X}|)$$

  - 4-bit activations (for efficient inference) use per-token absmean quantization:

    $$\text{Q}_\text{INT4}(\mathbf{X}) = \frac{\beta}{\sqrt{7}}\,\text{RoundClip}\!\left(\frac{\sqrt{7}}{\beta+\epsilon}\mathbf{X}, -8, 7\right), \quad \beta = \text{mean}(|\mathbf{X}|)$$

The computation within these layers (specifically $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$) is formulated as $\mathbf{Y} = \text{Q}_w(\mathbf{W}) \cdot \text{Q}_\text{INT8/4}(\mathbf{X}_r)$, where $\mathbf{X}_r = \text{Hadamard}(\text{LN}(\mathbf{X}))$ and LN is Layer Normalization. For the other layers (e.g., $\mathbf{W}_\text{qkv}$, $\mathbf{W}_\text{up,gate}$), the Hadamard transformation is not applied, since their inputs already exhibit quantization-friendly distributions.

BitNet v2 employs a two-stage training strategy. Initially, the model is trained from scratch with 1.58-bit weights and 8-bit activations for the bulk of the training tokens (e.g., 95B). It then undergoes a continue-training phase with native 4-bit activations for all linear layers (except the embeddings) on a much smaller number of tokens (e.g., 5B), reusing the optimizer states. Training uses the Straight-Through Estimator (STE) for gradient approximation and mixed-precision updates with full-precision latent weights. The backward pass through the Hadamard transformation leverages its orthogonality by applying the same transformation to the gradients:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \text{Hadamard}\!\left(\frac{\partial \mathcal{L}}{\partial\,\text{Hadamard}(\mathbf{X})}\right)$$
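The quantizers and the fast transform above can be sketched in plain NumPy as follows. This is an illustrative reference implementation under my own naming, not the paper's code: the paper's kernels are optimized for GPU, LayerNorm is omitted here, and the normalization of the Hadamard transform is my assumption (chosen so the transform is orthogonal, matching the gradient identity above):

```python
import numpy as np

def hadamard(x):
    """Fast Walsh-Hadamard transform along the last axis, O(n log n).
    Normalized by 1/sqrt(n) so the transform is orthogonal and hence
    its own inverse, matching the backward-pass identity in the text."""
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    y = x.astype(np.float64).copy()
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)

def quantize_weights_ternary(w, eps=1e-6):
    # 1.58-bit weights: per-tensor absmean scale, values in {-1, 0, 1}.
    alpha = np.mean(np.abs(w))
    return alpha * np.clip(np.round(w / (alpha + eps)), -1, 1)

def quantize_act_int8(x, eps=1e-6):
    # Per-token absmax INT8: gamma = max|x| along the feature axis.
    gamma = np.max(np.abs(x), axis=-1, keepdims=True)
    return (gamma / 127) * np.clip(np.round(127 * x / (gamma + eps)), -128, 127)

def quantize_act_int4(x, eps=1e-6):
    # Per-token absmean INT4: beta = mean|x| along the feature axis.
    beta = np.mean(np.abs(x), axis=-1, keepdims=True)
    s = np.sqrt(7)
    return (beta / s) * np.clip(np.round(s * x / (beta + eps)), -8, 7)

def h_bitlinear(x, w):
    # Sketch of the H-BitLinear forward path for W_o / W_down
    # (LayerNorm omitted): rotate with Hadamard, quantize, then matmul.
    x_r = hadamard(x)
    return quantize_act_int4(x_r) @ quantize_weights_ternary(w).T
```

Because the normalized Hadamard matrix is symmetric and orthogonal, `hadamard` is involutive, which is exactly why the backward pass can reuse the forward transform on the incoming gradients.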
Experimental results demonstrate the effectiveness of BitNet v2:
- BitNet v2 trained with 8-bit activations (BitNet v2 (a8)) achieves performance comparable to or slightly better than BitNet b1.58 (W1.58A8), indicating that the insertion of the Hadamard transformation is not detrimental.
- The 4-bit activation variant (BitNet v2 (a4)), obtained after continue-training, shows minimal performance degradation compared to its 8-bit counterpart and performs comparably to BitNet a4.8 (W1.58A4/A8 hybrid with sparsification) on perplexity and downstream tasks. Crucially, BitNet v2 (a4) uses dense 4-bit computations, making it more efficient for batched inference on hardware supporting native 4-bit operations compared to methods involving sparsification.
- BitNet v2 (a4) significantly outperforms post-training quantization methods like SpinQuant and QuaRot when applied to BitNet b1.58 to achieve W1.58A4.
- Ablation studies confirm that the Hadamard transformation is necessary for stable training with low-bit activations for intermediate states and that applying it only to activations is sufficient.
- The QKV cache states can also be quantized to 3-bit or 4-bit in BitNet v2 with marginal performance impact.
In summary, BitNet v2 provides a practical architecture and training methodology to realize the efficiency benefits of 4-bit activations in 1-bit LLMs by incorporating a Hadamard transformation layer to condition intermediate state distributions for low-bit quantization. This enables significant memory and computational savings, particularly for batched inference on modern hardware.