
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs

Published 25 Apr 2025 in cs.CL and cs.LG | (2504.18415v2)

Abstract: Efficient deployment of 1-bit LLMs is hindered by activation outliers, which complicate quantization to low bit-widths. We introduce BitNet v2, a novel framework enabling native 4-bit activation quantization for 1-bit LLMs. To tackle outliers in attention and feed-forward network activations, we propose H-BitLinear, a module applying an online Hadamard transformation prior to activation quantization. This transformation smooths sharp activation distributions into more Gaussian-like forms, suitable for low-bit representation. Experiments show BitNet v2 trained from scratch with 8-bit activations matches BitNet b1.58 performance. Crucially, BitNet v2 achieves minimal performance degradation when trained with native 4-bit activations, significantly reducing memory footprint and computational cost for batched inference.

Summary

  • The paper presents a novel Hadamard transformation module that reshapes non-Gaussian intermediate activations for effective 4-bit quantization in 1-bit LLMs.
  • It employs a two-stage training strategy, starting with 8-bit activations and continuing with native 4-bit activations to maintain performance while enhancing efficiency.
  • Experimental results show that the 4-bit variant achieves comparable perplexity and downstream task performance, outperforming other post-training quantization methods.

This paper, "BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs" (2504.18415), introduces a novel framework to address the challenge of quantizing activations in 1-bit LLMs to low bit-widths, specifically 4 bits, for improved inference efficiency on emerging hardware.

The core problem tackled is that while 1.58-bit weights (ternary: -1, 0, 1), as used in BitNet b1.58, significantly reduce memory bandwidth, the models often still rely on 8-bit activations. This limits the full utilization of hardware designed for 4-bit computations, shifting the bottleneck from memory bandwidth to computation. Aggressively quantizing activations to 4 bits is difficult because intermediate states within LLMs (outputs of attention output projection and FFN down projection) often have non-Gaussian distributions with significant outliers, which are challenging for low-bit fixed-point representations.
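To see why a few outliers make low-bit quantization hard, consider a toy per-token absmax quantizer at 4 bits (an illustrative numeric sketch, not code from the paper; the function name and constants are hypothetical):

```python
import numpy as np

def absmax_quant(x, bits=4):
    """Symmetric per-token absmax quantization to `bits` bits (simulated)."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit
    scale = np.max(np.abs(x)) + 1e-8
    q = np.clip(np.round(x / scale * qmax), -qmax - 1, qmax)
    return q * scale / qmax                     # dequantized values

rng = np.random.default_rng(0)
x = rng.normal(size=1024)                       # Gaussian-like activations
x_out = x.copy()
x_out[0] = 50.0                                 # a single large outlier

err_gauss = np.mean((x - absmax_quant(x)) ** 2)
err_outlier = np.mean((x_out - absmax_quant(x_out)) ** 2)
print(err_gauss, err_outlier)                   # the outlier inflates the scale,
                                                # so most values round to zero
```

One extreme value stretches the quantization range, so the remaining values fall into only a few levels, which is exactly the failure mode the Hadamard transformation is meant to avoid.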

BitNet v2 proposes to enable native 4-bit activations across the entire model (except potentially the input/output embeddings). The key innovation is a new module, H-BitLinear, which replaces the standard linear layers for the attention output projection ($\mathbf{W}_\text{o}$) and the FFN down projection ($\mathbf{W}_\text{down}$). The layer applies an online Hadamard transformation to the activations before they are quantized.

The Hadamard transformation $\mathbf{H}_m$ is a $2^m \times 2^m$ orthogonal matrix constructible via a recursive formula. For an input vector $\mathbf{X}$ of size $n = 2^m$, the transformation is $\text{Hadamard}(\mathbf{X}) = \mathbf{H}_m \mathbf{X}$. The paper utilizes a fast Hadamard transform implementation with $\mathcal{O}(n \log n)$ complexity. The purpose of this transformation is to strategically reshape the distribution of the intermediate states.
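The recursive (Sylvester) construction and its fast $\mathcal{O}(n \log n)$ evaluation can be sketched as follows (an illustrative NumPy version, not the paper's implementation):

```python
import numpy as np

def hadamard_matrix(m):
    """Recursive Sylvester construction of the 2^m x 2^m Hadamard matrix."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.block([[H, H], [H, -H]])
    return H

def fwht(x):
    """Fast Walsh-Hadamard transform of a 1-D vector, O(n log n) for n = 2^m."""
    x = np.asarray(x, dtype=float).copy()
    n = x.size
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly step
        h *= 2
    return x

x = np.random.default_rng(1).normal(size=8)
assert np.allclose(fwht(x), hadamard_matrix(3) @ x)  # both compute H_m X
```

Because the (unnormalized) matrix satisfies $\mathbf{H}_m \mathbf{H}_m = n\mathbf{I}$, applying the transform twice and dividing by $n$ recovers the input, which is what makes the cheap backward pass described below possible.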
While the inputs to attention and FFN layers are often naturally Gaussian-like, the intermediate states are characterized by sharp distributions and numerous outliers. The Hadamard transformation smooths these sharp distributions, making them more amenable to low-bit quantization, as illustrated by the activation distribution plots in the paper.

The quantization scheme for BitNet v2 involves:

  • Weights: 1.58-bit ternary quantization to $\{-1, 0, 1\}$ using a per-tensor absolute-mean scaling factor:

    $\text{Q}_{w}(\mathbf{W}) = \alpha\,\text{RoundClip}\left(\frac{\mathbf{W}}{\alpha+\epsilon}, -1, 1\right),\quad \alpha = \text{mean}(|\mathbf{W}|)$

  • Activations:
      • 8-bit activations (for initial training and comparison) use per-token absmax quantization:

        $\text{Q}_{\text{INT8}}(\mathbf{X}) = \frac{\gamma}{127}\,\text{RoundClip}\left(\frac{127}{\gamma+\epsilon}\mathbf{X}, -128, 127\right),\quad \gamma = \max(|\mathbf{X}|)$

      • 4-bit activations (for efficient inference) use per-token absmean quantization:

        $\text{Q}_\text{INT4}(\mathbf{X}) = \frac{\beta}{\sqrt{7}}\,\text{RoundClip}\left(\frac{\sqrt{7}}{\beta+\epsilon}\mathbf{X}, -8, 7\right),\quad \beta = \text{mean}(|\mathbf{X}|)$

The computation within these layers (specifically $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$) is formulated as $\mathbf{Y} = \text{Q}_{w}(\mathbf{W}) \cdot \text{Q}_{\text{INT8/4}}(\mathbf{X}_r)$,
where $\mathbf{X}_r = \text{Hadamard}(\text{LN}(\mathbf{X}))$ and LN is Layer Normalization. For the other linear layers (e.g., $\mathbf{W}_\text{qkv}$, $\mathbf{W}_\text{up,gate}$), the Hadamard transformation is not applied, as their inputs already exhibit better distributions.

BitNet v2 employs a two-stage training strategy. Initially, the model is trained from scratch with 1.58-bit weights and 8-bit activations for most of the training tokens (e.g., 95B). Then it undergoes a continue-training phase with native 4-bit activations for all linear layers (except embeddings) for a much smaller number of tokens (e.g., 5B), reusing the optimizer states. Training utilizes the Straight-Through Estimator (STE) for gradient approximation and mixed-precision updates for the full-precision latent weights. The backward pass through the Hadamard transformation leverages its orthogonality by applying the same transformation to the gradients:

$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \text{Hadamard}\left(\frac{\partial \mathcal{L}}{\partial\,\text{Hadamard}(\mathbf{X})}\right)$
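Putting the pieces together, the H-BitLinear forward pass can be sketched as below. This is a simplified simulated-quantization version in NumPy, assuming the formulas above; the actual implementation would use a fast Hadamard transform and low-bit integer kernels rather than float emulation:

```python
import numpy as np

def round_clip(x, lo, hi):
    return np.clip(np.round(x), lo, hi)

def quant_weights_ternary(W, eps=1e-8):
    """1.58-bit absmean weight quantization: alpha times values in {-1, 0, 1}."""
    alpha = np.mean(np.abs(W))
    return alpha * round_clip(W / (alpha + eps), -1, 1)

def quant_act_int4(X, eps=1e-8):
    """Per-token absmean 4-bit activation quantization (simulated)."""
    beta = np.mean(np.abs(X), axis=-1, keepdims=True)
    s = np.sqrt(7)
    return (beta / s) * round_clip((s / (beta + eps)) * X, -8, 7)

def layer_norm(X, eps=1e-5):
    mu = X.mean(-1, keepdims=True)
    var = X.var(-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def hadamard(X):
    """Hadamard transform of each row via the explicit matrix (fine for a sketch)."""
    n = X.shape[-1]
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return X @ H  # H is symmetric, so this applies H to each row

def h_bitlinear(X, W):
    """Y = Q_w(W) . Q_INT4(Hadamard(LN(X))) -- simulated quantization."""
    Xr = hadamard(layer_norm(X))
    return quant_act_int4(Xr) @ quant_weights_ternary(W).T

X = np.random.default_rng(2).normal(size=(4, 16))   # 4 tokens, hidden size 16
W = np.random.default_rng(3).normal(size=(16, 16))
Y = h_bitlinear(X, W)
print(Y.shape)  # (4, 16)
```

Only $\mathbf{W}_\text{o}$ and $\mathbf{W}_\text{down}$ would use this path; the other projections quantize their (already well-behaved) inputs directly, without the Hadamard step.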

Experimental results demonstrate the effectiveness of BitNet v2:

  • BitNet v2 trained with 8-bit activations (BitNet v2 (a8)) achieves performance comparable to or slightly better than BitNet b1.58 (W1.58A8), indicating that the insertion of the Hadamard transformation is not detrimental.
  • The 4-bit activation variant (BitNet v2 (a4)), obtained after continue-training, shows minimal performance degradation compared to its 8-bit counterpart and performs comparably to BitNet a4.8 (W1.58A4/A8 hybrid with sparsification) on perplexity and downstream tasks. Crucially, BitNet v2 (a4) uses dense 4-bit computations, making it more efficient for batched inference on hardware supporting native 4-bit operations compared to methods involving sparsification.
  • BitNet v2 (a4) significantly outperforms post-training quantization methods like SpinQuant and QuaRot when applied to BitNet b1.58 to achieve W1.58A4.
  • Ablation studies confirm that the Hadamard transformation is necessary for stable training with low-bit activations for intermediate states and that applying it only to activations is sufficient.
  • The QKV cache states can also be quantized to 3-bit or 4-bit in BitNet v2 with marginal performance impact.

In summary, BitNet v2 provides a practical architecture and training methodology to realize the efficiency benefits of 4-bit activations in 1-bit LLMs by incorporating a Hadamard transformation layer to condition intermediate state distributions for low-bit quantization. This enables significant memory and computational savings, particularly for batched inference on modern hardware.

