NormalFloat-4 (NF4): 4-bit Quantization
- NF4 is a 4-bit quantization method that uses fixed codebooks based on Gaussian quantiles to compress large language model weights with high fidelity.
- It employs a table-driven decoding process without explicit IEEE bit field partitioning, optimizing quantization error through nonuniform bin spacing.
- Empirical benchmarks show NF4 reduces memory footprint and preserves accuracy, making it ideal for memory-limited, GPU-centric deployments.
NormalFloat-4 (NF4) denotes a fixed codebook-based 4-bit floating-point quantization scheme engineered for compressing neural network weights—especially those in LLMs—by providing a data-efficient, distribution-matched, nonuniform mapping that prioritizes high fidelity in regions of statistical concentration. Unlike reduced IEEE floating-point types, NF4 forgoes explicit field partitioning (sign, exponent, mantissa) in favor of table-driven decoding, with codebook entries centered on equiprobable quantiles of the standard normal distribution. NF4 is algorithmically straightforward, exploits the near-Gaussian distributional structure prevalent in pretrained model parameters, and has demonstrated substantial accuracy advantages over uniform integer and canonical FP4 approaches. While not hardware-native, NF4 remains a leading software-backed format for memory-limited, GPU-centric deployments.
1. Mathematical Definition and Core Principles
NF4 represents each quantized value by a 4-bit integer index into a table of 16 floating-point reproduction values c_0 < c_1 < … < c_15. The codebook entries are the midpoints of 16 equiprobable bins under the standard normal CDF Φ, rescaled to the range [−1, 1], with one location pinned to exact zero (c_7 = 0), yielding seven negative and eight positive quantiles plus zero. The canonical construction (before pinning zero) is

c_i = Φ⁻¹((i + 1/2) / 16) / Φ⁻¹(31/32),  i = 0, …, 15,

where Φ⁻¹ is the inverse CDF of the standard normal distribution. The codebook is fixed per tensor/model, requires no runtime calibration, and is tailored to reflect the statistical mass of normally distributed weights (Dettmers et al., 2023).
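The quantile-midpoint construction above can be sketched in Python with only the standard library (`statistics.NormalDist`). Note this is an illustrative approximation: the official bitsandbytes table is derived with a slightly different asymmetric recipe, so the exact values differ a little.

```python
from statistics import NormalDist

def nf4_codebook():
    """Sketch of an NF4-style codebook: midpoints of 16 equiprobable
    bins of the standard normal, rescaled to [-1, 1], with the entry
    nearest zero pinned to exactly 0.0. Illustrative only; the official
    bitsandbytes table uses a slightly different asymmetric construction."""
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
    scale = max(abs(v) for v in raw)
    code = sorted(v / scale for v in raw)
    j = min(range(16), key=lambda i: abs(code[i]))  # pin zero exactly
    code[j] = 0.0
    return code

codebook = nf4_codebook()
```

The resulting 16 entries cluster densely around zero and spread out toward ±1, mirroring the Gaussian mass they are meant to cover.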
Unlike low-bit IEEE variants (INT4, FP4/E2M1) that partition bits and use geometric (floating-point-like) or uniform intervals, NF4’s nonuniform code spacing matches the empirical probability density, enabling finer granularity near zero—the region of maximum parameter concentration—and coarser resolution in the tails.
2. Quantization and Dequantization Workflow
NF4 quantization employs group-wise or block-wise scaling. For a group g of weights w_1, …, w_B (block size B), the process is:
- Scaling: x_i = w_i / s_g.
Typically, symmetric absmax scaling with s_g = max_i |w_i| is used, so that x_i ∈ [−1, 1]. Alternative asymmetric variants derive a zero-point from the group's minimum value.
- Quantization: idx_i = argmin_j |x_i − c_j|.
Each scaled weight is assigned to the nearest code in the codebook.
- Dequantization: ŵ_i = s_g · c_{idx_i}.
The encoded index is mapped back onto the original scale by table lookup and scaling.
Block sizes of 64–128 are typical, offering low overhead for side scales and high fidelity. No calibration pass is required for the codebook—clipping of out-of-range values is handled by scale selection and hard assignment to min/max codebook entries (Zhao et al., 2023, Elhoushi et al., 7 Jul 2025).
Pseudocode excerpt:
```
for block in blocks:
    scale = max(abs(block))
    for w in block:
        normalized = clamp(w / scale, -1, 1)
        idx = argmin_i(abs(normalized - c[i]))
        quantized_vals.append(idx)
    scales.append(scale)
```
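The same workflow as runnable Python, using a quantile-midpoint approximation of the NF4 codebook (stdlib only; the official bitsandbytes table differs slightly):

```python
from statistics import NormalDist

# Approximate NF4-style codebook (quantile midpoints, rescaled to [-1, 1]).
_nd = NormalDist()
_raw = [_nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
_s = max(abs(v) for v in _raw)
CODE = sorted(v / _s for v in _raw)

def quantize_block(block):
    """Absmax-scale a block and map each weight to its nearest code index."""
    scale = max(abs(w) for w in block) or 1.0     # guard all-zero blocks
    idxs = []
    for w in block:
        x = max(-1.0, min(1.0, w / scale))        # clamp to codebook range
        idxs.append(min(range(16), key=lambda i: abs(x - CODE[i])))
    return idxs, scale

def dequantize_block(idxs, scale):
    """Table lookup followed by rescaling back to the original magnitude."""
    return [scale * CODE[i] for i in idxs]

weights = [0.31, -0.07, 0.002, -0.55, 0.12, -0.9, 0.44, 0.01]
idxs, scale = quantize_block(weights)
recon = dequantize_block(idxs, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

The round-trip error per weight is bounded by half the widest codebook gap times the block scale, which is why tight (small) blocks yield higher fidelity.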
3. Statistical Foundations and Information Theoretic Rationale
NF4’s optimality is predicated on the observation that pretrained transformer weights are close to Gaussian distributed (Dettmers et al., 2023). By allocating 4-bit codebook levels to the quantile midpoints, the scheme minimizes the mean squared quantization error (MSE) for i.i.d. normal data.
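The MSE advantage over a uniform 4-bit grid can be checked with a small Monte Carlo sketch. Both codebooks below span [−1, 1] and share the same absmax scaling; the NF4 codebook is again the quantile-midpoint approximation, not the exact bitsandbytes table:

```python
import random
from statistics import NormalDist

random.seed(0)
nd = NormalDist()

# NF4-style codebook (quantile midpoints of the standard normal).
raw = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
s = max(abs(v) for v in raw)
NF4 = sorted(v / s for v in raw)
# Uniform 4-bit codebook over the same [-1, 1] range.
UNI = [-1.0 + 2.0 * i / 15 for i in range(16)]

def mse(code, xs, scale):
    """Mean squared reconstruction error under absmax scaling."""
    err = 0.0
    for x in xs:
        t = max(-1.0, min(1.0, x / scale))
        q = min(code, key=lambda c: abs(t - c))
        err += (x - scale * q) ** 2
    return err / len(xs)

xs = [random.gauss(0.0, 1.0) for _ in range(4096)]
scale = max(abs(x) for x in xs)          # absmax scaling, as in NF4 blocks
mse_nf4, mse_uni = mse(NF4, xs, scale), mse(UNI, xs, scale)
```

On i.i.d. Gaussian samples the nonuniform codebook produces a noticeably lower MSE than the uniform grid, because most of the probability mass falls where its levels are densest.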
However, recent analysis indicates that blockwise absmax normalization alters the input distribution: the actual normalized weights delivered to the quantizer (w/s for block scale s = max |w| over the block) follow a distribution sharply peaked around zero, more so for large block sizes. Consequently, the true optimal codebook must account for this effect, especially as block size increases (Yoshida, 2023, Blumenberg et al., 10 May 2025). AbnormalFloat-4 (AF4) and block-wise optimal float (BOF4/BOF4-S) are subsequent developments that minimize quantization error by adaptively optimizing codebook locations over the actual block-normalized distribution, surpassing NF4 in certain metrics.
Nevertheless, for moderate block sizes (64–128), the degradation relative to NF4 remains minimal; NF4 retains practical near-optimality, as empirically confirmed (Yoshida, 2023).
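The block-size effect described above is easy to reproduce: because the block maximum grows with block size, absmax-normalized Gaussian weights concentrate closer to zero as blocks get larger. A minimal stdlib simulation:

```python
import random

random.seed(1)

def mean_abs_normalized(block_size, n_blocks=200):
    """Average |w / absmax(block)| over many i.i.d. Gaussian blocks."""
    total, count = 0.0, 0
    for _ in range(n_blocks):
        block = [random.gauss(0.0, 1.0) for _ in range(block_size)]
        scale = max(abs(w) for w in block)
        total += sum(abs(w) / scale for w in block)
        count += block_size
    return total / count

small, large = mean_abs_normalized(64), mean_abs_normalized(1024)
```

The mean normalized magnitude shrinks as the block grows, which is exactly the distribution shift AF4 and BOF4 exploit when re-optimizing the codebook.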
4. Empirical Performance and Comparative Benchmarks
NF4 delivers superior language modeling accuracy and downstream task metrics compared to INT4 and FP4. Selected results:
- LLM (Llama3-8B, C4 perplexity, block size 128):
| Format | C4 Perplexity |
|--------|---------------|
| FP16   | 8.93  |
| INT4   | 9.89  |
| FP4    | 10.22 |
| NF4    | 9.52  |
| any4   | 9.40  |
NF4 closes most of the gap to full precision; any4 improves further by 0.1–0.2 perplexity points (Elhoushi et al., 7 Jul 2025).
- ASR (Whisper, LibriSpeech, blockwise NF4):
| Model            | WER (%) |
|------------------|---------|
| Whisper-FP32     | 10.02 |
| Whisper-NF4      | 11.22 |
| +LoRA Adaptation | 8.51  |
NF4 quantization followed by LoRA enables substantial parameter reduction with negligible, sometimes negative, accuracy loss (Zhao et al., 2023).
- Hardware metrics (A100, Llama2-7B, GPU memory):
| Format | Footprint (GB) |
|--------|----------------|
| NF4    | 4.58  |
| FP4    | 4.58  |
| INT8   | 7.82  |
| BF16   | 13.53 |
Inference speed for NF4 approaches BF16, outperforming INT8 (Roy, 2023).
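A back-of-envelope check of the weights-only footprint is consistent with these figures. Assuming roughly 6.74B parameters for Llama2-7B, fp16 scales, and block size 64 (the measured table values additionally include CUDA context and activation buffers, hence the gap):

```python
def quantized_weight_bytes(n_params, bits=4, block=64, scale_bytes=2):
    """Weights-only footprint: packed low-bit codes plus one scale per block."""
    code_bytes = n_params * bits / 8
    n_blocks = n_params / block
    return code_bytes + n_blocks * scale_bytes

n = 6.74e9                                   # approx. Llama2-7B parameter count
nf4_gb = quantized_weight_bytes(n) / 1e9     # ~3.6 GB of weights + scales
bf16_gb = n * 2 / 1e9                        # ~13.5 GB at 16 bits per weight
```

The scale side-channel adds only ~6% overhead at block size 64, which is why blockwise NF4 stays close to the ideal 4× reduction from 16-bit formats.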
5. Implementation Details and Hardware Considerations
NF4 is predominantly software-implemented. Per-block scale factors are stored as float16/float32 (one scale per block of 64–128 weights), and quantized codes in packed 4-bit arrays. Dequantization at inference requires a table lookup (c[idx]) and a fused multiply-add (FMA) per weight. SIMD vectorization mitigates latency bottlenecks.
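The packed 4-bit storage can be illustrated with a minimal nibble packer (two code indices per byte, low nibble first; real GPU kernels use vectorized layouts, but the bit arithmetic is the same):

```python
def pack_nibbles(idxs):
    """Pack 4-bit code indices two-per-byte (low nibble first)."""
    if len(idxs) % 2:
        idxs = idxs + [0]                    # pad odd-length input
    return bytes((idxs[i] & 0xF) | ((idxs[i + 1] & 0xF) << 4)
                 for i in range(0, len(idxs), 2))

def unpack_nibbles(buf, n):
    """Recover n 4-bit indices from a packed byte string."""
    out = []
    for b in buf:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out[:n]

codes = [0, 15, 7, 8, 3, 12, 1]
packed = pack_nibbles(codes)
```

Seven codes fit in four bytes, versus seven bytes unpacked, giving the expected ~2× saving over byte-per-code storage on top of the 4-bit encoding itself.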
Memory footprint reduction is significant (7× for weights, and 3× for scales when combined with double quantization (Dettmers et al., 2023)). Hardware, however, poses limitations: NF4 necessitates floating-point multipliers and accumulators (FP16/FP32) for the decode path; area and power thus exceed integer-centric formats (INT4, E2M1). NF4 is therefore best suited for GPU-bound workloads where memory bandwidth trumps area, as opposed to ASICs optimized for low-bit integer MACs (Dotzel et al., 2024).
6. Limitations, Controversies, and Subsequent Developments
NF4’s codebook, fixed by a Gaussian prior, does not adapt to per-layer/layerwise distributional shifts, nor does it optimally match blockwise normalized weight distributions when block size deviates from design assumptions. This has led to criticisms and refinements:
- Yoshida (Yoshida, 2023) demonstrated that NF4’s supposed “information-theoretic” optimality breaks under blockwise scaling; AF4 adapts codebooks to actual block-normalized distributions, lowering expected L₁ error especially at large block sizes.
- BOF4, BOF4-S, and OPQ further optimize codebook and normalization for MSE/MAE under empirical and theoretical distributions, systematically outperforming NF4 for large blocks and mixed-precision scenarios (Blumenberg et al., 10 May 2025).
- StudentFloat-4 (SF4), utilizing t-distribution quantiles rather than normal, better covers the heavier tails observed in empirical parameter statistics, yielding more robust accuracy for certain LLMs (Dotzel et al., 2024).
A plausible implication is that NF4 remains empirically effective for block sizes set at 64–128 (default in LLMs), while newer formats (AF4, BOF4, SF4) should be preferred when hardware limitations, model structure, or distributional properties diverge.
7. Practical Recommendations and Future Outlook
- NF4 is advised for GPU-based inferencing and finetuning with weights near normally distributed, and where deployment memory is constrained (Zhao et al., 2023, Dettmers et al., 2023, Elhoushi et al., 7 Jul 2025).
- For generative tasks on LLMs, use a moderate decoding temperature (around 0.5) to minimize repetition under NF4 quantization (Roy, 2023).
- In ASIC or other memory-efficient hardware, integer-like formats (E2M1, BOF4-S, E2M1+SP) often offer nearly equivalent accuracy at vastly reduced area and power, especially as block sizes increase or tail distributions become heavier (Dotzel et al., 2024, Blumenberg et al., 10 May 2025).
- Combining blockwise optimal quantization, signed normalization, and outlier-preserving mixed precision (OPQ) can close most fidelity gaps to full precision for arbitrarily large blocks.
NF4 established the principle of distribution-matched, table-driven 4-bit quantization for LLM deployment. Further advances build on this foundation, refining codebook optimization and normalization, and extending applicability to heterogeneous hardware and non-Gaussian statistical regimes.