
NVFP4: 4-bit Lower-Precision Format

Updated 2 February 2026
  • NVFP4 is a 4-bit hardware-accelerated floating point data format that compresses and computes large-scale neural networks efficiently.
  • It uses an E2M1 core with per-block E4M3 scaling and optional FP32 global scaling to achieve high dynamic range and reduced quantization error.
  • NVFP4 delivers up to 6× arithmetic throughput and 50% memory reduction while maintaining near-FP16 accuracy in both training and inference.


NVFP4 is a hardware-accelerated, block-microscaled 4-bit floating-point data format that enables efficient, highly compressed storage and computation for large-scale deep neural networks, particularly LLMs. Backed by native support on NVIDIA Blackwell Tensor Cores, NVFP4 combines aggressive element-level quantization (an E2M1 floating-point core, 4 bits/element) with local per-block floating-point scaling (E4M3, 8 bits per 16-element block) and an optional global floating-point scale, yielding favorable dynamic range, robust numerical stability, and practical accuracy in both training and inference settings (Chmiel et al., 25 May 2025, Cook et al., 1 Dec 2025, Chen et al., 31 Oct 2025, Meng et al., 12 Jan 2026, Panferov et al., 30 Jan 2026, Xin et al., 27 Jan 2026).

1. Format Definition and Bit-Level Structure

NVFP4 uses a “microscaled” mini-float architecture designed for dense hardware efficiency and high representational fidelity under stringent memory constraints. Each tensor is partitioned into blocks of 16 elements, each encoded as follows:

  • FP4 Core (E2M1):
    • 1 sign bit, 2 exponent bits (bias = 1), 1 mantissa bit
    • Representable set: x \in \{\pm 0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\}
    • Value decoding: x = (-1)^s \, 2^{e-1} (1 + m \cdot 2^{-1}) for normalized e > 0; the subnormal case e = 0 decodes as x = (-1)^s \, m \cdot 2^{-1}, which supplies the \pm 0.5 values
  • Block Scale (E4M3 FP8):
    • Shared for 16 elements
    • 1 sign bit (usually zero), 4-bit exponent (bias = 7), 3-bit mantissa
    • Decodes as s = 2^{E-7} (1 + M/8); supports scales \approx [1/128, 448]
  • Optional global tensor scale: Full-precision (FP32)
  • Block packing: Each 16-element block uses 64 bits (4b × 16) + 8 bits (scale) = 72 bits, or 4.5 bits/element.

This structure ensures that every block achieves local dynamic-range adaptation without the coarseness or loss typical of power-of-two only scaling (as in MXFP4) (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025, Egiazarian et al., 27 Sep 2025, Cook et al., 1 Dec 2025).
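As a concrete illustration, the bit-level decode rules above can be sketched in a few lines of Python. The function names are illustrative, not part of any NVIDIA API, and the E4M3 decoder omits NaN handling for brevity:

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    s = (code >> 3) & 0x1
    e = (code >> 1) & 0x3
    m = code & 0x1
    if e == 0:
        mag = m * 0.5                      # subnormal: m * 2^(-1)
    else:
        mag = (2 ** (e - 1)) * (1 + m * 0.5)  # normalized: 2^(e-1) * (1 + m/2)
    return -mag if s else mag

def decode_e4m3_scale(code: int) -> float:
    """Decode an 8-bit E4M3 block scale: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    E = (code >> 3) & 0xF
    M = code & 0x7
    if E == 0:
        return (M / 8.0) * 2 ** (-6)       # subnormal
    return (2 ** (E - 7)) * (1 + M / 8.0)

# All eight non-negative E2M1 magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
print(sorted({abs(decode_e2m1(c)) for c in range(16)}))
```

Enumerating all 16 codes reproduces exactly the representable set listed above, confirming that E2M1 carries only eight distinct magnitudes.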

2. Quantization and Dequantization Algorithms

a. Standard NVFP4 Quantization

Given a real-valued tensor XX:

  1. Global scale (optional): S = \max_i |X_i| / (M_{\rm FP4}\, M_{\rm FP8}), where M_{\rm FP4} = 6 and M_{\rm FP8} \approx 448.
  2. Block partition: Split XX into blocks of 16 elements.
  3. Per-block scale: For block b, s_b = \max_{i \in b} |X_i| / (M_{\rm FP4}\, S), quantized to E4M3 (the global scale S ensures s_b fits the E4M3 range).
  4. Block-wise quantization: For each xix_i in block bb:

r_i = \frac{x_i}{S \cdot s_b}

Round r_i to the nearest FP4 value (or use stochastic rounding for gradients) and store it as the 4-bit code q_i.

  5. Dequantization: At GEMM time, recover x_i \approx q_i \cdot s_b \cdot S.

The per-block scaling limits the impact of local outliers and avoids the quantization coarseness of larger groupings, with empirical results favoring group size 16 (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025).
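The five steps can be sketched end-to-end in NumPy. For clarity the per-block scale is kept in full precision rather than actually quantized to E4M3, so this is an idealized round-trip rather than a bit-exact implementation:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
M_FP4, M_FP8 = 6.0, 448.0

def nvfp4_quant_dequant(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize a 1-D tensor to NVFP4 (round-to-nearest) and dequantize it back."""
    S = np.abs(x).max() / (M_FP4 * M_FP8)                          # 1. global scale
    blocks = x.reshape(-1, block)                                  # 2. block partition
    s_b = np.abs(blocks).max(axis=1, keepdims=True) / (M_FP4 * S)  # 3. per-block scale
    s_b = np.where(s_b == 0, 1.0, s_b)                             #    guard all-zero blocks
    r = blocks / (S * s_b)                                         # 4. scale into [-6, 6]
    idx = np.abs(np.abs(r)[..., None] - FP4_GRID).argmin(axis=-1)  #    round-to-nearest
    q = np.sign(r) * FP4_GRID[idx]                                 #    onto the E2M1 grid
    return (q * s_b * S).reshape(x.shape)                          # 5. dequantize

x = np.random.randn(64).astype(np.float32)
err = np.abs(nvfp4_quant_dequant(x) - x).max()
```

Because the block maximum always maps exactly onto the grid value 6, the worst-case absolute error per block is bounded by half the largest grid gap times the block's scale, i.e. roughly amax_b / 6.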

b. Adaptive Quantization and Error Mitigation

Beyond the standard scheme, adaptive variants target the error that concentrates on near-maximal values: 4/6 adaptive block scaling selects the per-block scale target adaptively, and residual-channel schemes such as ARCQuant encode outlier dimensions in additional NVFP4 channels (both discussed in Section 5).

3. Rounding Schemes and Noise Analysis

NVFP4 deployments employ a hybrid of deterministic and unbiased quantization:

  • Forward pass (inference and training): Deterministic round-to-nearest (RtN) for weights and activations. This minimizes quantization variance and is stable under repeated dot-product accumulation.
  • Backward and update GEMMs (training): Stochastic rounding (SR) applied to gradients and update-activation tensors. SR ensures each quantized value is an unbiased estimator of the source, crucial for unbiased SGD (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025).

Quantization noise under SR is modeled as zero-mean with variance \sigma_q^2 \approx \Delta^2/12 per block, where \Delta is the local quantization step set by the block scale. Theoretical analysis shows that effective training in NVFP4 is possible as long as gradient norms remain above \sqrt{3}\,\sigma_q/\sqrt{d}; once this threshold is crossed, switching to BF16 or FP8 is recommended (Chmiel et al., 25 May 2025).
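A toy experiment makes the unbiasedness claim concrete: stochastically rounding 0.3 onto a grid with step 0.5 has expectation 0.3, whereas round-to-nearest always returns 0.5. The grid and value are illustrative, not taken from the cited papers:

```python
import random

def stochastic_round(x: float, step: float = 0.5) -> float:
    """Round x to a multiple of `step`, up with probability equal to the fractional part."""
    lo = (x // step) * step
    p = (x - lo) / step            # probability of rounding up
    return lo + step if random.random() < p else lo

random.seed(0)
n = 100_000
sr_mean = sum(stochastic_round(0.3) for _ in range(n)) / n
# E[SR(0.3)] = 0.3, whereas round-to-nearest maps 0.3 to 0.5 every time
```

The empirical mean converges to 0.3, which is why SR keeps the quantized gradient an unbiased estimator of the true gradient.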

Recent advances (MS-EDEN, Quartet II (Panferov et al., 30 Jan 2026)) achieve unbiased quantization with roughly half the mean squared error of elementwise SR by applying group-scale corrections after a random Hadamard transform (RHT) and re-quantizing the scales with stochastic rounding.

4. Empirical Performance and Model Training

Training Stability and Accuracy: End-to-end NVFP4 pretraining, using round-to-nearest forward passes and stochastically rounded gradient GEMMs, maintains near-FP16/BF16 accuracy (see the rounding analysis in Section 3).

Efficiency: NVFP4 delivers up to 6× arithmetic throughput and roughly 50% memory reduction relative to higher-precision baselines, with NVFP4-centric kernels typically memory-bound rather than compute-bound (Section 6).

Advanced Techniques: Random Hadamard transforms, unbiased group-scale quantizers such as MS-EDEN, and selective retention of higher-precision layers further stabilize training at 4 bits (Sections 3 and 7).

5. Post-Training Quantization, Distillation, and Residual Compensation

Post-Training Quantizers and Compensation:

  • NVFP4 supports classic PTQ (e.g., GPTQ, AWQ, SmoothQuant), but unique error profiles (notably, increased rounding error for near-maximal values) motivate new algorithms:
    • 4/6 adaptive block scaling (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026) directly mitigates large-value error, improving perplexity and downstream accuracy in both pre-training and PTQ.
    • ARCQuant (Meng et al., 12 Jan 2026) introduces Augmented Residual Channels: outlier dimensions are detected and encoded in additional residual “channels” (also NVFP4), enabling error-compensated GEMM computation with minimal latency increase.
    • MR-GPTQ (Egiazarian et al., 27 Sep 2025) fuses small-block Hadamard transformations and blocked grid-search for scale optimization, reducing quantization-induced MSE.
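One plausible reading of the 4/6 scheme above can be sketched as follows: for each block, try mapping the block maximum to 6 or to 4 on the E2M1 grid and keep whichever scale yields the lower squared error. The exact selection rule in the cited work may differ; this is illustrative only:

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(r: np.ndarray) -> np.ndarray:
    """Round-to-nearest onto the signed E2M1 value grid."""
    idx = np.abs(np.abs(r)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(r) * FP4_GRID[idx]

def quant_block_46(block: np.ndarray) -> np.ndarray:
    """Quantize one block, choosing the scale target (6 or 4) with lower MSE."""
    amax = np.abs(block).max()
    if amax == 0:
        return block.copy()
    best, best_err = None, np.inf
    for target in (6.0, 4.0):            # the "4/6" candidate mappings
        s = amax / target                # block maximum lands exactly on `target`
        q = round_to_grid(block / s) * s
        err = np.sum((q - block) ** 2)
        if err < best_err:
            best, best_err = q, err
    return best
```

Mapping the maximum to 4 narrows the effective range but shrinks the large 4-to-6 grid gap that dominates rounding error for near-maximal values, which is the error profile the 4/6 technique targets.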

Distillation for Accuracy Recovery:

  • Quantization-aware distillation (QAD) (Xin et al., 27 Jan 2026) recovers or surpasses strong BF16 baselines by training the quantized model to match a reference full-precision model’s output distribution (via KL divergence loss), robustly bridging the small observed gaps left by even aggressive PTQ (including on multi-stage SFT+RL pipelines and vision-LLMs).
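The distillation objective can be sketched as a plain KL loss between teacher and student token distributions. This NumPy forward pass shows only the loss computation, not the training pipeline described in the paper:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_distill_loss(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Mean KL(teacher || student) over positions; teacher is the full-precision model."""
    p = softmax(teacher_logits)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits) + 1e-12)
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)))
```

The loss is zero when the quantized student exactly reproduces the teacher's distribution and strictly positive otherwise, so minimizing it pulls the NVFP4 model's outputs back toward the full-precision reference.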

6. Hardware and Implementation Considerations

NVFP4 is designed around the requirements and capabilities of modern NVIDIA architectures (Blackwell GPUs and beyond).

Empirical kernel benchmarks demonstrate that NVFP4-centric pipelines are memory-bound rather than compute-bound and yield end-to-end speedups of 2–4× compared to FP16/BF16 baselines.

7. Comparative Analysis, Limitations, and Future Directions

a. Comparative Summary

Format      Elem Bits   Block Scale   Block Size   Util. Range   Accuracy vs FP16 (%)*   Speedup (SM)
MXFP4       4           E8M0          32           ~[–3, 3]      90–95                   4–7×
NVFP4       4           E4M3          16           ~[–6, 6]      96–99                   3–6×
INT4+FP16   4           FP16          32           [–8, 7]       92–95                   2–3×

*: Empirical accuracy varies by task/model size; speedup is relative to BF16/FP16 (Egiazarian et al., 27 Sep 2025, NVIDIA et al., 29 Sep 2025, Meng et al., 12 Jan 2026).

b. Limitations and Open Challenges

  • NVFP4 is not universally lossless: extremely long token contexts or models with unbounded dynamic range may experience modest accuracy drops (<2%), particularly without 4/6 scaling or residual compensation (Cook et al., 1 Dec 2025, Meng et al., 12 Jan 2026, Panferov et al., 30 Jan 2026).
  • The tradeoff between block size (hardware efficiency vs. representational error) appears to be optimal near 16 elements: smaller blocks add scale-storage overhead, while larger blocks degrade small-value quantization.
  • Extension to all network layers (e.g., attention, embedding, output heads) remains constrained by stability and hardware uniformity restrictions (NVIDIA et al., 29 Sep 2025).
  • While MR-GPTQ and QAD close much of the quantization gap, some model architectures (notably RL-fine-tuned LLMs) may be brittle to conventional QAT/QAF methods, with QAD currently recognized as most robust (Xin et al., 27 Jan 2026).
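The storage side of the block-size tradeoff is easy to quantify: with 4-bit elements and one 8-bit E4M3 scale per block of size B, the effective footprint is 4 + 8/B bits per element, matching the 4.5 bits/element figure given for B = 16 in Section 1:

```python
# Effective bits per element as a function of block size B
for B in (8, 16, 32):
    bits = 4 + 8 / B
    print(f"B={B}: {bits} bits/element")   # 8 -> 5.0, 16 -> 4.5, 32 -> 4.25
```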

c. Research Directions

  • Elimination or minimization of residual higher-precision layers
  • Generalization of 2D scaling and Hadamard techniques to attention and communication subgraphs
  • Exploration of sub-4-bit formats (e.g., integer-only 2-bit), contingent on task and distribution
  • Systematic, automated block size and scale tuning
  • Integration of advanced unbiased quantizers (e.g., MS-EDEN) (Panferov et al., 30 Jan 2026)
  • Scaling of ARCQuant-type residual schemes to extremely large dimensions

In summary, NVFP4 is a rigorously defined, hardware-optimized, and empirically validated format for sub-8-bit floating-point quantization, offering near-full-precision accuracy across LLM training, post-training quantization, and deployment at substantially reduced computational and energy cost.
