NVFP4: 4-bit Lower-Precision Format
- NVFP4 is a 4-bit hardware-accelerated floating-point data format that enables efficient compression and computation of large-scale neural networks.
- It uses an E2M1 core with per-block E4M3 scaling and optional FP32 global scaling to achieve high dynamic range and reduced quantization error.
- NVFP4 delivers up to 6× arithmetic throughput and 50% memory reduction while maintaining near-FP16 accuracy in both training and inference.
NVFP4 is a hardware-accelerated, block-microscaled 4-bit floating-point data format that enables efficient, highly compressed storage and computation for large-scale deep neural networks, particularly LLMs. Backed by native support on NVIDIA Blackwell Tensor Cores, NVFP4 combines aggressive element-level quantization (an E2M1 floating-point core, 4 bits/element) with local per-block floating-point scaling (E4M3, 8 bits per 16-element block) and an optional global floating-point scale, yielding favorable dynamic range, robust numerical stability, and practical accuracy in both training and inference settings (Chmiel et al., 25 May 2025, Cook et al., 1 Dec 2025, Chen et al., 31 Oct 2025, Meng et al., 12 Jan 2026, Panferov et al., 30 Jan 2026, Xin et al., 27 Jan 2026).
1. Format Definition and Bit-Level Structure
NVFP4 uses a “microscaled” mini-float architecture designed for dense hardware efficiency and high representational fidelity under stringent memory constraints. Each tensor is partitioned into blocks of 16 elements, each encoded as follows:
- FP4 Core (E2M1):
- 1 sign bit, 2 exponent bits (bias = 1), 1 mantissa bit
- Representable set: $\{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\}$
- Value decoding: $(-1)^s \cdot 2^{e-1} \cdot (1 + m/2)$ for normalized codes ($e > 0$); subnormal codes ($e = 0$) decode as $(-1)^s \cdot m/2$
- Block Scale (E4M3 FP8):
- Shared for 16 elements
- 1 sign bit (usually zero), 4-bit exponent (bias = 7), 3-bit mantissa
- Decodes as $(-1)^s \cdot 2^{e-7} \cdot (1 + m/8)$ for normalized codes; supports scale values in $[1/128, 448]$
- Optional global tensor scale: Full-precision (FP32)
- Block packing: Each 16-element block uses 64 bits (4b × 16) + 8 bits (scale) = 72 bits, or 4.5 bits/element.
This structure ensures that every block achieves local dynamic-range adaptation without the coarseness or loss typical of power-of-two only scaling (as in MXFP4) (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025, Hooper et al., 19 Apr 2025, Egiazarian et al., 27 Sep 2025, Cook et al., 1 Dec 2025).
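The E2M1 decode rule above can be checked exhaustively. The following minimal Python sketch (function and variable names are illustrative, not from any NVFP4 library) enumerates all sixteen 4-bit codes and recovers the representable set:

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1),
    1 mantissa bit; codes with e == 0 are subnormal."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    e = (code >> 1) & 0b11
    m = code & 0b1
    if e == 0:
        mag = 0.5 * m                              # subnormal: 2^(1-1) * (m/2)
    else:
        mag = 2.0 ** (e - 1) * (1.0 + 0.5 * m)     # normal: 2^(e-1) * (1 + m/2)
    return sign * mag

FP4_VALUES = sorted({decode_e2m1(c) for c in range(16)})
# 15 distinct values: +0 and -0 collapse; the magnitude grid is
# {0, 0.5, 1, 1.5, 2, 3, 4, 6}
```

The ±6 dynamic range of this grid is what the per-block E4M3 scale then stretches to fit the local data distribution.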
2. Quantization and Dequantization Algorithms
a. Standard NVFP4 Quantization
Given a real-valued tensor $X$:
- Global scale (optional): $s_g = \max_i |X_i| / (448 \cdot 6)$, where $448$ is the E4M3 maximum and $6$ the FP4 maximum.
- Block partition: Split $X$ into blocks $B$ of 16 elements.
- Per-block scale: For block $B$, $s_B = \max_{i \in B} |x_i| / (6\, s_g)$, quantized to E4M3.
- Block-wise quantization: For each $x_i$ in block $B$, compute $x_i / (s_B s_g)$ and round to the nearest FP4 value (or use stochastic rounding for gradients), storing the 4-bit code $q_i$.
- Dequantization: At GEMM time, recover $\hat{x}_i = q_i \cdot s_B \cdot s_g$.
The per-block scaling limits the impact of local outliers and avoids the quantization coarseness of larger groupings, with empirical results favoring group size 16 (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025).
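The quantize/dequantize pipeline above can be sketched as follows. This is a simplified illustration with hypothetical helper names: the per-block scale is kept in full precision rather than rounded to E4M3, and the 4-bit codes are materialized as dequantized floats rather than packed bits.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 magnitudes

def nearest_fp4(v: float) -> float:
    """Round-to-nearest onto the signed FP4 value set."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(v)))
    return -mag if v < 0 else mag

def nvfp4_quantize(x: list[float], block: int = 16) -> list[float]:
    """Quantize-dequantize a tensor: FP32 global scale, one scale per
    16-element block, FP4 elements. Block scales stay in float here;
    real hardware rounds them to E4M3."""
    amax = max(abs(v) for v in x) or 1.0
    s_g = amax / (448.0 * 6.0)                     # global FP32 scale
    out = []
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        s_b = (max(abs(v) for v in blk) or 1.0) / (6.0 * s_g)
        out.extend(nearest_fp4(v / (s_b * s_g)) * s_b * s_g for v in blk)
    return out
```

Inputs that land on the scaled grid round-trip essentially exactly; off-grid values incur the block-local quantization error that the adaptive schemes below are designed to reduce.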
b. Adaptive Quantization and Error Mitigation
- "Four Over Six" (4/6) Block Scaling: Rather than always scaling to FP4's maximal representable value (6), blocks also consider scaling to 4, selecting the regime with minimal per-block mean squared error (MSE). This adaptive scheme reduces quantization error for near-maximal values, crucial for LLMs where value distributions can be highly nonuniform (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026).
- Double-block scaling: Used in TetraJet-v2, combines a large “outer” block scale with conventional inner microscale for further dynamic range refinement (Chen et al., 31 Oct 2025).
- Hadamard/Rotation-based error spreading: Random Hadamard transforms (RHT) decorrelate outliers for enhanced quantization uniformity in blocks, particularly for gradient tensors (NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025, Panferov et al., 30 Jan 2026).
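The 4/6 selection rule can be illustrated with a small sketch (hypothetical helper names; real implementations operate on packed tensors and E4M3 scales):

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(blk, scale):
    """Fake-quantize a block of values with a given real-valued scale."""
    out = []
    for v in blk:
        mag = min(FP4_GRID, key=lambda g: abs(g - abs(v) / scale))
        out.append((-mag if v < 0 else mag) * scale)
    return out

def four_over_six(blk):
    """Try scaling the block max to 6 and to 4; keep the lower-MSE choice."""
    amax = max(abs(v) for v in blk) or 1.0
    best = None
    for target in (6.0, 4.0):
        q = quantize_block(blk, amax / target)
        mse = sum((a - b) ** 2 for a, b in zip(blk, q)) / len(blk)
        if best is None or mse < best[0]:
            best = (mse, q, target)
    return best  # (mse, dequantized block, chosen target)
```

For a block such as [1.0, 0.8], scaling the max to 4 places 0.8 near a grid point (3 × 0.25 = 0.75), whereas scaling to 6 leaves it between 4 and 6 on the grid; the 4-regime wins on MSE, which is exactly the near-maximal-value effect the scheme targets.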
3. Rounding Schemes and Noise Analysis
NVFP4 deployments employ a hybrid of deterministic and unbiased quantization:
- Forward pass (inference and training): Deterministic round-to-nearest (RtN) for weights and activations. This minimizes quantization variance and is stable under repeated dot-product accumulation.
- Backward and update GEMMs (training): Stochastic rounding (SR) applied to gradients and update-activation tensors. SR ensures each quantized value is an unbiased estimator of the source, crucial for unbiased SGD (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025, Chen et al., 31 Oct 2025).
Quantization noise under SR is modeled as zero-mean, with per-element variance bounded by $\Delta^2/4$, where $\Delta$ is the local FP4 grid spacing after block scaling (hence per-block, set by the block scale $s_B$). Theoretical analysis demonstrates that effective training in NVFP4 is possible as long as gradient norms remain above the induced quantization-noise floor; when this threshold is crossed, switching to BF16 or FP8 is recommended (Chmiel et al., 25 May 2025).
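Stochastic rounding onto the FP4 grid, as applied to gradient tensors, can be sketched as follows (illustrative only; Blackwell performs SR in the Tensor Core datapath):

```python
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_fp4(v: float, rng=random) -> float:
    """Unbiased rounding onto the FP4 grid: round up with probability
    equal to the fractional distance to the upper neighbor, so that
    E[q] = v for any v within the representable range."""
    sign, mag = (-1.0, -v) if v < 0 else (1.0, v)
    mag = min(mag, 6.0)                  # clamp to max representable magnitude
    lo = max(g for g in FP4_GRID if g <= mag)
    hi = min(g for g in FP4_GRID if g >= mag)
    if hi == lo:                         # already on the grid
        return sign * lo
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)
```

For example, 4.5 sits a quarter of the way from 4 to 6, so it rounds up to 6 with probability 0.25 and down to 4 otherwise, giving an expectation of exactly 4.5.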
Recent advances (MS-EDEN, Quartet II (Panferov et al., 30 Jan 2026)) achieve unbiased quantization with roughly half the mean squared error of elementwise SR, by applying group-scale corrections post-RHT and re-quantizing scales using stochastic rounding.
4. Empirical Performance and Model Training
Training Stability and Accuracy:
- NVFP4 enables full model training with GEMMs in 4-bit precision up to multi-billion parameter scales and extreme token counts (200B tokens and beyond) (Chmiel et al., 25 May 2025, NVIDIA et al., 29 Sep 2025).
- Initial training-loss gaps relative to BF16 are small (≈0.05–1.5% in relative loss); extended quantization-aware fine-tuning or combination with Four Over Six scaling closes the gap.
- Downstream task performance (MMLU, code, reading tasks) is nearly identical to BF16 or FP8, with empirical accuracy within 1–2% for most tasks (Chmiel et al., 25 May 2025, Cook et al., 1 Dec 2025, Meng et al., 12 Jan 2026).
Efficiency:
- NVFP4 provides 2–6× arithmetic throughput versus BF16/FP16 and up to 50% memory reduction for weights and optimizer states. Measured speedups span 2–4.2× in full-model benchmarks (NVIDIA et al., 29 Sep 2025, Panferov et al., 30 Jan 2026, Meng et al., 12 Jan 2026).
- Design tradeoffs: blocks smaller than 16 elements yield limited further accuracy gain; larger blocks or coarser (e.g., power-of-two) scales (MXFP4) sharply degrade accuracy for heavy-tailed or block-outlier distributions (Hooper et al., 19 Apr 2025, Egiazarian et al., 27 Sep 2025).
Advanced Techniques:
- Selective retention of high-precision layers (e.g., early/final transform blocks in BF16) can further stabilize very deep or hybrid models with minimal capacity cost (NVIDIA et al., 29 Sep 2025, Cook et al., 1 Dec 2025).
- NVFP4 is compatible with fine-grained mixed-precision (FGMP) inference schemes, allowing critical channels or blocks to be promoted to FP8 as needed (Hooper et al., 19 Apr 2025).
5. Post-Training Quantization, Distillation, and Residual Compensation
Post-Training Quantizers and Compensation:
- NVFP4 supports classic PTQ (e.g., GPTQ, AWQ, SmoothQuant), but unique error profiles (notably, increased rounding error for near-maximal values) motivate new algorithms:
- 4/6 adaptive block scaling (Cook et al., 1 Dec 2025, Panferov et al., 30 Jan 2026) directly mitigates large-value error, improving perplexity and downstream accuracy in both pre-training and PTQ.
- ARCQuant (Meng et al., 12 Jan 2026) introduces Augmented Residual Channels: outlier dimensions are detected and encoded in additional residual “channels” (also NVFP4), enabling error-compensated GEMM computation with minimal latency increase.
- MR-GPTQ (Egiazarian et al., 27 Sep 2025) fuses small-block Hadamard transformations and blocked grid-search for scale optimization, reducing quantization-induced MSE.
Distillation for Accuracy Recovery:
- Quantization-aware distillation (QAD) (Xin et al., 27 Jan 2026) recovers or surpasses strong BF16 baselines by training the quantized model to match a reference full-precision model’s output distribution (via KL divergence loss), robustly bridging the small observed gaps left by even aggressive PTQ (including on multi-stage SFT+RL pipelines and vision-LLMs).
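The distillation objective is a standard KL divergence between teacher and student output distributions; a minimal stand-alone sketch (not the authors' training code, which operates on full logit tensors within an autograd framework):

```python
import math

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over softmax distributions: the QAD loss
    trains the quantized student to match the full-precision teacher."""
    def softmax(z):
        m = max(z)                       # shift for numerical stability
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return [v / s for v in e]
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero exactly when the quantized model reproduces the teacher's distribution, and penalizes the student most where the teacher is confident, which is why QAD recovers quality even when PTQ leaves small residual gaps.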
6. Hardware and Implementation Considerations
NVFP4 is intimately designed around the requirements and capabilities of modern NVIDIA architectures (Blackwell GPUs and beyond):
- Blackwell Tensor Cores natively implement 4-bit × 4-bit GEMMs using NVFP4 with 16-element blocks and full FP8 scaling, fused with scale computation, layout expansion, and stochastic rounding in hardware (NVIDIA et al., 29 Sep 2025, Chmiel et al., 25 May 2025, Panferov et al., 30 Jan 2026).
- Integration into software stacks (Transformer Engine, QuTLASS, etc.) provides seamless PyTorch training and inference support, fused quantize/dequantize kernels, and optional flags for advanced rounding or residual techniques (NVIDIA et al., 29 Sep 2025, Egiazarian et al., 27 Sep 2025, Meng et al., 12 Jan 2026).
- Hardware datapath extensions (e.g., VMAC-based PE arrays for FGMP) further minimize area and power for fine-grained mixed-precision routing (Hooper et al., 19 Apr 2025).
Empirical kernel benchmarks demonstrate that NVFP4-centric pipelines are memory-bound rather than compute-bound and yield end-to-end speedups of 2–4× compared to FP16/BF16 baselines.
7. Comparative Analysis, Limitations, and Future Directions
a. Comparative Summary
| Format | Elem Bits | Block Scale | Block Size | Util. Range | Avg Accuracy vs FP16 (%)* | Speedup (SM) |
|---|---|---|---|---|---|---|
| MXFP4 | 4 | E8M0 | 32 | ~[–3,3] | 90–95 | 4–7× |
| NVFP4 | 4 | E4M3 | 16 | ~[–6,6] | 96–99 | 3–6× |
| INT4+FP16 | 4 | FP16 | 32 | [-8,7] | 92–95 | 2–3× |
\* Empirical accuracy varies by task/model size; speedup is relative to BF16/FP16 (Egiazarian et al., 27 Sep 2025, NVIDIA et al., 29 Sep 2025, Meng et al., 12 Jan 2026).
b. Limitations and Open Challenges
- NVFP4 is not universally lossless: extremely long token contexts or models with unbounded dynamic range may experience modest accuracy drops (<2%), particularly without 4/6 scaling or residual compensation (Cook et al., 1 Dec 2025, Meng et al., 12 Jan 2026, Panferov et al., 30 Jan 2026).
- The block-size tradeoff (hardware efficiency vs. representational error) saturates at 16 elements: smaller blocks add storage overhead, while larger ones degrade small-value quantization.
- Extension to all network layers (e.g., attention, embedding, output heads) remains constrained by stability and hardware uniformity restrictions (NVIDIA et al., 29 Sep 2025).
- While MR-GPTQ and QAD close much of the quantization gap, some model architectures (notably RL-fine-tuned LLMs) may be brittle to conventional QAT/QAF methods, with QAD currently recognized as most robust (Xin et al., 27 Jan 2026).
c. Research Directions
- Elimination or minimization of residual higher-precision layers
- Generalization of 2D scaling and Hadamard techniques to attention and communication subgraphs
- Exploration of sub-4-bit formats (e.g., integer-only 2-bit), contingent on task and distribution
- Systematic, automated block size and scale tuning
- Integration of advanced unbiased quantizers (e.g., MS-EDEN) (Panferov et al., 30 Jan 2026)
- Scaling of ARCQuant-type residual schemes to extremely large dimensions
In summary, NVFP4 stands as a rigorously defined, hardware-optimized, and empirically validated solution for sub-8-bit floating-point quantization, offering near-full-precision performance across LLM training, PTQ, and deployment at unprecedented computational and energy efficiency.