Microscaling FP4 Quantization
- Microscaling FP4 quantization is a technique that employs blockwise 4-bit floating-point encoding with local scaling to reduce memory footprint and accelerate AI computations.
- It integrates adaptive algorithms such as 4/6 block scaling and blockwise rotation to manage outliers and suppress quantization errors effectively.
- Hardware implementations on platforms like NVIDIA Blackwell achieve up to 4–5× speedup and maintain over 90% accuracy relative to FP16, supporting efficient inference and training.
Microscaling FP4 quantization refers to ultra-low-precision floating-point quantization schemes in which neural network tensors (weights, activations, gradients) are partitioned into small blocks—typically 16 or 32 elements per block—each sharing a local scale factor. The data values are encoded using 4-bit floating-point representations with minimal exponent/mantissa width. This quantization strategy is designed to enable high-throughput matrix multiplication on modern AI accelerators (notably NVIDIA Blackwell), facilitating up to 4–5× speedup and 2× memory reduction compared to FP16/FP8, while attempting to preserve task accuracy. The “microscaling” label arises from the use of blockwise scaling (fine dynamic range fitting at small granularity), which is crucial for suppressing outliers and maximizing effective range within the severely limited expressivity of FP4.
1. FP4 Microscale Formats and Encoding
Microscaling FP4 quantization is realized using hardware-native formats on platforms such as NVIDIA Blackwell and various NPU designs. The canonical element representation is E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), with subnormals optionally supported. The dynamic range is determined by the exponent bias and mantissa encoding; for E2M1 the representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6}, giving a maximum normal value of 6 in both NVFP4 and MXFP4 (Zhang et al., 16 May 2025, Cuyckens et al., 9 Nov 2025). Block scales are stored in either E4M3 (FP8, 4-bit exponent/3-bit mantissa) or E8M0 (exponent-only, power-of-two) formats. In the MXFP4 family, blocks contain 32 elements sharing a scale, while NVFP4 prefers size 16 for hardware alignment (Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026, Cook et al., 1 Dec 2025).
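The storage trade-off between the two block sizes follows directly from amortizing the 8-bit scale over the block (a small illustrative calculation; the function name is ours):

```python
# Effective storage cost per element: 4 bits for the E2M1 code plus the
# amortized cost of the shared 8-bit block scale (E4M3 and E8M0 are both 8 bits).
def bits_per_element(block_size: int, scale_bits: int = 8) -> float:
    return 4 + scale_bits / block_size

mxfp4 = bits_per_element(32)  # E8M0 scale shared by 32 elements -> 4.25 bits
nvfp4 = bits_per_element(16)  # E4M3 scale shared by 16 elements -> 4.5 bits
```

NVFP4's smaller blocks buy finer-grained scaling at a modest 0.25-bit-per-element storage premium over MXFP4.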
Quantization is performed per block: for a block B = {x_1, …, x_k}, compute the shared scale s = max_i |x_i| / 6, where 6 is the largest E2M1 magnitude. Each element is then quantized by dividing by s, clipping, and rounding to the nearest FP4 code, and dequantized by multiplying by s. This yields highly compact representations while preserving sufficient numeric diversity for neural inference and training.
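A minimal NumPy sketch of this per-block procedure (the helper names are ours; nearest-point ties here break toward the lower code, whereas hardware typically uses round-to-nearest-even):

```python
import numpy as np

# Positive E2M1 magnitudes: subnormal 0.5, then normals up to the max code 6.0.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Map the block's max magnitude onto the largest FP4 code (6.0)."""
    scale = np.max(np.abs(block)) / E2M1_GRID[-1]
    if scale == 0.0:
        return np.zeros_like(block), 0.0
    scaled = block / scale
    # Round each magnitude to the nearest representable FP4 point.
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

def dequantize_block(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes * scale

x = np.array([0.1, -0.7, 1.2, 3.0])
codes, s = quantize_block(x)       # s = 0.5; codes = [0.0, -1.5, 2.0, 6.0]
x_hat = dequantize_block(codes, s)  # [0.0, -0.75, 1.0, 3.0]
```

Note how the smallest element rounds to zero: with only eight magnitudes, values far below the block maximum are lost, which is exactly the outlier sensitivity the next section addresses.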
2. Block Scaling, Adaptive Schemes, and Outlier Management
Block scaling is fundamental in microscale FP4. Each block receives a dynamically computed scale factor aimed at mapping its largest magnitude to the highest FP4 representable value. However, with such aggressive quantization, blockwise outliers can dominate error statistics. In MXFP4, scales are forced to be pure powers of two (E8M0), sacrificing some optimality relative to unconstrained FP formats (Shao et al., 6 Nov 2025, Zhang et al., 14 Jan 2026). For NVFP4 (E4M3 block scales), adaptivity is possible (Cook et al., 1 Dec 2025).
Key algorithms include:
- Four Over Six (4/6) Adaptive Block Scaling: For each block, consider two candidate scales, mapping the block’s maximum to either 6 (the max FP4 code) or 4. Select the scale yielding the minimum MSE. This targets blocks whose near-max FP4 rounding would destroy local accuracy, mitigating both training divergence and inference loss (Cook et al., 1 Dec 2025).
- Blockwise Rotation: Rather than applying global rotations to spread outliers—which is incompatible with PoT scaling—apply independent Hadamard or orthonormal transforms within each block. This preserves block-local statistics and supports optimal scale fitting, as in MR-GPTQ and BRQ methods (Shao et al., 6 Nov 2025, Egiazarian et al., 27 Sep 2025).
- Outlier-Preserved Microscaling: In settings prone to activation outlier destruction, retain outliers per block in full precision (BF16), quantizing only the remaining non-outlier elements. This strategy, realized in the OPAL accelerator pipeline, maintains accuracy while maximizing MXFP4 compute efficiency (Koo et al., 2024).
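The 4/6 selection above can be sketched as follows (an illustrative implementation under our naming, reusing a nearest-point FP4 rounder; the published method is defined in Cook et al., 1 Dec 2025):

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_round(scaled: np.ndarray) -> np.ndarray:
    """Round to the nearest representable E2M1 value, preserving sign."""
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(scaled) * E2M1_GRID[idx]

def four_over_six_quantize(block: np.ndarray) -> tuple[np.ndarray, float]:
    """Try mapping the block max onto code 6 and onto code 4; keep the
    candidate scale with the lower reconstruction MSE."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return np.zeros_like(block), 0.0
    best = None
    for target in (6.0, 4.0):
        scale = amax / target
        codes = fp4_round(np.clip(block / scale, -6.0, 6.0))
        mse = np.mean((codes * scale - block) ** 2)
        if best is None or mse < best[0]:
            best = (mse, codes, scale)
    return best[1], best[2]

x = np.array([0.9, 1.1, 2.9, 3.0])
codes, scale = four_over_six_quantize(x)
```

Mapping the maximum to 4 coarsens the top of the range but places the remaining codes more densely, which wins for blocks whose mass sits well below the maximum.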
3. Quantization Algorithms, Rounding Strategies, and Stabilization
FP4 quantization quality depends not only on block scaling but also on quantization and rounding policy:
- Round-to-Nearest (RtN): Standard rounding yields deterministic quantization error, but in low-precision accumulation this error can bias model updates, especially in training (Chmiel et al., 25 May 2025, Hu et al., 22 Sep 2025).
- Stochastic Rounding (SR): Essential for unbiased gradient estimation. FP4 training strategies split rounding—RtN for forward passes, SR for backward and update passes—to ensure unbiased SGD and stable convergence (Chmiel et al., 25 May 2025, Hu et al., 22 Sep 2025).
- EMA Quantizer and Q-Ramping: For MXFP4 training, weight oscillation is controlled by using an EMA of the weight for quantization decisions or by adaptively ramping local learning rates/accumulation based on observed oscillation metrics (Chen et al., 28 Feb 2025).
- Hadamard Transformation: Blockwise orthogonal transforms (Hadamard) are employed to flatten the distribution before quantization, equalizing energy and rendering scale fitting more effective. This is central in Quartet and MR-GPTQ, as well as in HQ-DiT for vision (Castro et al., 20 May 2025, Egiazarian et al., 27 Sep 2025, Liu et al., 2024).
Quantization-aware fine-tuning and split precision scheduling further close the gap to full precision (Chmiel et al., 25 May 2025, Castro et al., 20 May 2025).
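An illustrative unbiased stochastic-rounding routine over the positive E2M1 grid (sign handling and the block scale are omitted for brevity; function and variable names are ours):

```python
import numpy as np

def stochastic_round(x: np.ndarray, grid: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Round each value to one of its two neighbouring grid points with
    probability proportional to proximity, so that E[round(x)] == x."""
    x = np.clip(x, grid[0], grid[-1])
    hi_idx = np.clip(np.searchsorted(grid, x, side="left"), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_up = np.where(hi > lo, (x - lo) / (hi - lo), 0.0)
    return np.where(rng.random(x.shape) < p_up, hi, lo)

grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
rng = np.random.default_rng(0)
# A value of 2.3 rounds to 3.0 with probability 0.3 and to 2.0 otherwise,
# so the sample mean converges to 2.3 -- the unbiasedness SGD relies on.
samples = stochastic_round(np.full(100_000, 2.3), grid, rng)
```

Deterministic RtN would always emit 2.0 here, a systematic bias of 0.3 that accumulates across gradient steps; SR trades per-element variance for a zero-mean error.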
4. Hardware Implementations and Throughput Scaling
Microscaling FP4 is uniquely compatible with next-generation accelerator pipelines:
- Blackwell Tensor Cores: Full support for NVFP4 and MXFP4 block formats, fused scaling in GEMM, and specialized FP4–FP8 conversion for mixed-precision workflows. Throughput: up to 1600 TOPS for a single FP4 matmul; end-to-end workflows reach 1038 TOPS (SageAttention3, RTX 5090) (Zhang et al., 16 May 2025).
- Precision-Scalable MACs: Hybrid reduction trees perform FP4 multiply-accumulates with relaxed precision, reducing area and achieving 4065 GOPS/W for batch inference on the SNAX NPU (Cuyckens et al., 9 Nov 2025).
- OPAL Accelerator: Outlier-preserved FP4 microscale units offer energy savings with minimal perplexity (PPL) loss on LLMs, using adaptable integer and floating-point microblock units (Koo et al., 2024).
- Implementation Practices: Optimal block sizes (e.g., 16–32), fused kernel computation (CUTLASS), and scale clustering for memory bandwidth matching are required for maximized utilization (Liu et al., 4 Aug 2025, Egiazarian et al., 27 Sep 2025).
FP4 multipliers and adders occupy area comparable to, or even less than, INT4 equivalents, obviating hardware adoption barriers (Cuyckens et al., 9 Nov 2025, Liu et al., 2023).
5. Empirical Benchmarks, Comparative Results, and Limitations
FP4 microscale quantization schemes have been systematically benchmarked:
- LLMs / Transformers: MR-GPTQ and Quartet both recover nearly all of the FP16 score on Llama-3 models at W4A4 (Egiazarian et al., 27 Sep 2025, Castro et al., 20 May 2025). SageAttention3 achieves substantial attention-kernel speedups; FP4 All the Way and Quartet demonstrate end-to-end training speedups (Chmiel et al., 25 May 2025, Castro et al., 20 May 2025). ZeroQuant-FP reports FP4 weights 0.95 PPL better than INT4+FP8 for LLaMA-7B (Wu et al., 2023).
- Diffusion Models: FP4 schemes (HQ-DiT, MSFP+TALoRA+DFA) achieve FID close to full precision with speedups relative to INT8/FP8 (Liu et al., 2024, Zhao et al., 27 May 2025).
- Vision Transformers: TetraJet MXFP4 training substantially reduces accuracy loss relative to the baseline, nearly matching FP32 (Chen et al., 28 Feb 2025).
- Limitations: MXFP4 (PoT block scales) suffers from scale-induced error unless optimized; NVFP4 (FP8 block scales) is more robust, especially under Four Over Six adaptive scaling (Zhang et al., 14 Jan 2026, Cook et al., 1 Dec 2025). The scaling factor is a critical bottleneck in MXFP4 (Zhang et al., 14 Jan 2026). Mixed precision and error compensation algorithms (AWQ, GPTQ, MR-GPTQ) close much of the performance gap (Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026).
A table summarizing format and performance characteristics:
| Format | Block size | Scale format | Throughput (TOPS) | Accuracy recovery (vs FP16) |
|---|---|---|---|---|
| MXFP4 | 32 | E8M0 PoT | 512–1600 (B200/RTX5090) | 89–94% |
| NVFP4 | 16 | E4M3 | 1038–1600 | 96% |
6. Extensions, Mixed Precision, and Practical Guidelines
The mixed-precision paradigm—deciding channel/block precision by precomputed error thresholds—yields optimal trade-offs for each linear layer and supports customizable quantization pipeline design (Liu et al., 4 Aug 2025, Liu et al., 2023, Zhang et al., 16 May 2025). Best practices identified:
- Relative intra-group scale alignment (ZeroQuant-FP M2) for FP4–FP8 up-casting (Wu et al., 2023).
- Pre-scale optimization of the MXFP4 scale factors (Zhang et al., 14 Jan 2026).
- Blockwise rotation for compatibility with MXFP4 PoT scaling (Shao et al., 6 Nov 2025).
- Stochastic rounding for training/backward and deterministic for inference (Chmiel et al., 25 May 2025, Hu et al., 22 Sep 2025).
- Hadamard/block transform for outlier flattening prior to quantization (Castro et al., 20 May 2025, Egiazarian et al., 27 Sep 2025).
- Adaptive scaling selection (4/6 methods) for blocks with near-max errors (Cook et al., 1 Dec 2025).
- For activation quantization, per-channel exponent bias "microscaling" (LLM-FP4) yields near-full-precision accuracy for challenging transformer activations (Liu et al., 2023).
- Outlier preservation (retain BF16 values per block) in vision/generation tasks (Koo et al., 2024).
- Mixed format selection via MoFQ (select between FP4/INT4 per layer) achieves state-of-the-art PTQ results (Zhang et al., 2023).
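The Hadamard/block-transform guideline above can be made concrete (a sketch under our naming, using the Sylvester construction; MR-GPTQ, BRQ, and Quartet define the production variants):

```python
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two,
    e.g. the NVFP4 block size 16)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_blocks(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Apply an independent orthonormal rotation within each block; since H is
    symmetric and orthonormal, applying it twice recovers the input exactly."""
    H = hadamard_matrix(block)
    return (x.reshape(-1, block) @ H).reshape(x.shape)

# A single outlier of magnitude 8 is spread into 16 values of magnitude 2,
# making the block far friendlier to a single shared (even power-of-two) scale.
x = np.zeros(16)
x[3] = 8.0
y = rotate_blocks(x)
```

Because the transform is confined to each block, it composes with blockwise scale fitting (unlike a global rotation), which is why it is the preferred remedy under MXFP4's power-of-two scale constraint.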
7. Open Challenges and Future Directions
Despite substantial gains in speed, memory, and energy, notable accuracy degradation persists in aggressive MXFP4 settings, especially for extremely large LLMs or under pure power-of-two scaling (Zhang et al., 14 Jan 2026, Cook et al., 1 Dec 2025). Innovations such as adaptive block scaling, asymmetric scales (AMXFP4) (Lee et al., 2024), and blockwise rotation compensation methods offer promising mitigation. Further, the use of higher-precision scale formats (UE5M3, E4M3) has demonstrable benefits in balancing dynamic range with computation (Hu et al., 22 Sep 2025).
As hardware support for FP4 microscale formats becomes ubiquitous (Blackwell, SNAX, OPAL), algorithmic paradigms for quantization and training are rapidly adapting, favoring blockwise error compensation, mixed precision, and robust rounding techniques. The field continues to benchmark and refine methods across modalities (LLMs, vision, diffusion), with the trend toward training “FP4 all the way” established as both viable and highly efficient (Chmiel et al., 25 May 2025, Castro et al., 20 May 2025, Liu et al., 2023).
Microscaling FP4 quantization represents a convergence of low-bit hardware, block-adaptive quantization, and numeric optimization. It is now sufficiently matured for plug-and-play inference and end-to-end training in large foundation models, contingent on careful calibration of block size, scale format, and stabilization methodology. The performance-to-overhead Pareto frontier is governed as much by scaling, rounding, and error compensation choices as by raw bit-width. Robust implementations with empirical recovery exceeding 90% of FP16 performance are now available in open-source libraries and hardware pipelines spanning language, vision, and generative domains.