
Microscaling FP4 Quantization

Updated 16 January 2026
  • Microscaling FP4 quantization is a technique that employs blockwise 4-bit floating-point encoding with local scaling to reduce memory footprint and accelerate AI computations.
  • It integrates adaptive algorithms such as 4/6 block scaling and blockwise rotation to manage outliers and suppress quantization errors effectively.
  • Hardware implementations on platforms like NVIDIA Blackwell achieve up to 4–5× speedup and maintain over 90% accuracy relative to FP16, supporting efficient inference and training.

Microscaling FP4 quantization refers to ultra-low-precision floating-point quantization schemes in which neural network tensors (weights, activations, gradients) are partitioned into small blocks—typically 16 or 32 elements per block—each sharing a local scale factor. The data values are encoded using 4-bit floating-point representations with minimal exponent/mantissa width. This quantization strategy is designed to enable high-throughput matrix multiplication on modern AI accelerators (notably NVIDIA Blackwell), facilitating up to 4–5× speedup and 2× memory reduction compared to FP16/FP8, while attempting to preserve task accuracy. The “microscaling” label arises from the use of blockwise scaling (fine dynamic range fitting at small granularity), which is crucial for suppressing outliers and maximizing effective range within the severely limited expressivity of FP4.

1. FP4 Microscale Formats and Encoding

Microscaling FP4 quantization is realized using hardware-native formats on platforms such as NVIDIA Blackwell and various NPU designs. The canonical element representation is E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), with subnormals optionally supported. The dynamic range is determined by the exponent bias and mantissa encoding; for normalized values, x = (−1)^s · 2^(e−b) · (1 + m/2), giving the exact representable magnitudes {0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0} (Zhang et al., 16 May 2025, Cuyckens et al., 9 Nov 2025). Block scales are stored in either E4M3 (FP8, 4 exponent / 3 mantissa bits) or E8M0 (exponent-only, power-of-two) format. In the MXFP4 family, blocks contain 32 elements sharing a scale, while NVFP4 uses blocks of 16 for hardware alignment (Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026, Cook et al., 1 Dec 2025).
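The E2M1 encoding above can be made concrete with a minimal sketch. The helper name `decode_e2m1` and the bit layout (sign in bit 3, exponent in bits 2–1, mantissa in bit 0, bias 1, subnormals supported) are assumptions for illustration, not a hardware specification:

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa
    bit, exponent bias 1. Assumed bit layout: s | e1 e0 | m (illustrative)."""
    s = (code >> 3) & 0x1
    e = (code >> 1) & 0x3
    m = code & 0x1
    if e == 0:
        # Subnormal: no implicit leading 1, value = (m/2) * 2^(1 - bias)
        mag = m / 2.0
    else:
        # Normal: implicit leading 1, value = (1 + m/2) * 2^(e - bias)
        mag = (1.0 + m / 2.0) * 2.0 ** (e - 1)
    return -mag if s else mag

# The eight positive codes decode to the grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}
grid = [decode_e2m1(c) for c in range(8)]
```

Enumerating the codes this way makes the severe expressivity limit visible: only eight magnitudes are available, which is why per-block scaling carries so much of the representational burden.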

Quantization is performed per block: for a block X = (x_1, …, x_B), compute the scale s = max_i |x_i| / Q_max, where Q_max is the largest representable FP4 magnitude (6 for E2M1). Each element is then quantized by dividing by s, clipping, rounding to the nearest FP4 code, and dequantizing by multiplying by s. This yields highly compact representations while preserving sufficient numeric diversity for neural inference and training.
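The per-block round trip just described can be sketched as follows. This is a minimal reference implementation, not a hardware kernel: it rounds to the nearest grid point by brute force, whereas real pipelines use hardware round-to-nearest-even, and the function names are illustrative:

```python
import numpy as np

# E2M1 magnitudes (sign handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
Q_MAX = 6.0

def quantize_block(x: np.ndarray):
    """Blockwise FP4 quantization: one shared scale s maps the block
    maximum onto Q_MAX, then each element is rounded to the FP4 grid."""
    s = np.abs(x).max() / Q_MAX
    if s == 0:
        return np.zeros_like(x), 0.0
    y = np.clip(np.abs(x) / s, 0.0, Q_MAX)
    # Brute-force nearest-grid rounding (hardware uses RNE)
    idx = np.abs(y[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(x) * FP4_GRID[idx]
    return q, s

def dequantize_block(q: np.ndarray, s: float) -> np.ndarray:
    return q * s

x = np.array([0.1, -0.7, 2.5, -6.0, 3.3, 0.02, -1.1, 4.9])
q, s = quantize_block(x)
x_hat = dequantize_block(q, s)
```

Note that the block maximum (here −6.0) is reconstructed exactly by construction, while small elements absorb most of the rounding error; this is precisely why outlier handling (Section 2) matters.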

2. Block Scaling, Adaptive Schemes, and Outlier Management

Block scaling is fundamental to microscale FP4. Each block receives a dynamically computed scale factor that maps its largest magnitude onto the highest representable FP4 value. With such aggressive quantization, however, blockwise outliers can dominate the error statistics. In MXFP4, scales are constrained to pure powers of two (E8M0), sacrificing some optimality relative to unconstrained floating-point scales (Shao et al., 6 Nov 2025, Zhang et al., 14 Jan 2026). NVFP4's E4M3 block scales permit finer adaptivity (Cook et al., 1 Dec 2025).

Key algorithms include:

  • Four Over Six (4/6) Adaptive Block Scaling: For each block, consider quantizing with scales that map the block’s maximum onto either 6 or 4 (the two largest FP4 codes), and select the scale yielding minimum MSE. This targets blocks where near-max FP4 rounding would destroy local accuracy, mitigating both training divergence and inference loss (Cook et al., 1 Dec 2025).
  • Blockwise Rotation: Rather than applying global rotations to spread outliers—which is incompatible with PoT scaling—apply independent Hadamard or orthonormal transforms within each block. This preserves block-local statistics and supports optimal scale fitting, as in MR-GPTQ and BRQ methods (Shao et al., 6 Nov 2025, Egiazarian et al., 27 Sep 2025).
  • Outlier-Preserved Microscaling: In settings where activation outliers would otherwise be destroyed, retain n outliers per block in full precision (BF16) and quantize only the remaining elements. This strategy, realized in the OPAL accelerator pipeline, maintains accuracy while maximizing MXFP4 compute efficiency (Koo et al., 2024).

3. Quantization Algorithms, Rounding Strategies, and Stabilization

FP4 quantization quality depends not only on block scaling but also on the quantization and rounding policy.

Quantization-aware fine-tuning and split precision scheduling further close the gap to full precision (Chmiel et al., 25 May 2025, Castro et al., 20 May 2025).
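The role of the rounding policy can be illustrated by contrasting round-to-nearest with stochastic rounding on the FP4 grid; stochastic rounding is a standard stabilization device in low-precision training because it is unbiased in expectation. This is a hedged sketch with illustrative names, operating on already-scaled non-negative values:

```python
import numpy as np

FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def round_nearest(y: np.ndarray) -> np.ndarray:
    """Deterministic round-to-nearest on the FP4 magnitude grid."""
    idx = np.abs(y[:, None] - FP4_POS[None, :]).argmin(axis=1)
    return FP4_POS[idx]

def round_stochastic(y: np.ndarray, rng=np.random.default_rng(0)) -> np.ndarray:
    """Stochastic rounding: choose the lower/upper grid neighbour with
    probability proportional to proximity, so E[round(y)] = y."""
    y = np.clip(y, FP4_POS[0], FP4_POS[-1])
    hi = np.clip(np.searchsorted(FP4_POS, y), 1, len(FP4_POS) - 1)
    lo = hi - 1
    gap = FP4_POS[hi] - FP4_POS[lo]
    p_up = (y - FP4_POS[lo]) / gap
    up = rng.random(y.shape) < p_up        # seed fixed for reproducibility
    return np.where(up, FP4_POS[hi], FP4_POS[lo])
```

Deterministic rounding of a value like 2.5 always loses 0.5 in the same direction, whereas stochastic rounding splits between 2.0 and 3.0 and preserves the mean, which is why it helps keep FP4 gradient accumulation from drifting.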

4. Hardware Implementations and Throughput Scaling

Microscaling FP4 is uniquely compatible with next-generation accelerator pipelines:

  • Blackwell Tensor Cores: Full support for NVFP4 and MXFP4 block formats, fused scaling in GEMM, and specialized FP4–FP8 conversion for mixed-precision workflows. Throughput reaches up to 1600 TOPS for a single FP4 matmul, with end-to-end workflows at 1038 TOPS (SageAttention3, RTX5090) (Zhang et al., 16 May 2025).
  • Precision-Scalable MACs: Hybrid reduction trees perform FP4 multiply-accumulates with relaxed precision, cutting area by 2.4–3.1× and achieving 4065 GOPS/W for batched inference on the SNAX NPU (Cuyckens et al., 9 Nov 2025).
  • OPAL Accelerator: Outlier-preserved FP4 microscale units deliver 1.6–2.2× energy savings at under 1 PPL loss on LLMs, with adaptable integer and floating-point microblock units (Koo et al., 2024).
  • Implementation Practices: Appropriate block sizes (e.g., 16–32), fused kernel computation (CUTLASS), and scale clustering matched to memory bandwidth are required to maximize utilization (Liu et al., 4 Aug 2025, Egiazarian et al., 27 Sep 2025).

FP4 multipliers and adders occupy die area comparable to, or even smaller than, INT4 equivalents, removing a key barrier to hardware adoption (Cuyckens et al., 9 Nov 2025, Liu et al., 2023).

5. Empirical Benchmarks, Comparative Results, and Limitations

FP4 microscale quantization schemes have been systematically benchmarked. The table below summarizes the principal formats and reported performance:

| Format | Block size | Scale format | Throughput (TOPS) | Accuracy recovery (vs FP16) |
|--------|-----------|--------------|-------------------|-----------------------------|
| MXFP4 | 32 | E8M0 (power-of-two) | 512–1600 (B200/RTX5090) | 89–94% |
| NVFP4 | 16 | E4M3 | 1038–1600 | 96% |

6. Extensions, Mixed Precision, and Practical Guidelines

The mixed-precision paradigm—deciding channel/block precision by precomputed error thresholds—yields optimal trade-offs for each linear layer and supports customizable quantization pipeline design (Liu et al., 4 Aug 2025, Liu et al., 2023, Zhang et al., 16 May 2025). Best practices identified:

7. Open Challenges and Future Directions

Despite substantial gains in speed, memory, and energy, notable accuracy degradation persists in aggressive MXFP4 settings, especially for extremely large LLMs or under pure power-of-two scaling (Zhang et al., 14 Jan 2026, Cook et al., 1 Dec 2025). Innovations such as adaptive block scaling, asymmetric scales (AMXFP4) (Lee et al., 2024), and blockwise rotation compensation offer promising mitigation. Furthermore, higher-precision scale formats (UE5M3, E4M3) have demonstrated benefits in balancing dynamic range against computational cost (Hu et al., 22 Sep 2025).

As hardware support for FP4 microscale formats becomes ubiquitous (Blackwell, SNAX, OPAL), algorithmic paradigms for quantization and training are rapidly adapting, favoring blockwise error compensation, mixed precision, and robust rounding techniques. The field continues to benchmark and refine methods across modalities (LLMs, vision, diffusion), with the trend toward training “FP4 all the way” established as both viable and highly efficient (Chmiel et al., 25 May 2025, Castro et al., 20 May 2025, Liu et al., 2023).


Microscaling FP4 quantization represents a convergence of low-bit hardware, block-adaptive quantization, and numeric optimization. It is now sufficiently matured for plug-and-play inference and end-to-end training in large foundation models, contingent on careful calibration of block size, scale format, and stabilization methodology. The performance-to-overhead Pareto frontier is governed as much by scaling, rounding, and error compensation choices as by raw bit-width. Robust implementations with empirical recovery exceeding 90% of FP16 performance are now available in open-source libraries and hardware pipelines spanning language, vision, and generative domains.
