
Microscaling Floating-Point (MXFP) Overview

Updated 21 January 2026
  • Microscaling Floating-Point (MXFP) is a low-precision representation using block-wise mini-floats with a shared scale to optimize dynamic range and quantization accuracy.
  • It reduces memory usage and enhances compute efficiency, making it ideal for AI accelerators in large language models and edge devices.
  • MXFP offers versatile format variants that balance quantization error and hardware performance, enabling drop-in integration with advanced quantization techniques.

Microscaling Floating-Point (MXFP) is a family of block-wise low-precision floating-point representations that combine compact per-element storage with a shared scale across small groups, enabling high dynamic range, fine-grained quantization, and efficient hardware implementation. MXFP formats are central to modern AI accelerators, especially in contexts demanding aggressive quantization (4–8 bits) for LLMs and edge devices, where both memory and compute resources are at a premium. The MXFP principle generalizes block floating-point by integrating low-bitwidth per-element “mini-floats” and a shared, typically 8-bit, exponent per block, thus achieving better accuracy-per-bit than uniform per-tensor scaling schemes and standard low-precision floating point.

1. Formal Definition, Bit Structure, and Encoding

A typical MXFP block contains k elements (standard: k = 32), each represented as a low-bit IEEE-style mini-float with 1 sign bit, e exponent bits, and m mantissa bits. All k elements share a single E-bit block exponent or scale, often realized as an 8-bit E8M0 power-of-two factor. The canonical element formats are:

  • MXFP8: E5M2 or E4M3 (total 8 bits)
  • MXFP6: E3M2 or E2M3 (total 6 bits)
  • MXFP4: E2M1 (total 4 bits)

An individual element v_i in block B with shared scale X is decoded as

v_i = X · (−1)^{s_i} · 2^{e_i − bias} · (1 + m_i / 2^m)

where s_i is the sign, e_i the local exponent, m_i the mantissa of element i, and bias is determined by the element exponent width.

Encoding involves:

  1. For a block B of input values {x_j}, compute e_max = max_j ⌊log2 |x_j|⌋.
  2. Set X = 2^{e_max − e_max,elem}, where e_max,elem is the maximum normal exponent of the element format.
  3. Quantize each element x_j as q_j = quantize(x_j / X) using round-to-nearest or another rounding policy, clamping overflows.

This process selects the block scale to maximize range utilization and minimize quantization error within each block (Rouhani et al., 2023, Lee et al., 16 Oct 2025, Cuyckens et al., 9 Nov 2025).
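The three-step procedure above can be sketched in plain Python. This is an illustrative, not bit-exact, rendering of the scheme: the function names are ours, there are no inf/NaN encodings, and subnormals are handled by clamping to the minimum normal exponent (which reproduces the subnormal step).

```python
import math

def round_to_minifloat(v, e_bits, m_bits):
    """Round v to the nearest value on a sign/e_bits/m_bits mini-float grid
    (no inf/NaN codes; overflow is clamped to the largest normal value)."""
    if v == 0.0:
        return 0.0
    bias = (1 << (e_bits - 1)) - 1
    e_max = (1 << e_bits) - 1 - bias              # largest normal exponent
    sign = math.copysign(1.0, v)
    a = abs(v)
    e = math.floor(math.log2(a))
    e = min(max(e, 1 - bias), e_max)              # clamp into normal/subnormal range
    step = 2.0 ** (e - m_bits)                    # mantissa ULP at this exponent
    q = round(a / step) * step
    largest = 2.0 ** e_max * (2.0 - 2.0 ** -m_bits)
    return sign * min(q, largest)

def mxfp_quantize_block(xs, e_bits=2, m_bits=1):
    """Encode one block: derive a shared power-of-two scale from the block
    max, then round each scaled element to the mini-float grid."""
    e_max_elem = (1 << e_bits) - 1 - ((1 << (e_bits - 1)) - 1)
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 1.0, [0.0] * len(xs)
    X = 2.0 ** (math.floor(math.log2(amax)) - e_max_elem)   # shared scale
    return X, [round_to_minifloat(x / X, e_bits, m_bits) for x in xs]

def mxfp_dequantize_block(X, qs):
    """Decoding is just the shared scale times each stored element value."""
    return [X * q for q in qs]
```

With the MXFP4 (E2M1) defaults, a block like [0.1, -0.6, 1.5, 3.2] gets scale X = 0.5, and each element is snapped to the grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6} after scaling.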

2. Rationale, Numerical Properties, and Format Variants

MXFP aims to strike a balance among dynamic range, quantization precision, and hardware efficiency. The key advantages are:

  • Dynamic Range: The shared 8-bit block scale provides an effective dynamic range matching or exceeding FP32, mitigating underflow/overflow even at 4–8 bits per element.
  • Precision Tuning: Per-element mini-floats admit fine-grained trade-off between exponent (range) and mantissa (local precision), with the OCP standard supporting E5M2, E4M3, E3M2, E2M3, and E2M1 layouts.
  • Drop-in Integration: The shared scale and per-block quantization can be implemented as a drop-in replacement for FP32/FP16 tensors and dot-product kernels, requiring minimal code changes (Rouhani et al., 2023, Cuyckens et al., 9 Nov 2025).

A summary of typical MXFP format characteristics:

| Format | Bits/Elem | Exponent Bits | Mantissa Bits | Block Size | Shared Scale |
|---|---|---|---|---|---|
| MXFP8-E5M2 | 8 | 5 | 2 | 32 | E8M0 (power of 2) |
| MXFP8-E4M3 | 8 | 4 | 3 | 32 | E8M0 (power of 2) |
| MXFP6-E3M2 | 6 | 3 | 2 | 32 | E8M0 (power of 2) |
| MXFP6-E2M3 | 6 | 2 | 3 | 32 | E8M0 (power of 2) |
| MXFP4-E2M1 | 4 | 2 | 1 | 32 | E8M0 (power of 2) |

Each block's effective storage cost per element is (e + m + 1) + E/k bits.
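A worked instance of this formula (the helper name `effective_bits` is ours):

```python
def effective_bits(e, m, E=8, k=32):
    """Amortized storage per element: the (1 + e + m)-bit element plus the
    shared E-bit block scale spread across the k elements of its block."""
    return (e + m + 1) + E / k

print(effective_bits(2, 1))  # MXFP4-E2M1: 4 + 8/32 = 4.25 bits/element
print(effective_bits(4, 3))  # MXFP8-E4M3: 8 + 8/32 = 8.25 bits/element
```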

3. Quantization, Outlier Handling, and Format Extensions

MXFP uses a block-wise absolute-max quantization strategy; the block scale is chosen so the largest element fits the mini-float range, and all other elements are normalized accordingly. This provides robust outlier suppression, essential for uncalibrated post-training quantization (PTQ) and aggressive quantization in LLMs and vision transformers (Lee et al., 2024, Egiazarian et al., 27 Sep 2025, Lee et al., 16 Oct 2025).

However, small mantissa widths in very low-bit MXFP (e.g., MXFP4) induce large block-wise quantization error, particularly when blocks contain outliers or exhibit strong within-block asymmetry. Recent extensions, such as MX+ (Lee et al., 16 Oct 2025), address this by granting the block-max element increased precision (recycling its unused exponent bits as additional mantissa) and storing its index in block metadata, reducing block-max quantization error by up to 4×.
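The gain from reclaiming the block-max element's exponent bits can be seen with a small numeric sketch. This is our simplification: it only models mantissa rounding in the top binade, not the paper's bit layout or metadata encoding.

```python
# For E2M1 the block scale pins the block max into the top binade [4, 8),
# where its exponent field is redundant; MX+ reuses those 2 exponent bits
# as extra mantissa, i.e. 1 -> 3 mantissa bits for that one element.
def round_at_top_exponent(a, e_top, m_bits):
    """Round a value in [2**e_top, 2**(e_top + 1)) to m_bits of mantissa."""
    step = 2.0 ** (e_top - m_bits)
    return round(a / step) * step

a = 5.3                                    # scaled block-max value
base = round_at_top_exponent(a, 2, 1)      # plain MXFP4: 6.0, error 0.7
plus = round_at_top_exponent(a, 2, 3)      # MX+ element: 5.5, error 0.2
```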

The AMXFP4 (asymmetric MXFP4) format further introduces per-block asymmetric scaling—independently scaling positive and negative elements—addressing group-wise asymmetry created by “micro-grouping” in LLM activations. This yields accuracy gains over both symmetric MXFP4 and rotation-based INT4 methods (Lee et al., 2024).
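A minimal sketch of the asymmetric-scaling idea (names are ours; the real AMXFP4 then quantizes each sign group to its mini-float grid, which is omitted here):

```python
def asymmetric_scales(xs):
    """Independent shared scales for the positive and negative elements of
    one block, returned as positive magnitudes."""
    pos_max = max((x for x in xs if x > 0), default=0.0)
    neg_max = max((-x for x in xs if x < 0), default=0.0)
    return pos_max, neg_max

def normalize_asymmetric(xs):
    """Map each sign group onto [-1, 1] using its own scale, so a block
    with a skewed sign distribution wastes less of the element range."""
    sp, sn = asymmetric_scales(xs)
    return [x / sp if x > 0 else (x / sn if x < 0 else 0.0) for x in xs]
```

With a symmetric scale, the block [2.0, -0.5, 4.0, -0.25] would map its negative elements deep into the small-magnitude end of the grid; with per-sign scales both groups span the full normalized range.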

Other format variants include adaptive hybrid block BFP/MXFP (Nanoscaling/NxFP with NanoMantissa, Adaptive Microexponent, and Code Recycling) enabling further memory savings and quantization-error reduction (Lo et al., 2024).

4. Hardware Implementation and Efficiency

MXFP formats are specifically tailored for efficient hardware realization on ASICs and FPGAs (Cuyckens et al., 9 Nov 2025, Gorodecky et al., 2024, İslamoğlu et al., 19 May 2025). Key microarchitectural features include:

  • Aligned Block Access: Simultaneous processing of kk elements with a single shared scale register per block maximizes data-path utilization.
  • Simple Scaling: E8M0 scaling reduces to shift operations, avoiding general multipliers for block rescaling and dequantization.
  • Precision-Scalable MACs: Unified MAC datapaths support INT8, MXFP8/6/4, and exploit sub-word parallelism, sharing multipliers and exponent adders across formats (Cuyckens et al., 28 May 2025).
  • Custom ISA Extensions: Block dot-products (e.g., MXDOTP) fuse block scaling, elementwise product, accumulation, and final scaling into one instruction (e.g., RISC-V custom opcodes), achieving up to 25× speedup and 12.5× energy-efficiency gain over software (İslamoğlu et al., 19 May 2025, Zaruba et al., 2020).
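The "simple scaling" point is easy to verify on the host as well: a power-of-two E8M0 scale never touches a float's mantissa, so applying it amounts to an integer add on the exponent field. Function names below are ours; the bit-level version assumes a normal float32 and does no overflow checking.

```python
import math
import struct

def apply_e8m0(v, scale_exp):
    """Multiply v by 2**scale_exp without a general multiplier."""
    return math.ldexp(v, scale_exp)

def apply_e8m0_bits(v, scale_exp):
    """The same operation on the raw float32 pattern: the 8-bit exponent
    field starts at bit 23, so scaling is a single integer add there."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    bits += scale_exp << 23
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```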

Efficiency metrics from prototyped MXFP hardware systems:

| Mode | MAC Area (µm²) | Energy/Op (pJ) | System Efficiency (GOPS/W) | Top-1 Acc Δ (ResNet/ViT) |
|---|---|---|---|---|
| MXFP8 E4M3 | 1,043 | 1.11–1.17 | 1438–1675 | <0.1% |
| MXFP6 E3M2/E2M3 | ~1,000 | 1.05–1.13 | 1438–1675 | <1% (after finetune) |
| MXFP4 E2M1 | 1,896 | 0.39 | 4065 | −2% to −4% |

Hardware supports blockwise quantization with full throughput pipeline parallelism, and reduction trees with hybrid floating/integer accumulators for error-bound trade-offs (Cuyckens et al., 9 Nov 2025, Cuyckens et al., 28 May 2025).

5. Empirical Performance and Model Integration

MXFP formats offer competitive empirical accuracy across diverse ML tasks:

  • LLMs: MXFP8 and mixed MXFP6/8 permit sub-8-bit “direct-cast” inference and training of LLMs (GPT, OPT, Llama3, Mistral) with <1% drop vs. FP32, including across generative, discriminative, and multimodal settings (Rouhani et al., 2023, Zhang et al., 14 Jan 2026, Cococcioni et al., 2 Oct 2025).
  • Vision and Speech: ImageNet, Wav2Vec2, and other large-scale benchmarks show MXFP8/MXFP6 matching FP32/FP16 within noise; MXFP4 only after aggressive finetuning or “blockmax” extensions (Rouhani et al., 2023, Lee et al., 16 Oct 2025).
  • Edge and Robotics: In edge training (e.g., robotics continual learning), MXFP enables 4× higher throughput and 25.6% area reduction at equivalent energy (Cuyckens et al., 28 May 2025).
  • FFT and HPC: MXFP-based FFTs with per-block and power-of-two prescale achieve near-FP16 (40 dB PSNR) fidelity in MRI, with B=8–32 blocks optimizing performance (Deveshwar et al., 3 Dec 2025).


6. Algorithmic and System-Level Enhancements

Enhancements and adaptations of MXFP accommodate ultra-low precision and maximize model fidelity:

  • MX+ (Blockmax Exponent Reclaiming): The top-1 ("block-max") element reuses its private exponent field as added mantissa, reducing block quantization error by up to 4×, with <0.25 average bits/element overhead (Lee et al., 16 Oct 2025).
  • AMXFP4 (Asymmetric MXFP): Dual group-wise shared scales for positive/negative elements, mitigating quantization asymmetry; achieves perplexity/accuracy competitive with calibration-based INT4 quantization but with no calibration (Lee et al., 2024).
  • Nanoscaling (NxFP): Incorporates a “nano-mantissa” into the shared block scale, adaptive microexponent, and code recycling for further quantization error reduction and memory savings over state-of-the-art MXFP (Lo et al., 2024).
  • MR-GPTQ/PTQ Adaptations: Specialized error compensation methods (MR-GPTQ) and affine (FlatQuant) scaling tuned to the unique errors of MXFP4 scale quantization and intra-block blockmax (Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026).
  • Square Block Grouping: 8x8 grouping permits efficient weight reuse in both forward and backward passes (training) and minimizes data redundancy (Cuyckens et al., 28 May 2025).
  • Hardware-Algorithm Co-Design: Dynamic channel gating, block rotation (e.g., block Hadamard transforms), and blockmax detection at kernel or hardware MMU level (Cuyckens et al., 9 Nov 2025, Egiazarian et al., 27 Sep 2025, Lee et al., 16 Oct 2025).

7. Limitations, Trade-offs, and Best Practices

MXFP8 and MXFP6 can be used for direct-cast inference, mixed-precision training, and compressed activations with essentially no modification to network architectures, optimization, or training recipes for tasks including LLMs and vision models (Rouhani et al., 2023, Cococcioni et al., 2 Oct 2025, Deveshwar et al., 3 Dec 2025). MXFP4 and lower are viable only with algorithmic enhancements (e.g., blockmax reclamation, asymmetric scaling, PTQ refinement), or as activation-only formats.

The principal trade-offs of MXFP are:

| Aspect | Low Mantissa (MXFP4) | Wide Mantissa (MXFP8) |
|---|---|---|
| Dynamic range | High (via shared scale) | High |
| Quantization error | Moderate–high | Low |
| Outlier handling | Strong | Strong |
| Calibration-free inference | Difficult | Easy |
| Hardware efficiency | Maximum | High |

Other practical considerations include the non-commutativity of block quantization with transposition (both W and W^T must be re-quantized for backprop), locality effects (block size vs. quantization error), and hardware kernel support for non-uniform read/write access (Rouhani et al., 2023, Cuyckens et al., 28 May 2025, Gorodecky et al., 2024).
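The transpose caveat can be demonstrated with a toy per-row block quantizer (an illustrative power-of-two grid, not a full MXFP codec): transposition regroups elements into different blocks with different block maxima, so the two orders of operations disagree.

```python
import math

def rowwise_quantize(W, m_bits=1):
    """Quantize each row as one block: a shared power-of-two step derived
    from the row's absolute max, then round-to-nearest on that grid."""
    out = []
    for row in W:
        amax = max(abs(x) for x in row)
        step = 2.0 ** (math.floor(math.log2(amax)) - m_bits)
        out.append([round(x / step) * step for x in row])
    return out

def transpose(W):
    return [list(r) for r in zip(*W)]

W = [[1.3, 6.0],
     [0.2, 0.3]]
A = transpose(rowwise_quantize(W))   # quantize W's rows, then transpose
B = rowwise_quantize(transpose(W))   # quantize W^T's rows directly
# A and B differ: the small row [0.2, 0.3] gets a fine grid in W, but its
# elements land in coarse blocks dominated by 1.3 and 6.0 in W^T.
```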

In sum, Microscaling Floating-Point formats represent a robust, scalable, and hardware-friendly numerics foundation for efficient sub-8-bit deep learning, balancing memory savings, computational efficiency, and model fidelity through the mathematical structure of block-shared scaling and per-element floating-point fields (Rouhani et al., 2023, Cococcioni et al., 2 Oct 2025, Cuyckens et al., 9 Nov 2025, İslamoğlu et al., 19 May 2025, Lee et al., 2024, Lee et al., 16 Oct 2025, Egiazarian et al., 27 Sep 2025, Lo et al., 2024, Zhang et al., 14 Jan 2026).
