Microscaling Floating-Point (MXFP) Overview
- Microscaling Floating-Point (MXFP) is a low-precision representation using block-wise mini-floats with a shared scale to optimize dynamic range and quantization accuracy.
- It reduces memory usage and enhances compute efficiency, making it ideal for AI accelerators in large language models and edge devices.
- MXFP offers versatile format variants that balance quantization error and hardware performance, enabling drop-in integration with advanced quantization techniques.
Microscaling Floating-Point (MXFP) is a family of block-wise low-precision floating-point representations that combine compact per-element storage with a shared scale across small groups, enabling high dynamic range, fine-grained quantization, and efficient hardware implementation. MXFP formats are central to modern AI accelerators, especially in contexts demanding aggressive quantization (4–8 bits) for LLMs and edge devices, where both memory and compute resources are at a premium. The MXFP principle generalizes block floating-point by integrating low-bitwidth per-element “mini-floats” and a shared, typically 8-bit, exponent per block, thus achieving better accuracy-per-bit than uniform per-tensor scaling schemes and standard low-precision floating point.
1. Formal Definition, Bit Structure, and Encoding
A typical MXFP block contains $k$ elements (standard: $k = 32$), each represented as a low-bit IEEE-style mini-float with $1$ sign bit, $e$ exponent bits, and $m$ mantissa bits. All elements share a single 8-bit block exponent or scale, typically realized as an E8M0 power-of-two factor. The canonical element formats are E5M2, E4M3, E3M2, E2M3, and E2M1 (tabulated in Section 2).
An individual element $x_i$ in a block with shared scale $X = 2^{E_s}$ is decoded as $$x_i = X \cdot (-1)^{s_i} \cdot 2^{e_i - \text{bias}} \cdot \left(1 + \frac{m_i}{2^{M}}\right),$$ where $s_i$ is the sign, $e_i$ the local exponent, and $m_i$ the mantissa of element $i$; $M$ is the mantissa width, and the bias is determined by the element exponent width (subnormal encodings drop the implicit leading one).
Encoding involves:
- For a block of input values $x_1, \dots, x_k$, compute $\mathrm{amax} = \max_i |x_i|$.
- Set $E_s = \lfloor \log_2(\mathrm{amax}) \rfloor - e_{\max}$, where $e_{\max}$ is the maximum normal exponent for the element format.
- Quantize each element as $\hat{x}_i = Q\!\left(x_i / 2^{E_s}\right)$ using round-to-nearest or other policies, clamping overflows.
This process selects the block scale to maximize range utilization and minimize quantization error within each block (Rouhani et al., 2023, Lee et al., 16 Oct 2025, Cuyckens et al., 9 Nov 2025).
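The encode/decode steps above can be sketched in NumPy for the 4-bit E2M1 element format. This is a simplified model, not a production kernel: it rounds onto the FP4 value grid by nearest-neighbor search, skips special values, and the function names are illustrative.

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) element format:
# 0, 0.5, 1, 1.5, 2, 3, 4, 6 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
EMAX_E2M1 = 2  # exponent of the largest representable power of two (2^2 = 4)

def mxfp4_quantize_block(x):
    """Quantize one block to MXFP4-E2M1 with a shared power-of-two scale.

    Returns (shared_exp, signs, values); each value lies on the FP4 grid
    and decodes as sign * value * 2**shared_exp.
    """
    amax = np.max(np.abs(x))
    if amax == 0:
        return 0, np.ones_like(x), np.zeros_like(x)
    # Shared scale per the MX-style rule E_s = floor(log2(amax)) - emax,
    # aligning the block maximum with the format's largest binade.
    shared_exp = int(np.floor(np.log2(amax))) - EMAX_E2M1
    scaled = x / 2.0 ** shared_exp
    # Round each scaled magnitude to the nearest FP4 grid point (clamps at 6).
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID[None, :]), axis=1)
    return shared_exp, np.sign(scaled), FP4_E2M1_GRID[idx]

def mxfp4_dequantize_block(shared_exp, signs, values):
    return signs * values * 2.0 ** shared_exp
```

Note how the block maximum (here 3.0) is representable exactly by construction, while smaller elements absorb the quantization error.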
2. Rationale, Numerical Properties, and Format Variants
MXFP aims to strike a balance among dynamic range, quantization precision, and hardware efficiency. The key advantages are:
- Dynamic Range: The shared 8-bit block scale provides an effective dynamic range matching or exceeding FP32, mitigating underflow/overflow even at 4–8 bits per element.
- Precision Tuning: Per-element mini-floats admit fine-grained trade-off between exponent (range) and mantissa (local precision), with the OCP standard supporting E5M2, E4M3, E3M2, E2M3, and E2M1 layouts.
- Drop-in Integration: The shared scale and per-block quantization can be implemented as a drop-in replacement for FP32/FP16 tensors and dot-product kernels, requiring minimal code changes (Rouhani et al., 2023, Cuyckens et al., 9 Nov 2025).
A summary of typical MXFP format characteristics:
| Format | Bits/Elem | Exponent | Mantissa | Block Size | Shared Scale |
|---|---|---|---|---|---|
| MXFP8-E5M2 | 8 | 5 | 2 | 32 | E8M0 (PowerOf2) |
| MXFP8-E4M3 | 8 | 4 | 3 | 32 | E8M0 (PowerOf2) |
| MXFP6-E3M2 | 6 | 3 | 2 | 32 | E8M0 (PowerOf2) |
| MXFP6-E2M3 | 6 | 2 | 3 | 32 | E8M0 (PowerOf2) |
| MXFP4-E2M1 | 4 | 2 | 1 | 32 | E8M0 (PowerOf2) |
Each block’s effective storage cost is $d + 8/k$ bits per element, where $d$ is the element width and the 8-bit shared scale is amortized over the $k = 32$ elements (an overhead of $0.25$ bits/element).
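The amortized cost is easy to verify for the standard configuration (a one-line illustration, assuming the $k = 32$, 8-bit-scale layout from the table):

```python
# Effective storage cost: element width d plus the 8-bit shared scale
# amortized over the k = 32 block, i.e. d + 8/32 = d + 0.25 bits/element.
K, SCALE_BITS = 32, 8
effective_bits = {d: d + SCALE_BITS / K for d in (8, 6, 4)}
```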
3. Quantization, Outlier Handling, and Format Extensions
MXFP uses a block-wise absolute-max quantization strategy; the block scale is chosen so the largest element fits the mini-float range, and all other elements are normalized accordingly. This provides robust outlier suppression, essential for uncalibrated post-training quantization (PTQ) and aggressive quantization in LLMs and vision transformers (Lee et al., 2024, Egiazarian et al., 27 Sep 2025, Lee et al., 16 Oct 2025).
However, small mantissa widths in very low-bit MXFP (e.g., MXFP4) induce large block-wise quantization error, particularly when blocks contain outliers or exhibit strong within-block asymmetry. Recent extensions, such as MX+ (Lee et al., 16 Oct 2025), address this by granting the block-max element increased precision (recycling its unused exponent bits as additional mantissa) and storing its index in block metadata, substantially reducing block-max quantization error.
The AMXFP4 (asymmetric MXFP4) format further introduces per-block asymmetric scaling—independently scaling positive and negative elements—addressing group-wise asymmetry created by “micro-grouping” in LLM activations. This yields accuracy gains over both symmetric MXFP4 and rotation-based INT4 methods (Lee et al., 2024).
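A minimal sketch of the asymmetric-scale idea: derive independent power-of-two exponents from the positive and negative maxima of a block. The rule and names are illustrative, not the AMXFP4 paper's exact algorithm.

```python
import numpy as np

def asymmetric_block_scales(x, emax=2):
    """Per-block asymmetric power-of-two scales in the spirit of AMXFP4.

    Positive and negative elements get independent shared exponents, so a
    block whose positive and negative magnitudes differ strongly wastes
    less of the 4-bit element range (emax=2 corresponds to E2M1).
    """
    pos_max = np.max(x[x > 0], initial=0.0)
    neg_max = np.max(-x[x < 0], initial=0.0)

    def shared_exp(amax):
        return int(np.floor(np.log2(amax))) - emax if amax > 0 else 0

    return shared_exp(pos_max), shared_exp(neg_max)
```

For a block like `[0.05, -3.0, 0.12, -1.5]`, a single symmetric scale would be dominated by the negative outlier, crushing the small positives toward zero; the dual scales keep both signs in range.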
Other format variants include adaptive hybrid block BFP/MXFP (Nanoscaling/NxFP with NanoMantissa, Adaptive Microexponent, and Code Recycling) enabling further memory savings and quantization-error reduction (Lo et al., 2024).
4. Hardware Implementation and Efficiency
MXFP formats are specifically tailored for efficient hardware realization on ASICs and FPGAs (Cuyckens et al., 9 Nov 2025, Gorodecky et al., 2024, İslamoğlu et al., 19 May 2025). Key microarchitectural features include:
- Aligned Block Access: Simultaneous processing of all $k$ elements in a block with a single shared scale register maximizes data-path utilization.
- Simple Scaling: E8M0 scaling reduces to shift operations, avoiding general multipliers for block rescaling and dequantization.
- Precision-Scalable MACs: Unified MAC datapaths support INT8, MXFP8/6/4, and exploit sub-word parallelism, sharing multipliers and exponent adders across formats (Cuyckens et al., 28 May 2025).
- Custom ISA Extensions: Block dot-products (e.g., MXDOTP) fuse block scaling, elementwise product, accumulation, and final scaling into one instruction (e.g., RISC-V custom opcodes), achieving substantial speedup and energy-efficiency gains over software implementations (İslamoğlu et al., 19 May 2025, Zaruba et al., 2020).
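A schematic model of such a fused block dot product (not the actual MXDOTP ISA semantics): because both shared scales are E8M0 powers of two, the final rescale is a single exponent addition rather than a multiply.

```python
import numpy as np

def mx_block_dot(exp_a, elems_a, exp_b, elems_b):
    """Dot product of two MX blocks, modeled after a fused MXDOTP-style op.

    Element products accumulate in higher precision (float32 here, standing
    in for the hardware accumulator); the two power-of-two block scales
    combine via one exponent addition applied once at the end.
    """
    acc = np.dot(elems_a.astype(np.float32), elems_b.astype(np.float32))
    return float(acc) * 2.0 ** (exp_a + exp_b)
```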
Efficiency metrics from prototyped MXFP hardware systems:
| Mode | Area (MAC, µm²) | Energy/Op (pJ) | System Efficiency (GOPS/W) | Top-1 Acc Δ (ResNet/ViT) |
|---|---|---|---|---|
| MXFP8 E4M3 | 1,043 | 1.11–1.17 | 1438–1675 | <0.1% |
| MXFP6 E3M2/E2M3 | ~1,000 | 1.05–1.13 | 1438–1675 | <1% (after finetune) |
| MXFP4 E2M1 | 1,896 | 0.39 | 4065 | −2% to −4% |
Hardware supports blockwise quantization with full throughput pipeline parallelism, and reduction trees with hybrid floating/integer accumulators for error-bound trade-offs (Cuyckens et al., 9 Nov 2025, Cuyckens et al., 28 May 2025).
5. Empirical Performance and Model Integration
MXFP formats offer competitive empirical accuracy across diverse ML tasks:
- LLMs: MXFP8 and mixed MXFP6/8 permit sub-8-bit “direct-cast” inference and training of LLMs (GPT, OPT, Llama3, Mistral) with <1% drop vs. FP32, including across generative, discriminative, and multimodal settings (Rouhani et al., 2023, Zhang et al., 14 Jan 2026, Cococcioni et al., 2 Oct 2025).
- Vision and Speech: ImageNet, Wav2Vec2, and other large-scale benchmarks show MXFP8/MXFP6 matching FP32/FP16 within noise; MXFP4 only after aggressive finetuning or “blockmax” extensions (Rouhani et al., 2023, Lee et al., 16 Oct 2025).
- Edge and Robotics: In edge training (e.g., robotics continual learning), MXFP enables higher throughput and area reduction at equivalent energy (Cuyckens et al., 28 May 2025).
- FFT and HPC: MXFP-based FFTs with per-block, power-of-two prescaling achieve near-FP16 fidelity (~40 dB PSNR) in MRI reconstruction, with block sizes B = 8–32 optimizing performance (Deveshwar et al., 3 Dec 2025).
Key findings from systematic quantization studies include:
- MXFP8 (W8A8) is inherently near-lossless for LLM inference and training; round-to-nearest suffices (Zhang et al., 14 Jan 2026).
- MXFP4 (W4A4) introduces significant loss; error compensation (MR-GPTQ), affine transforms (FlatQuant), and pre-scaling are critical (Lee et al., 16 Oct 2025, Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026).
- Outlier-suppression via blockwise scaling is robust, but precision loss in block-max and group-wise asymmetry are recurring challenges, especially below 6 bits (Lee et al., 16 Oct 2025, Lo et al., 2024).
- Empirically, block size $k = 32$ represents a practical compromise; larger blocks amortize the shared scale over more elements for better compression but can accumulate more quantization error in outlier-dominated blocks (Deveshwar et al., 3 Dec 2025).
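The block-size trade-off can be seen in a toy experiment: one outlier inflates the shared absmax scale for every element in its block, so smaller blocks confine the damage. The quantizer below is a crude uniform absmax stand-in for a mini-float grid, purely for illustration.

```python
import numpy as np

def block_absmax_error(x, block):
    """Total squared error of a crude absmax block quantizer (illustrative only)."""
    err = 0.0
    for i in range(0, len(x), block):
        b = x[i:i + block]
        scale = np.max(np.abs(b)) / 7.0   # 7 magnitude levels, stand-in for FP4-ish grid
        q = np.round(b / scale) * scale
        err += float(np.sum((b - q) ** 2))
    return err

rng = np.random.default_rng(0)
x = rng.normal(size=128)
x[5] = 50.0                               # a single outlier
err_small_blocks = block_absmax_error(x, 8)    # outlier ruins only its 8-element block
err_one_block = block_absmax_error(x, 128)     # outlier ruins the whole tensor
```

With small blocks the coarse scale touches only the outlier's neighbors; with one big block every element is rounded on the outlier's coarse grid, so the total error is far larger.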
6. Algorithmic and System-Level Enhancements
Enhancements and adaptations of MXFP accommodate ultra-low precision and maximize model fidelity:
- MX+ (Blockmax Exponent Reclaiming): The top-1 (“block-max”) element reuses its private exponent field as added mantissa, substantially reducing block quantization error with <0.25 average bits/element overhead (Lee et al., 16 Oct 2025).
- AMXFP4 (Asymmetric MXFP): Dual group-wise shared scales for positive/negative elements, mitigating quantization asymmetry; achieves perplexity/accuracy competitive with calibration-based INT4 quantization but with no calibration (Lee et al., 2024).
- Nanoscaling (NxFP): Incorporates a “nano-mantissa” into the shared block scale, adaptive microexponent, and code recycling for further quantization error reduction and memory savings over state-of-the-art MXFP (Lo et al., 2024).
- MR-GPTQ/PTQ Adaptations: Specialized error compensation methods (MR-GPTQ) and affine (FlatQuant) scaling tuned to the unique errors of MXFP4 scale quantization and intra-block blockmax (Egiazarian et al., 27 Sep 2025, Zhang et al., 14 Jan 2026).
- Square Block Grouping: 8x8 grouping permits efficient weight reuse in both forward and backward passes (training) and minimizes data redundancy (Cuyckens et al., 28 May 2025).
- Hardware-Algorithm Co-Design: Dynamic channel gating, block rotation (e.g., block Hadamard transforms), and blockmax detection at kernel or hardware MMU level (Cuyckens et al., 9 Nov 2025, Egiazarian et al., 27 Sep 2025, Lee et al., 16 Oct 2025).
7. Limitations, Trade-offs, and Best Practices
MXFP8 and MXFP6 can be used for direct-cast inference, mixed-precision training, and compressed activations with essentially no modification to network architectures, optimization, or training recipes for tasks including LLMs and vision models (Rouhani et al., 2023, Cococcioni et al., 2 Oct 2025, Deveshwar et al., 3 Dec 2025). MXFP4 and lower are viable only with algorithmic enhancements (e.g., blockmax reclamation, asymmetric scaling, PTQ refinement), or as activation-only formats.
The principal trade-offs of MXFP are:
| Aspect | Low Mantissa (MXFP4) | Wide Mantissa (MXFP8) |
|---|---|---|
| Dynamic Range | High (via shared scale) | High |
| Quantization Error | Moderate-high | Low |
| Outlier Handling | Strong | Strong |
| Calibration-free Inference | Difficult (MXFP4) | Easy (MXFP8) |
| Hardware Efficiency | Maximum | High |
Other practical considerations include non-commutativity of block quantization with transpose (a tensor and its transpose must each be re-quantized for backprop), locality effects (block size vs. quantization error), and hardware kernel support for non-uniform read/write access (Rouhani et al., 2023, Cuyckens et al., 28 May 2025, Gorodecky et al., 2024).
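The transpose non-commutativity is easy to demonstrate: quantizing a matrix and then transposing differs from quantizing the transpose, because the blocks (and hence the shared scales) run along different axes. The quantizer below is a simplified per-row-block absmax model, not a real MX kernel.

```python
import numpy as np

def blockwise_q(w, block=4):
    """Crude per-row-block absmax quantizer (illustrative, not a real MX kernel)."""
    out = np.empty_like(w)
    for r in range(w.shape[0]):
        for c in range(0, w.shape[1], block):
            seg = w[r, c:c + block]
            scale = np.max(np.abs(seg)) / 7.0
            out[r, c:c + block] = np.round(seg / scale) * scale
    return out

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))
# Quantizing W then transposing != quantizing W.T: the blocks (and scales)
# run along different axes, so backprop needs both quantized forms.
mismatch = not np.allclose(blockwise_q(W).T, blockwise_q(W.T))
```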
In sum, Microscaling Floating-Point formats represent a robust, scalable, and hardware-friendly numerics foundation for efficient sub-8-bit deep learning, balancing memory savings, computational efficiency, and model fidelity through the mathematical structure of block-shared scaling and per-element floating-point fields (Rouhani et al., 2023, Cococcioni et al., 2 Oct 2025, Cuyckens et al., 9 Nov 2025, İslamoğlu et al., 19 May 2025, Lee et al., 2024, Lee et al., 16 Oct 2025, Egiazarian et al., 27 Sep 2025, Lo et al., 2024, Zhang et al., 14 Jan 2026).