Microscaling (MXINT) Formats Overview
- Microscaling (MXINT) formats are block-scaled quantization techniques that pair a per-block power-of-two scaling factor with narrow-width element representations such as INT8 and FP8.
- They partition tensors into blocks (typically of size 32) and use quantization and dequantization algorithms to maintain accuracy while reducing compute and memory demands.
- Standardized by the OCP MX working group, these formats are natively supported in modern GPUs, FPGAs, and NPUs, significantly enhancing throughput and energy efficiency.
Microscaling (MXINT) Formats
Microscaling (MXINT) formats are a family of block-scaled, low-bitwidth quantization schemes for model weights, activations, and gradients, designed to enable highly efficient large-scale deep learning on modern hardware accelerators. They combine a per-block, power-of-two scaling factor ("microscale") with narrow-width element representations (e.g., INT8, various FP8/FP6 variants) to simultaneously achieve wide dynamic range, high compute/memory density, and minimal accuracy loss during both inference and training. These formats are standardized by the OCP MX working group and are natively supported in recent GPU (NVIDIA Blackwell), FPGA, and NPU architectures (Mishra et al., 30 May 2025).
1. Core Structure and Definitions
In MXINT, tensors are partitioned into blocks (typically of size k = 32). Each block is represented by a single shared scale factor X, always a power of two and encoded in the 8-bit E8M0 format, together with k elements P_i, each in a narrow integer or low-bit float format (e.g., INT8, E4M3, E5M2). The canonical INT version, MXINT8, stores each element as an 8-bit signed integer:
For a block {v_1, ..., v_k}:
- amax = max_i |v_i|,
- shared_exp = ceil(log2(amax / 127)),
- X = 2^shared_exp,
- P_i = clamp(round(v_i / X), -127, 127),
- v̂_i = X · P_i.
The block representation is
- k integer elements (P_i): usually INT8, though INT4, INT6, E2M3, etc. have also been explored,
- one shared 8-bit exponent (E8M0 scale) per block.
The same shared scaling pattern can be applied to low-bit floating-point element layouts, most commonly E4M3 (sign:1, exponent:4, mantissa:3), E5M2, and down to E2M1 (4 bits/element).
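As a concrete illustration, a single MXINT8 block can be encoded as one shared exponent plus k signed bytes. This is a minimal sketch, not a reference implementation; the helper names are ours, and the E8M0 scale is modeled simply as a Python integer exponent:

```python
import numpy as np

def encode_mxint8_block(values: np.ndarray):
    """Encode one MX block as (shared_exp, int8 elements).

    A single power-of-two scale X = 2**shared_exp (stored as E8M0 in
    hardware) is shared by all elements, each kept as a signed byte.
    """
    amax = float(np.max(np.abs(values)))
    # Round the scale up to the next power of two so |v_i / X| <= 127.
    shared_exp = int(np.ceil(np.log2(amax / 127.0))) if amax > 0 else 0
    elems = np.clip(np.rint(values / 2.0**shared_exp), -127, 127).astype(np.int8)
    return shared_exp, elems

def decode_mxint8_block(shared_exp: int, elems: np.ndarray) -> np.ndarray:
    # Dequantization: v̂_i = X · P_i, with X a pure exponent shift.
    return 2.0**shared_exp * elems.astype(np.float32)
```

Because the scale is rounded up, no element can overflow, and the worst-case reconstruction error per element is half a quantization step, 0.5 · 2^shared_exp.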
2. Quantization and Dequantization Algorithms
Microscaling quantization consists of three key steps applied per block:
- Block scaling: Find amax = max_i |v_i|; set X = 2^(floor(log2 amax) − e_max_elem) for float elements (where e_max_elem is the element format's largest exponent), or X = 2^(ceil(log2(amax / Q_max))) (power of two, rounded up) for integer elements, with Q_max = 127 for INT8.
- Element-wise quantization: P_i = clamp(round_to_nearest_even(v_i / X), −Q_max, Q_max).
- Dequantization: v̂_i = X · P_i.
Crucially, the scale is always a power-of-two, which enables efficient hardware implementation via shift/scaling rather than general multiplication (Rouhani et al., 2023, Mishra et al., 30 May 2025).
In floating-point MX formats (e.g., E4M3 or E5M2), quantization proceeds analogously: P_i = cast(v_i / X), where the elementwise cast rounds to the nearest value representable in an IEEE-754-style layout with reduced exponent and mantissa widths.
Block scales are recomputed on every block access (there is no "scale update frequency" hyperparameter). Rounding the shared exponent up (ceil) is critical: it avoids out-of-range errors and guarantees that no element overflows during quantization (Mishra et al., 30 May 2025).
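The three steps above, applied blockwise over a whole tensor, can be sketched in vectorized form. This assumes the tensor length is a multiple of the block size; `np.ldexp` applies the power-of-two scale as an exact exponent shift rather than a general multiply:

```python
import numpy as np

def mx_quantize(x: np.ndarray, block: int = 32, qmax: int = 127):
    """Blockwise MX-style quantization: one power-of-two scale per block."""
    v = x.reshape(-1, block)
    amax = np.max(np.abs(v), axis=1, keepdims=True)
    # Step 1: block scaling -- shared exponent, rounded up (no overflow).
    # All-zero blocks get exponent 0 via the np.where guard.
    shared_exp = np.ceil(np.log2(np.where(amax > 0, amax, qmax) / qmax)).astype(np.int32)
    # Step 2: element-wise quantization (round-to-nearest-even, then clamp).
    p = np.clip(np.rint(np.ldexp(v, -shared_exp)), -qmax, qmax).astype(np.int8)
    return shared_exp.ravel(), p

def mx_dequantize(shared_exp: np.ndarray, p: np.ndarray) -> np.ndarray:
    # Step 3: dequantization is an exact exponent shift, not a multiply.
    return np.ldexp(p.astype(np.float32), shared_exp[:, None])
```

Since X is a power of two, dividing by it and multiplying by it are exact in floating point; all quantization error comes from the round-and-clamp step.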
For training, various enhancements are used: symmetric clipping for INT8 (mapping to [−127, 127] to eliminate negative bias in gradient flows), optional Hadamard rotation for INT4 robustness, and first- and second-order quantization-aware training (QAT) methods (Chen et al., 29 Oct 2025, Sharify et al., 2024).
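Two of these enhancements can be sketched briefly. This is an illustrative sketch, not the cited papers' exact recipes: symmetric clipping restricts INT8 to [−127, 127] so positive and negative ranges match, and an orthonormal Hadamard rotation spreads a single outlier across the whole block before quantization:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of a Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def rotate_block(v: np.ndarray) -> np.ndarray:
    """Orthonormal Hadamard rotation: spreads one large outlier over all
    positions, lowering the block's amax (and hence the shared scale)."""
    n = v.size
    return (hadamard(n) / np.sqrt(n)) @ v

def quantize_symmetric_int8(v: np.ndarray, scale: float) -> np.ndarray:
    # Symmetric clipping: use [-127, 127] and drop -128, so quantization
    # introduces no sign-asymmetric (negative) bias.
    return np.clip(np.rint(v / scale), -127, 127).astype(np.int8)
```

The Sylvester Hadamard matrix is symmetric and self-inverse up to scale, so applying `rotate_block` twice recovers the original block exactly after dequantization.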
3. Precision Selection, Hybrid and Metadata-Enhanced Schemes
The choice of element type (INT8, E4M3, E5M2, INT6, INT4) and block size is central to the performance-vs-accuracy tradeoffs:
- MXFP8 (E4M3): matches BF16 loss in LLM pre-training up to 8B parameters while doubling matmul throughput (Mishra et al., 30 May 2025).
- MXINT8: typically outperforms MXFP8 in both accuracy and hardware efficiency at block size 32; enables stable end-to-end training and inference in LLMs (Chen et al., 29 Oct 2025).
- Lower bitwidths (MXFP4, MXINT4): 4-bit integer or float quantization incurs significant accuracy degradation unless combined with error mitigation, e.g., Hadamard rotation for INT4, or extra metadata/mantissa bits for MXFP4 (Hu et al., 27 Jan 2026, Lee et al., 16 Oct 2025).
Mixed-precision and block-level allocation: Leading software/hardware co-design systems, such as MASE, MicroMix, and MixDiT, allocate higher precision (e.g., MXFP8/MX9) to outlier-rich channels/heads and use MXFP4/6/INT4 for the majority ("energy-based" channel selection, thresholding via quantization error bounds). These approaches spatially mix precisions within layers, achieving near-baseline accuracy at substantially reduced mean bitwidths (Liu et al., 4 Aug 2025, Kim et al., 11 Apr 2025, Cheng et al., 2023).
Metadata-enhanced MX formats: MXFP (Hu et al., 27 Jan 2026) and MX (Lee et al., 16 Oct 2025) augment standard MX blocks with sub-block or per-block mini-mantissa fields, block max outlier handling, and/or multi-scale metadata. This closes 70–95% of the 4-bit to 8-bit MXFP accuracy gap at only 0.25–0.5 bits/element storage overhead.
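The storage arithmetic behind these tradeoffs is easy to check. A short sketch, assuming block size 32, an 8-bit per-block scale, and the 0.25–0.5 bits/element metadata overhead quoted above:

```python
BLOCK = 32  # typical MX block size

def bits_per_element(elem_bits: float, scale_bits: int = 8,
                     meta_bits_per_elem: float = 0.0) -> float:
    """Amortized storage cost: element bits + shared scale + optional metadata."""
    return elem_bits + scale_bits / BLOCK + meta_bits_per_elem

mxint8 = bits_per_element(8)                              # 8 + 8/32 = 8.25
mxfp4 = bits_per_element(4)                               # 4 + 8/32 = 4.25
mxfp4_meta = bits_per_element(4, meta_bits_per_elem=0.5)  # 4.25 + 0.5 = 4.75
```

At block size 32 the shared E8M0 scale costs only 0.25 bits per element, which is why the per-block scale is nearly free compared to per-tensor metadata schemes.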
4. Hardware Implementations
MXINT formats are extensively supported in co-designed hardware stacks:
- GPU tensor-cores (NVIDIA Blackwell): Native MXFP8 hardware-accelerated quantize/dequantize at block boundaries, with 2× the throughput of BF16. Full GEMM workflows quantize both row and column-oriented blocks, accumulating results in FP32 and re-quantizing if downstream ops require MX formats (Mishra et al., 30 May 2025, Liu et al., 4 Aug 2025).
- Custom MAC units (Jack Unit): Precision-scalable carry-save multipliers, per-block exponent alignment, and sub-word parallelism enable direct execution of integer and FP MX ops on a single datapath, reducing area and power by 1.2–2.0× versus baseline MACs (Noh et al., 7 Jul 2025).
- RISC-V and NPU integration (MXDOTP, SNAX): MXFP8 and MXINT8 dot products fused into single-cycle instructions, three-stage pipelines, and precision-configurable arrays. Hybrid precision-reduction trees for accumulation allow optimal trade-off between FP32/INT accumulation cost and normalization (İslamoğlu et al., 19 May 2025, Cuyckens et al., 9 Nov 2025).
- FPGA implementations: Memoryless conversion engines for all OCP-standard MXINT/MXFP variants (E5M2, E4M3, E3M2, E2M3, E2M1, INT8). Full pipeline support for block quantization, accumulation, and re-scaling in custom IP. Open-source hardware and PyTorch libraries (e.g., Brevitas) provide design flexibility for non-standard formats such as INT5, FP6 (Gorodecky et al., 2024, Samson et al., 2024).
5. Empirical Results in Large-Scale Training and Inference
LLMs:
- MXFP8-E4M3 achieves BF16-equivalent perplexity and zero/few-shot accuracy on 8B models trained on 15T tokens; throughput is doubled and memory use is halved (per parameter) compared to BF16 (Mishra et al., 30 May 2025).
- MXINT8 matches or exceeds MXFP8 in both inference (KL divergence, QSNR) and training settings, with hardware savings of 20–40% (area, energy). Loss curves and zero/few-shot accuracy for Llama-scale models are effectively indistinguishable from BF16 (Chen et al., 29 Oct 2025, Rouhani et al., 2023).
- MXINT4/NVINT4 with Hadamard rotation outperform corresponding FP formats for certain tasks; without error mitigation, INT4 and FP4 blocks suffer large degradation unless paired with GPTQ or metadata enhancements (Chen et al., 29 Oct 2025, Sharify et al., 2024, Hu et al., 27 Jan 2026).
- Hybrid and metadata-augmented formats: Mixed-precision allocation (e.g., MicroMix, MixDiT) allows 50% of channels to use MXFP4/MX6/INT4, with selectively higher-precision outlier blocks/channels. MXFP and MX yield sub-1% accuracy drops for average bitwidths near 4.5; MX recoups 20–40% of the accuracy lost in "bare" 4-bit MX (Lee et al., 16 Oct 2025, Hu et al., 27 Jan 2026).
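QSNR, used above as an inference-quality metric, is the signal-to-quantization-noise ratio in decibels. A minimal sketch (definitions vary slightly across papers; this is the common form):

```python
import numpy as np

def qsnr_db(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio, in dB: higher is better."""
    noise = np.sum((x - x_hat) ** 2)
    if noise == 0.0:
        return float('inf')  # perfect reconstruction
    return 10.0 * np.log10(np.sum(x ** 2) / noise)
```

For example, a uniform 1% relative error gives a ratio of 10^4 in power, i.e., 40 dB.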
Other modalities:
- FFT in MRI: MXFP8-E4M3 with block size 32 (3-bit mantissa) achieves near-FP16 image quality at substantial compression (Deveshwar et al., 3 Dec 2025).
- Vision Transformers (ViT): MXINT6/8 yields minimal top-1 accuracy loss with memory savings and speedups relative to Float16 on FPGAs. All operators (incl. Softmax, LayerNorm, GELU) can be mapped to MXINT+LUT structures (Xiao et al., 28 May 2025).
- Robotics, edge learning: MXINT8 (and other MX types) enable 51% lower memory use and higher throughput at iso-area/iso-energy compared to prior continuous learning accelerators (Cuyckens et al., 28 May 2025).
6. Limitations, Instabilities, and Mitigations
Though MXINT formats are highly hardware- and accuracy-efficient, there are specific limitations:
- Training instabilities: Direct end-to-end LLM training with block quantization of all tensors sometimes exhibits sharp, irrecoverable loss spikes, traced to quantization-induced multiplicative gradient bias, especially in LayerNorm affine parameters and activation blocks, where value distributions are tightly clustered (Su et al., 25 Jun 2025).
- Mitigation: Retaining higher precision for activations, LayerNorm weights, or backward passes (e.g., quantizing only weights, or forward-only quantization) eliminates instabilities and achieves full convergence (Su et al., 25 Jun 2025).
- Block size effects: Larger blocks amortize scale overhead, but yield coarser dynamic-range adaptation; block sizes 32–64 are widely used as a practical compromise (Samson et al., 2024).
- Transpose and channel reordering: Quantization and transpose are non-commutative; two orientations (row, column) must be stored per weight tensor for GEMM-dominated models (Mishra et al., 30 May 2025).
For ultra-low bitwidth MX (MXFP4, MXINT4), accuracy is only retained if augmented with metadata (MXFP, MX), judicious channel/block reordering, outlier-aware allocation, or QAT/PTQ techniques such as GPTQ for weights and SmoothQuant for activations (Hu et al., 27 Jan 2026, Liu et al., 4 Aug 2025, Sharify et al., 2024).
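The transpose non-commutativity noted above is easy to reproduce: fake-quantizing the row blocks of a matrix and then transposing generally differs from fake-quantizing the row blocks of the transpose. An illustrative sketch using a simple per-row MXINT8-style quantizer:

```python
import numpy as np

def quantize_rows(m: np.ndarray) -> np.ndarray:
    """Per-row MXINT8-style fake-quantization (quantize then dequantize)."""
    amax = np.max(np.abs(m), axis=1, keepdims=True)
    exp = np.ceil(np.log2(np.where(amax > 0, amax, 127) / 127.0))
    scale = 2.0 ** exp
    return scale * np.clip(np.rint(m / scale), -127, 127)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
a = quantize_rows(w).T   # quantize along rows, then transpose
b = quantize_rows(w.T)   # transpose, then quantize along (new) rows
# a and b generally disagree: row blocks and column blocks see different
# amax values, hence different power-of-two grids -- so GEMM-dominated
# models must store both orientations.
```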
7. Integration and Best Practices
- Software support: Key stacks include NVIDIA Transformer Engine, Megatron-LM, Brevitas (PyTorch-based), and custom pipelined dataflow compilers (MASE). These provide block quantization routines, support arbitrary MXINT/MXFP types, and search for optimal per-tensor precision (Mishra et al., 30 May 2025, Cheng et al., 2023, Samson et al., 2024).
- Model graph transformations: Replace GEMM/Conv calls with MXINT/MXFP-aware kernels, keeping master weights in higher precision for training, and quantizing on-the-fly during forward and backward passes (Rouhani et al., 2023, Mishra et al., 30 May 2025).
- Operational guidelines: Use E4M3 for all weight/activation/activation-gradient tensors in high-accuracy LLMs; keep embeddings and final projections in BF16 as standard practice. Employ round-to-nearest-even for quantization. For mixed and ultra-low precision, leverage metadata-enhanced formats, outlier-aware selection, and channel reordering (Mishra et al., 30 May 2025, Hu et al., 27 Jan 2026).
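The graph-transformation pattern above (high-precision master weights, quantize on the fly, accumulate in FP32) reduces, in fake-quantization form, to something like the following sketch. `MXLinear` and `mx_fake_quant` are hypothetical names standing in for framework-provided MX kernels:

```python
import numpy as np

def mx_fake_quant(x: np.ndarray, block: int = 32, qmax: int = 127) -> np.ndarray:
    """Quantize-dequantize in one step, emulating an MXINT8 kernel.
    Assumes the trailing dimension is a multiple of the block size."""
    v = x.reshape(-1, block)
    amax = np.max(np.abs(v), axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.where(amax > 0, amax, qmax) / qmax))
    return (scale * np.clip(np.rint(v / scale), -qmax, qmax)).reshape(x.shape)

class MXLinear:
    """Linear layer keeping an FP32 master weight, quantized per forward call."""

    def __init__(self, w: np.ndarray):
        self.master_w = w.astype(np.float32)  # high-precision master copy

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Quantize weights and activations on the fly; the matmul itself
        # accumulates in FP32, as in the GEMM workflow described above.
        return mx_fake_quant(x) @ mx_fake_quant(self.master_w).T
```

The master weights are never overwritten by their quantized versions, so optimizer updates accumulate at full precision while every matmul sees MX-format operands.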
In sum, Microscaling (MXINT) formats, by combining per-block scaling with flexible low-bit element layouts and supporting rich hardware-software co-design, provide the foundation for aggressive quantization of large models with minimal loss of fidelity, and have rapidly become the industry standard for inference and increasingly for training in AI accelerators (Mishra et al., 30 May 2025, Chen et al., 29 Oct 2025, Rouhani et al., 2023, Hu et al., 27 Jan 2026, Lee et al., 16 Oct 2025).