
MXFP8-E4M3 Floating-Point Format

Updated 25 January 2026
  • MXFP8-E4M3 is an 8-bit, non-IEEE floating-point format with 1 sign, 4 exponent, and 3 mantissa bits paired with a block-shared scaling factor to achieve efficient quantization.
  • It delivers near-lossless quantization with less than 0.5% accuracy drop for DNNs and supports rapid dot-product operations on hardware like NVIDIA H100/Blackwell and RISC-V.
  • The format’s blockwise scaling across 32 contiguous values significantly expands its dynamic range, enabling fast, low-width computations in large-scale neural network training and inference.

MXFP8-E4M3 is a block-scaled, non-IEEE, 8-bit floating-point format with 1 sign bit, 4 exponent bits, and 3 mantissa (fraction) bits, and an exponent bias of 7. It is the canonical “microscaling FP8” data type, now widely adopted for efficient quantization and training of large-scale neural networks, notably LLMs and vision transformers, on hardware that supports fast, low-width floating-point operations such as NVIDIA H100/Blackwell architectures and RISC-V extensions. The format's design balances dynamic range against precision, enjoys native hardware support for blockwise dot products, and enables near-lossless quantized inference and training at reduced memory and compute cost.

1. Bit Layout and Numeric Encoding

MXFP8-E4M3 comprises 8 bits partitioned as:

  • 1 sign bit (s): Determines the sign.
  • 4 exponent bits (E): Encoded exponent with bias 7.
  • 3 mantissa bits (m): Encodes the significand fraction.

The encoded value is

  • For $1 \leq E \leq 14$ (normalized):

$$x = (-1)^s \cdot 2^{E-\text{bias}} \cdot \left(1 + \frac{m}{2^3}\right)$$

  • For $E = 0$, $m > 0$ (subnormal):

$$x = (-1)^s \cdot 2^{1-\text{bias}} \cdot \frac{m}{2^3} = (-1)^s \cdot m \cdot 2^{-9}$$

  • For $E = 0$, $m = 0$: encodes $\pm 0$.
  • $E = 15$ with $m = 0$: $\pm\infty$ (in some implementations, only NaN).
  • $E = 15$, $m > 0$: NaN.

The representable range for normalized numbers is $2^{-6} \leq |x| \leq 240$, with machine epsilon $\varepsilon = 2^{-3} = 0.125$ at $|x| = 1$; subnormals extend the minimum magnitude to $2^{-9}$ (Wu et al., 2023, Huang et al., 2021, Zhang et al., 14 Jan 2026).

Field | Bit width | Interpretation | Range/Values
Sign ($s$) | 1 | $0$ for $+$, $1$ for $-$ | $\{0, 1\}$
Exponent ($E$) | 4 | Encoded exponent, bias $= 7$ | $[0, 15]$
Mantissa ($m$) | 3 | Significand fraction $m/2^3$ | $[0, 7]$
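
The bit layout above can be decoded in a few lines of Python. This is a minimal sketch assuming this article's convention (bias 7, $E = 15$ fully reserved for infinities/NaN), not a production codec:

```python
def decode_e4m3(byte: int) -> float:
    """Decode one 8-bit E4M3 code (0..255) to a Python float."""
    s = (byte >> 7) & 0x1        # 1 sign bit
    E = (byte >> 3) & 0xF        # 4 exponent bits, bias 7
    m = byte & 0x7               # 3 mantissa bits
    sign = -1.0 if s else 1.0
    if E == 15:                  # reserved: inf (m == 0) or NaN
        return sign * float("inf") if m == 0 else float("nan")
    if E == 0:                   # subnormal: (-1)^s * 2^(1-7) * m/8
        return sign * m * 2.0 ** -9
    return sign * 2.0 ** (E - 7) * (1.0 + m / 8.0)  # normalized

print(decode_e4m3(0b0_1110_111))   # 240.0  (largest finite: 2^7 * 1.875)
print(decode_e4m3(0b0_0001_000))   # 0.015625  (smallest normal, 2^-6)
print(decode_e4m3(0b1_0000_001))   # -0.001953125  (smallest subnormal, -2^-9)
```

With $E = 15$ fully reserved, the largest finite magnitude is 240, matching the range quoted above.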

2. Blockwise Scaling (Microscaling) and Conversion

MXFP8-E4M3 is always paired with a block-shared scale (typically $K = 32$ contiguous values share one scale), stored as an 8-bit unsigned power-of-two exponent (UE8M0, as on Blackwell GPUs). Encoding and decoding proceed as follows (Mishra et al., 30 May 2025, Gorodecky et al., 2024, İslamoğlu et al., 19 May 2025):

  • For an input tensor $V$ partitioned into blocks $b$ of length $K$:
    1. Compute $a_{\max} = \max_i |V_b[i]|$.
    2. Compute the scale exponent ${\sf expX} = \lceil \log_2(a_{\max}/\text{destmax}) \rceil$ (destmax $= 240$ for E4M3), clamped to $[-127, 127]$.
    3. Write the scale as $s_b = {\sf expX} + 127$ (in UE8M0).
    4. Per element: $Q_i = \operatorname{RN}(V_b[i]/X)$ with $X = 2^{\sf expX}$, clamped to the E4M3 limits.
    5. Store the quantized block $(Q_0, \ldots, Q_{K-1}, s_b)$.

On dequantization or hardware conversion, $Q_i$ is expanded to its real value using the per-block scale:

$$V_b[i] = Q_i \cdot X = (-1)^s \, 2^{E-\text{bias}} \left(1 + \frac{m}{8}\right) \cdot 2^{\sf expX}$$

Pseudo-code and specific conversion details appear in (Mishra et al., 30 May 2025, Gorodecky et al., 2024, Zhang et al., 14 Jan 2026).
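
The five steps above can be sketched end to end in Python. This is an illustrative model, not the cited hardware path: the E4M3 grid is enumerated explicitly, destmax $= 240$ follows this article's convention, and nearest-grid-point snapping stands in for true round-to-nearest-even:

```python
import math

E4M3_MAX = 240.0  # destmax under this article's convention (E = 15 reserved)

def e4m3_grid():
    """All non-negative finite E4M3 magnitudes (bias 7)."""
    sub = [m * 2.0 ** -9 for m in range(8)]                   # zero + subnormals
    norm = [2.0 ** (E - 7) * (1 + m / 8.0)
            for E in range(1, 15) for m in range(8)]          # normalized
    return sorted(sub + norm)

GRID = e4m3_grid()

def quantize_block(block):
    """Encode one block -> (E4M3-gridded values, UE8M0 scale byte)."""
    amax = max(abs(v) for v in block)
    expX = 0 if amax == 0 else math.ceil(math.log2(amax / E4M3_MAX))
    expX = max(-127, min(127, expX))                          # clamp to UE8M0
    X = 2.0 ** expX
    # Nearest grid point approximates RN; clamping is implicit because
    # ceil(log2) keeps |v|/X <= E4M3_MAX (absent scale clamping).
    q = [math.copysign(min(GRID, key=lambda g: abs(g - abs(v) / X)), v)
         for v in block]
    return q, expX + 127                                      # biased scale

def dequantize_block(q, scale_byte):
    X = 2.0 ** (scale_byte - 127)
    return [qi * X for qi in q]
```

After dequantization each element equals $Q_i \cdot 2^{\sf expX}$, matching the expansion formula above; values exactly on the scaled grid (e.g. 1.0 or $-3.5$ with $X = 0.5$) round-trip losslessly.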

3. Dynamic Range, Precision, and Comparison

The normalized representable range is $[2^{-6}, 240]$; with blockwise scaling, the aggregate dynamic range is greatly expanded ($2^{-9}$ up to $240 \cdot 2^{127}$ with a power-of-two block scale). The mantissa provides 4 bits of significand precision (unit in the last place, ULP, of $2^{-3}$ at $|x| = 1$). Subnormal values fill the underflow gap, enabling contiguous quantization even for small values.

Compared to E5M2 (5 exponent, 2 mantissa) or INT8:

  • E4M3 maintains higher mantissa precision but a smaller exponent range.
  • For distributions dominated by outliers, E4M3 (thanks to the exponent bits) outperforms integer and E5M2 variants in quantizing LLM activations and weights with heavy-tailed or spiky statistics (Wu et al., 2023, Kuzmin et al., 2022, Zhang et al., 14 Jan 2026).
  • Empirical results on DNNs (ResNet, VGG) show that E4M3 achieves $<0.5\%$ accuracy drop vs. FP32 with proper bias selection and block scaling (Huang et al., 2021).
  • On LLMs, MXFP8-E4M3 quantization yields $\leq 0.3\%$ average accuracy drop and $\leq 5\%$ relative perplexity increase in the worst case under post-training quantization (W8A8) (Zhang et al., 14 Jan 2026).

4. Deployment in Neural Network Quantization

For quantized LLMs and vision models, MXFP8-E4M3 is applied as follows:

  • Activations: Per-token or per-tensor affine quantization using the E4M3 grid, with scales calibrated to fit the maximum magnitude into $[-240, 240]$ (Wu et al., 2023).
  • Weights: Fine-grained group quantization (FGQ), typically with block sizes of 32 and dyadic (power-of-two) scales to facilitate efficient hardware mapping. Converted to E4M3 at runtime for hardware compatibility.
  • Post-Training Quantization (PTQ): Blockwise (size 32) scaling, round-to-nearest-even, various algorithms (GPTQ, MR-GPTQ, FlatQuant, SmoothQuant) adapt readily. Rotational transform methods (QuaRot, SpinQuant) do not improve, and sometimes harm, performance in low-bit MXFP (Zhang et al., 14 Jan 2026).
  • Scaling Heuristics: Ceil(log2)-based rounding of the scale ensures all values are in-range, minimizes overflows, and stabilizes training. In INT4/FP4+FP8 hybrid settings, scale alignment (nearest power-of-two, group-wise dyadic) enables fast reinterpret casts and high TensorCore throughput (Mishra et al., 30 May 2025, Wu et al., 2023).
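
A two-line experiment shows why the ceiling, not the floor, of the log is taken: flooring can push the scaled block maximum past destmax, forcing lossy clamping. The numbers here are illustrative, with destmax $= 240$ per this article's convention:

```python
import math

amax, destmax = 100.0, 240.0        # illustrative block maximum and E4M3 limit
ratio = math.log2(amax / destmax)

for name, expX in (("ceil", math.ceil(ratio)), ("floor", math.floor(ratio))):
    scaled_max = amax / 2.0 ** expX
    status = "in range" if scaled_max <= destmax else "OVERFLOW -> clamp"
    print(f"{name}: expX = {expX}, amax/X = {scaled_max}, {status}")
```

With these numbers, ceil gives ${\sf expX} = -1$ (scaled maximum 200, in range) while floor gives ${\sf expX} = -2$ (scaled maximum 400, which must be clamped to 240).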

Empirically, E4M3 quantization enables quantized pre-training and inference of LLMs at double the speed of BF16, while matching BF16 perplexity and accuracy for models up to 8B or even 16B parameters (Mishra et al., 30 May 2025, Zhang et al., 14 Jan 2026).

5. Hardware Implementations and Architectures

MXFP8-E4M3 enjoys broad hardware support, both in commercial GPUs and open-source cores:

  • NVIDIA H100/Blackwell TensorCores: Native E4M3 arithmetic kernels perform MXFP8 matmuls and fused-multiply operations at full speed; block-wise scaling is managed by software with per-block scales, and dyadic-intensity optimizations facilitate ultra-low-latency FP4→FP8 conversion (Wu et al., 2023, Mishra et al., 30 May 2025).
  • RISC-V MXDOTP Extension: Dedicated instructions implement fast blockwise dot products with packed 8-bit E4M3 operands and power-of-two block scales, achieving $>350$ GFLOP/W at 1 GHz, 25× higher energy efficiency than software-emulated FP8 (İslamoğlu et al., 19 May 2025).
  • FPGA Block Converters: Conversion units for FP32→MXFP8-E4M3 efficiently process batches of 32 values, hierarchically extracting the scale and mapping to E4M3 (Gorodecky et al., 2024). The design partitions LUTs between max-exponent trees and quantization logic, supporting high-throughput, low-area implementations.
  • FPnew Transprecision FPU: Parameterized multiprecision units, supporting E4M3 as a native format, achieve up to 2.95 TFLOP/s/W in 8×SIMD mode and 14.8 GFLOP/s single-core at 923 MHz (Mach et al., 2020). All IEEE-754 rounding and exception handling is supported.
Hardware | Block Size | Throughput | Energy Efficiency (TFLOP/s/W)
NVIDIA H100/Blackwell | 32 | 2× BF16 GEMM | not specified
RISC-V MXDOTP (8-core) | 8 | 102 GFLOP/s | 0.356
FPnew (Ariane core) | 8 | 14.8 GFLOP/s | 1.25 (up to 2.95)
FPGA (xcvu440) | 32 | N/A | N/A

Throughput measured at baseline voltage/frequency; see the respective papers for details (İslamoğlu et al., 19 May 2025, Mach et al., 2020, Gorodecky et al., 2024, Mishra et al., 30 May 2025, Wu et al., 2023).

6. Algorithmic and Practical Considerations

  • Rounding: Round-to-nearest-even is standard; out-of-range values saturate to the E4M3 representable minimum or maximum (Zhang et al., 14 Jan 2026, Mishra et al., 30 May 2025).
  • Best Practices:
    • Block size $K = 32$ for scaling.
    • Use “ceil(log2)” in scale computation to bound all values in range.
    • Calibration set size: 512–1024 samples per layer suffice.
    • Prefer error-compensation or affine PTQ algorithms; avoid rotational transforms for E4M3 (Zhang et al., 14 Jan 2026).
  • Low-Rank Compensation (LoRC): Adding a blockwise low-rank correction to quantized weights effectively restores subnormal precision; a single rank $r = 4$–$8$ correction can close most of the quality gap for small models (Wu et al., 2023).
  • Format Selection: For models/activations with Gaussian distributions and negligible outliers, E5M2 may give marginally higher SQNR, but E4M3 is superior when LLMs or ViTs present heavy tails, outliers, or large activation spikes (Kuzmin et al., 2022).
  • Per-Layer or Per-Group Flexibility: Allowing bias or field widths to be tuned per-layer can recover several tenths of a percent in accuracy without retraining (Huang et al., 2021).
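
The low-rank compensation idea can be sketched with a rank-$r$ SVD of the quantization residual. This is a generic illustration in the spirit of LoRC, not the exact procedure of the cited papers: `fake_quantize` is a crude uniform-grid stand-in for the MXFP8 pipeline, and the shapes and rank are arbitrary illustrative choices:

```python
import numpy as np

def fake_quantize(W, step=0.05):
    """Placeholder quantizer: snap weights to a uniform grid."""
    return np.round(W / step) * step

def lowrank_compensate(W, Q, r=4):
    """Return (U, V) with W ~= Q + U @ V, where U @ V has rank r."""
    # SVD of the quantization residual; keep the r leading directions.
    U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))       # toy weight matrix
Q = fake_quantize(W)
U, V = lowrank_compensate(W, Q, r=8)

err_q = np.linalg.norm(W - Q)           # plain quantization error
err_c = np.linalg.norm(W - (Q + U @ V)) # error after rank-8 correction
print(f"quantization error {err_q:.4f} -> compensated {err_c:.4f}")
```

Because the truncated SVD removes the residual's leading singular values, the compensated error is strictly smaller whenever the residual is nonzero; at inference the correction costs one extra rank-$r$ matmul per layer.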

7. Empirical Results and Application Scope

Benchmarking across LLM, vision, and mixture-of-expert models consistently finds:

  • W8A8 MXFP8-E4M3 delivers near-lossless quantization in LLMs, with $<1\%$ degradation across post-training quantization methods. W4A8 is feasible with refined PTQ but exhibits 1–3% additional loss; W4A4 is not recommended (Zhang et al., 14 Jan 2026).
  • On DNN classifiers (ResNet-50, VGG-16), E4M3 layer-wise flexible quantization keeps top-1 error within $0.3\%$ of FP32 (Huang et al., 2021).
  • On H100/Blackwell hardware, MXFP8-E4M3 enables all-core 8-bit inference and training with no significant accuracy loss (PPL curve overlap with BF16), even for multi-billion-parameter LLMs over extended training horizons (Mishra et al., 30 May 2025).
  • Hardware-efficient deployment: zero overhead in dot-product and matrix-multiplication kernels (as fast as INT8 or faster, and more accurate in the presence of outliers).

MXFP8-E4M3, by balancing exponent and mantissa bits and leveraging blockwise scaling, provides a broadly applicable, hardware-native, and quantization-efficient format for modern neural network inference and training pipelines (Mishra et al., 30 May 2025, İslamoğlu et al., 19 May 2025, Wu et al., 2023, Zhang et al., 14 Jan 2026).
