FP8 Low-Precision Computation
- FP8 low-precision computation is a method of using 8-bit floating-point formats like E4M3 and E5M2 to optimize deep learning performance by reducing memory usage and increasing computational throughput.
- It employs advanced quantization techniques and dynamic scaling strategies to maintain training stability and accuracy within 0.1–1% of higher-precision baselines.
- Modern hardware accelerators integrate specialized FP8 pipelines, including fused kernels and precision-tunable accumulators, achieving significant speedups and memory savings in large-scale models.
FP8 low-precision computation refers to the use of 8-bit floating-point number formats, typically E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits), in neural network training and inference—supplanting traditional 16- or 32-bit floating-point representations to achieve substantial improvements in memory utilization, bandwidth, and computational throughput. Modern accelerators (e.g., NVIDIA Hopper/Ada, Intel Gaudi2/3, recent RISC-V cores) implement optimized FP8 pipelines, enabling full-stack precision reduction and new algorithmic interventions that maintain training stability and accuracy at scale. This article surveys the formats, hardware architectures, quantization algorithms, architectural modifications, stability considerations, and empirical outcomes underpinning contemporary FP8 research and deployment.
1. FP8 Number Formats and Representational Properties
FP8 arithmetic centers on two IEEE-inspired formats standardized for deep learning, E4M3 and E5M2 (Micikevicius et al., 2022, Fishman et al., 2024):
- E4M3: 1 sign bit, 4 exponent bits (bias = 7), 3 mantissa bits; dynamic range for normals: $\pm 2^{-6}$ to $\pm 448$; unit roundoff (½ ulp) for normals ≤ 6.25%.
- E5M2: 1 sign bit, 5 exponent bits (bias = 15), 2 mantissa bits; dynamic range for normals: $\pm 2^{-14}$ to $\pm 57344$; unit roundoff ≤ 12.5%.
E4M3 omits infinities entirely, reserving a single NaN codepoint, and trades IEEE special values for an extra mantissa bit of precision, making it suitable for forward activations and weights. E5M2 retains full IEEE-754 special-value coverage (NaN/Inf/subnormals), offering extended dynamic range suited to gradient storage in the backward pass (Micikevicius et al., 2022).
Arithmetic units (e.g., Tensor Cores) typically upcast inputs to higher precision (FP16/BF16) for accumulation, with inputs/outputs quantized to FP8, and scale factors (per-tensor or per-block) encode the dynamic range mapping (Shah et al., 2024, Lee et al., 13 Mar 2025).
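The limits quoted above follow directly from each format's (exponent, mantissa, bias) parameters. A minimal sketch deriving them (function names are illustrative):

```python
# Sketch: deriving the representational limits of E4M3/E5M2 from their
# (exponent bits, mantissa bits, bias) parameters.

def fp8_max_normal(exp_bits: int, man_bits: int, bias: int, reserve_nan: bool) -> float:
    """Largest finite value of an FP8 format.

    E4M3-style formats (reserve_nan=True) reuse the all-ones exponent for
    finite values and keep only one NaN codepoint, so the all-ones mantissa
    pattern is excluded; E5M2-style formats follow IEEE-754 and reserve the
    all-ones exponent field for Inf/NaN.
    """
    if reserve_nan:  # E4M3: top exponent usable, mantissa != all ones
        e_max = (2**exp_bits - 1) - bias
        frac = 1 + (2**man_bits - 2) / 2**man_bits
    else:            # E5M2: all-ones exponent reserved for Inf/NaN
        e_max = (2**exp_bits - 2) - bias
        frac = 2 - 2**-man_bits
    return frac * 2.0**e_max

def unit_roundoff(man_bits: int) -> float:
    """Half-ULP relative error bound for normals: 2^-(m+1)."""
    return 2.0 ** -(man_bits + 1)

print(fp8_max_normal(4, 3, 7, True))    # E4M3 -> 448.0
print(fp8_max_normal(5, 2, 15, False))  # E5M2 -> 57344.0
print(unit_roundoff(3), unit_roundoff(2))  # 0.0625 0.125
```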
2. Quantization, Scaling Strategies, and Algorithmic Pipelines
Quantization/Dequantization
For a real-valued tensor $x$, quantization uses a per-tensor or per-block scale $s$, mapping as follows:
- Quantization: $x_q = \mathrm{round}(x / s)$, clamped to the FP8 representable range.
- Dequantization: $\hat{x} = s \cdot x_q$.
Selection of the scale $s$ is vital: common policies include per-tensor (max-abs), per-channel (row or column max), and hybrid per-group scaling (Gangwal et al., 2022, Fishman et al., 2024). Block quantization (128-element tiles) is widely adopted for compute alignment and hardware efficiency (Shah et al., 2024, Wang et al., 4 Nov 2025).
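The per-tensor max-abs policy can be sketched as follows; this is an illustrative NumPy stand-in, and a real pipeline would cast to a hardware FP8 dtype after applying the scale:

```python
# Minimal sketch of per-tensor max-abs scaling targeting E4M3.
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize(x: np.ndarray):
    """Scale x so its max-abs maps onto the FP8 representable limit."""
    amax = float(np.max(np.abs(x)))
    s = max(amax / E4M3_MAX, 1e-12)           # guard all-zero tensors
    xq = np.clip(x / s, -E4M3_MAX, E4M3_MAX)  # hardware would cast to FP8 here
    return xq, s

def dequantize(xq: np.ndarray, s: float) -> np.ndarray:
    return xq * s

x = np.random.randn(4, 8).astype(np.float32)
xq, s = quantize(x)
# Round trip is exact here; it becomes lossy once xq is cast to FP8.
assert np.allclose(dequantize(xq, s), x, rtol=1e-5)
```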
Dynamic Scaling and Stability
Delayed scaling schemes preserve stability by maintaining a history of absolute maxima and adaptively tuning the scale to avoid late-stage underflow/overflow. For FP8 GEMM in LLM training, this yields a minimal-overhead, tensor-wise pipeline, where all quantization and dequantization is handled in the forward and backward passes with a single scale tracked per tensor (Hernández-Cano et al., 26 May 2025).
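The delayed-scaling idea can be sketched as below: the scale applied at step $t$ is derived from a rolling history of previously observed absolute maxima, so quantizing never waits on a synchronous reduction over the current tensor. Class and parameter names here are illustrative, not from any specific framework:

```python
# Sketch of delayed scaling with an amax history (illustrative names).
from collections import deque
import numpy as np

E4M3_MAX = 448.0

class DelayedScaler:
    def __init__(self, history_len: int = 16):
        self.amax_history = deque(maxlen=history_len)

    def scale(self) -> float:
        # Worst case over recent history; unit scale before any data arrives.
        if not self.amax_history:
            return 1.0
        return max(self.amax_history) / E4M3_MAX

    def quantize(self, x: np.ndarray):
        s = self.scale()                     # based on *past* steps only
        xq = np.clip(x / s, -E4M3_MAX, E4M3_MAX)
        self.amax_history.append(float(np.abs(x).max()))  # record after use
        return xq, s
```

Because the scale lags one step behind the data, the clip guards against a sudden amax jump until the history catches up.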
Smooth quantization, dynamic range expansion (via nonlinear mappings), and "unit scaling" (initializing all tensors to unit variance) further mitigate stability risks from limited dynamic range, especially for optimizer state or in non-linear contexts (Xi et al., 2024, Blake et al., 2023).
Numeric Error and Consistency
FP8 inherently increases quantization and rounding error, but judicious accumulation in FP16 or FP32, careful scale selection, and distribution-aware designs keep final training loss and accuracy within 0.1–1% of BF16/FP16 baselines across major models (Micikevicius et al., 2022, Xi et al., 2024, Peng et al., 2023).
3. Architectural and Dataflow Innovations for FP8 Training
GEMM and Grouped GEMM
State-of-the-art approaches achieve end-to-end FP8 computation for all GEMMs in transformer blocks (including attention), using architectural modifications, such as "post-norm" residuals with low initial gain, to suppress outliers (Hernández-Cano et al., 26 May 2025). This circumvents prior reliance on fallbacks to higher precision in attention or activation bottlenecks.
Motivated by memory/computation constraints in MoE models or variable input sizes, TMA-Adaptive FP8 Grouped GEMM eliminates inefficient padding by dynamically selecting pre-defined TMA descriptors and employing dual-phase load/store strategies to guarantee memory alignment without wasting bandwidth (Su et al., 7 Aug 2025, Wang et al., 4 Nov 2025).
Fused and Blockwise Kernels
FlashAttention-3 and similar kernels maximize Hopper/Gaudi utilization by interleaving asynchronous data movement (TMA) with GEMM, block quantization, block softmax, and hardware-fused per-block rescale. This enables FP8 attention to approach 1.2 PFLOPS with high hardware utilization and a 1.6–2.0× speedup over FP16 (Shah et al., 2024).
Fused operator suites (e.g., fused SwiGLU+Quantize, fused permute+pad) further reduce kernel launches and HBM traffic, essential in massive MoE training (Wang et al., 4 Nov 2025).
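The blockwise quantization these kernels rely on can be sketched as follows: each 128-element tile carries its own scale, so a single outlier only degrades the resolution of its own block. This is an illustrative NumPy version, not a kernel implementation:

```python
# Sketch of 128-element block quantization: one scale per tile, so an
# outlier's effect on resolution is confined to its own block.
import numpy as np

E4M3_MAX = 448.0
BLOCK = 128

def block_quantize(x: np.ndarray):
    assert x.size % BLOCK == 0, "real kernels pad or mask ragged tails"
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.maximum(scales, 1e-12)            # guard all-zero blocks
    q = np.clip(blocks / scales, -E4M3_MAX, E4M3_MAX)
    return q, scales

def block_dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q * scales).reshape(shape)
```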
Specialized Hardware
Dedicated FP8 MAC units with stochastic rounding and precision-tunable accumulators (e.g., FP8→FP12 MAC with eager SR and no subnormals) are prototyped to reduce hardware delay, area, and energy—up to 50% compared with single precision—while maintaining accuracy (Ali et al., 2024). Open RISC-V extensions (MiniFloat-NN, ExSdotp) implement SIMD ISA instructions and fused expanding sum-of-dot-products, supporting both E4M3 and E5M2 FP8 at up to 575 GFLOPS/W (Bertaccini et al., 2022).
4. Stability, Outliers, and Numeric Pathologies
Prolonged FP8 training uncovers unique instabilities, most notably "outlier amplification" in SwiGLU activations during trillion-token scale training. Early FOG (Fast Outlier-Guarded) architectures employ post-norm residuals, frozen QK normalization gain, and input scaling to maintain low activation kurtosis, which empirically suppresses divergence and loss-blowup (Hernández-Cano et al., 26 May 2025, Fishman et al., 2024).
"Smooth-SwiGLU" is developed to guard against alignment-induced spikes by enveloping the quadratic nonlinearity in a per-channel scaling block: one branch is divided by a per-channel factor before the product and the result is rescaled afterwards. This preserves output numerics while capping activation spikes within representable FP8 range. With this, trillion-token, 7B-parameter decoders achieve stable convergence on par with BF16 (Fishman et al., 2024).
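A rough NumPy sketch of the per-channel scaling idea: because the scale is divided out before the product and multiplied back after, the function computed is unchanged while the scaled intermediate stays inside FP8 range. The exact placement of the scales in Fishman et al. (2024) differs in detail; this only illustrates the exactness argument:

```python
# Sketch of the Smooth-SwiGLU scaling idea (illustrative, not the
# reference implementation): scale one branch down before the product,
# scale the result back up afterwards.
import numpy as np

def swish(z: np.ndarray) -> np.ndarray:
    return z / (1.0 + np.exp(-z))

def smooth_swiglu(x, w_gate, w_up, s):
    """s: per-channel scale, one entry per hidden channel."""
    gate = swish(x @ w_gate)
    up = (x @ w_up) / s        # scaled branch: values fit the FP8 range
    return (gate * up) * s     # inverse scale restores exact numerics
```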
General stability recommendations include tracking activation kurtosis as a health metric, monitoring loss landscape sharpness, using dynamic precision scheduling, and limiting FP8 to non-sensitive layers where needed (Lee et al., 2024).
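The kurtosis health metric recommended above is cheap to compute on sampled activations; heavy-tailed distributions (high kurtosis) warn of outlier amplification before loss spikes appear:

```python
# Activation kurtosis as a training-health metric: a Gaussian scores 3,
# heavy-tailed (outlier-prone) activations score well above 3.
import numpy as np

def kurtosis(x: np.ndarray) -> float:
    """Pearson kurtosis E[(x - mu)^4] / sigma^4."""
    mu = x.mean()
    sigma = x.std()
    return float(((x - mu) ** 4).mean() / sigma**4)
```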
5. System-Level Impact and Empirical Outcomes
Across hardware and scale, FP8 computation yields substantial efficiency gains:
| Scenario | Baseline | FP8 Throughput | Speedup | Memory Savings | Accuracy Gap |
|---|---|---|---|---|---|
| LLM training (8B param) | 9.1k tok/s (BF16) | 12.8k tok/s | +40% | – | matches or slightly exceeds BF16 (Hernández-Cano et al., 26 May 2025) |
| Llama-7B pretraining | 12.7 samples/s (BF16, Gaudi2) | 16.9 samples/s | +34% | Adam moments: −30% | ≤0.3 pt (Fishman et al., 2024) |
| Low-rank GEMM (N=20k) | 49 TFLOPS (PyTorch FP32) | 378 TFLOPS | ≈7.7× | −75% | 1–2% rel. err. (Metere, 24 Nov 2025) |
| CNNs (FPGA, M4E3) | – | 0.60–1.42 GOPS/DSP | ~4× best prior | ~50% | <0.5% top-1 (Wu et al., 2020) |
In large MoE and LLM pretraining, end-to-end FP8 (with optimizer and activations quantized) enables not only dramatic speedups (1.4–2× vs. BF16), but also >1.5× reduction in overall GPU memory consumption, facilitating batch-size scaling and model scaling on fixed hardware budgets (Xi et al., 2024, Peng et al., 2023, Wang et al., 4 Nov 2025). COAT and FP8-LM further integrate quantized optimizer moments and mixed-granularity activation solutions that minimize quantization error via dynamic range expansion or per-group scaling (Xi et al., 2024, Peng et al., 2023).
In the context of reinforcement learning, low-precision rollout stacks with blockwise FP8 quantization for both weights and KV cache, and token-level importance sampling corrections, provide up to 44% end-to-end speedups with negligible reward and policy learning drift (Qiu et al., 26 Jan 2026).
6. Deployment, Best Practices, and Hardware Guidelines
- Scaling granularity: Prefer per-tensor or (for maximal accuracy) per-channel scales, especially for weights with heterogeneous statistics (Fishman et al., 2024, Lee et al., 13 Mar 2025).
- Activation norm and residuals: Use post-norm (LayerScale/RMSNorm) on residual branches and frozen QK normalization for attention to avoid activation kurtosis spikes (Hernández-Cano et al., 26 May 2025).
- GEMM and grouped GEMM: Use padding-elimination (e.g., TMA-adaptive) kernels for grouped computation; batch-size and layout restrictions (e.g., block_N a multiple of 64) apply (Su et al., 7 Aug 2025).
- Optimizer states: Quantize first moment with E4M3, second moment with E5M2 or higher range; dynamic range expansion or similar nonlinearity reduces bias (Xi et al., 2024, Fishman et al., 2024).
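The dynamic-range-expansion idea for optimizer state can be sketched as below: applying $x \mapsto x^k$ with $0 < k < 1$ before quantization compresses the moment's dynamic range, and the inverse power is applied after dequantization. The exponent $k$ and function names here are illustrative, not the published recipe:

```python
# Sketch of dynamic range expansion for quantizing Adam's (non-negative)
# second moment: compress with x**k before scaling, invert on dequantize.
import numpy as np

E4M3_MAX = 448.0

def quantize_moment(v: np.ndarray, k: float = 0.5):
    vk = v ** k                        # v >= 0 for the second moment
    s = max(float(vk.max()) / E4M3_MAX, 1e-12)
    q = np.clip(vk / s, 0.0, E4M3_MAX)  # hardware would cast to FP8 here
    return q, s

def dequantize_moment(q: np.ndarray, s: float, k: float = 0.5) -> np.ndarray:
    return (q * s) ** (1.0 / k)
```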
- FP8 is not universally robust: Without architectural and algorithmic modification (e.g., precision scheduling, smoothing, per-layer higher-precision), LLM training may experience early divergence under FP8. Monitoring activation sharpness and loss spike frequency is essential as a stability diagnostic (Lee et al., 2024).
- Hardware/ISA design: Accumulating in FP16/FP32, stochastic rounding in MACs, and supporting variable mantissa widths or per-group alignment maximizes efficiency (Ali et al., 2024, Zhao et al., 5 Feb 2026, Bertaccini et al., 2022).
7. Current Limitations and Future Directions
FP8’s efficacy in deep learning arises from workload-inherent statistical resilience and algorithmic interventions, but caution remains:
- Dynamic range and mantissa resolution trade-off: Insufficient exponent width sharply degrades stability; E4M3/E5M2 provide empirically sufficient coverage (Micikevicius et al., 2022, Lee et al., 2024).
- Long-horizon stability: Subtle pathologies (e.g., in SwiGLU) emerge only at trillions-of-token scale, requiring specialized remedies (e.g., Smooth-SwiGLU) (Fishman et al., 2024).
- Double quantization error: End-to-end casting-free FP8 dataflows in MoE must include scaling-consistent transposes and fused quantization operators to prevent compounding grid mismatch errors (Wang et al., 4 Nov 2025).
- Hardware co-design: Continued development of custom accelerators, flexible mantissa-width predictors, and optimized storage/access patterns is essential (Ali et al., 2024, Zhao et al., 5 Feb 2026).
- Numerics-consistent inference: FP8-trained models enable simple W8A8 deployments, but must coordinate quantization parameters to avoid train–inference mismatch (Narayan et al., 9 Feb 2025).
Future research will address automated per-layer adaptive precision, maintainability in extreme-scale distributed systems, and further integration of FP8 formats in scientific workloads via techniques such as Ozaki-scheme decomposition (Mukunoki, 1 Aug 2025).
References:
(Ali et al., 2024; Bertaccini et al., 2022; Blake et al., 2023; Fishman et al., 2024; Gangwal et al., 2022; Hernández-Cano et al., 26 May 2025; Lee et al., 2024; Lee et al., 13 Mar 2025; Metere, 24 Nov 2025; Micikevicius et al., 2022; Mukunoki, 1 Aug 2025; Narayan et al., 9 Feb 2025; Peng et al., 2023; Qiu et al., 26 Jan 2026; Shah et al., 2024; Su et al., 7 Aug 2025; Wang et al., 4 Nov 2025; Wu et al., 2020; Xi et al., 2024; Zhao et al., 5 Feb 2026).