Flexible Floating-Point 8 (FFP8) Overview
- FFP8 is a parameterized 8-bit floating-point format that allows tunable exponent and mantissa allocation to flexibly match numerical precision with dynamic range requirements.
- It achieves significant gains in energy efficiency and throughput while preserving full-precision accuracy for deep neural network training and inference.
- FFP8 integrates into modern hardware with configurable modes and advanced quantization techniques that optimize compute performance and reduce power consumption.
Flexible Floating-Point 8 (FFP8) refers to a family of parameterized 8-bit floating-point number systems that generalize and extend the rigid IEEE-754 FP8 formats by allowing a tunable assignment of bits to the exponent and mantissa fields, potentially adjustable exponent bias, and—in some proposals—even configuration of field boundaries at runtime or per-layer. FFP8 has emerged as a key enabler for aggressive quantization in deep neural network (DNN) training and inference, low-power embedded systems, high-throughput HPC workloads, and as an abstraction for SIMD and digital compute-in-memory (CIM) hardware units, providing a flexible tradeoff between dynamic range and numerical precision. The ability to match representation precision to the statistical properties of specific tensors or kernels allows one to sustain full-precision (FP32) accuracy in practice while achieving order-of-magnitude gains in compute throughput, energy, and bandwidth (Wang et al., 2018, Noune et al., 2022, Zhang et al., 2023, Zhao et al., 5 Feb 2026). FFP8 stands in contrast to classic INT8/fixed-point quantization by affording exponent-driven dynamic range scaling, which is particularly advantageous for tensors exhibiting heavy-tailed or outlier-prone distributions (Kuzmin et al., 2022, Micikevicius et al., 2022).
1. Mathematical Structure and Format Variants
An FFP8 number is typically defined by a triplet or quadruple of parameters: the number of sign bits ($s$, usually 1 or 0), exponent bits ($e$), mantissa/fraction bits ($m$), and an exponent bias ($b$), such that $s + e + m = 8$. Key instantiations include:
| Notation | Exponent bits ($e$) | Mantissa bits ($m$) | Bias ($b$) | Max normal value | Min normal value | Relative precision ($2^{-m}$) |
|---|---|---|---|---|---|---|
| E5M2 | 5 | 2 | 15 | $57344$ | $2^{-14} \approx 6.1 \times 10^{-5}$ | $2^{-2}$ |
| E4M3 | 4 | 3 | 7 | $448$ | $2^{-6} \approx 0.0156$ | $2^{-3}$ |
| E3M4 | 3 | 4 | 3 | $15.5$ | $2^{-2} = 0.25$ | $2^{-4}$ |
| E2M5 | 2 | 5 | 1 | $3.9375$ | $1$ | $2^{-5}$ |
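The normal-range endpoints above follow directly from $(e, m, b)$. A short sketch (function name is ours) under the IEEE-style convention that the all-ones exponent code is reserved for Inf/NaN; note that OCP-style E4M3 instead keeps a single NaN code and extends normals to 448, which this sketch does not model:

```python
def ffp8_stats(e_bits, m_bits, bias):
    """Largest/smallest normal magnitudes of an FFP8 (e, m, bias) format,
    assuming the all-ones exponent code is reserved for Inf/NaN."""
    top_code = (1 << e_bits) - 1          # all-ones exponent: Inf/NaN
    e_max = (top_code - 1) - bias         # largest normal exponent
    max_normal = (2 - 2.0 ** (-m_bits)) * 2.0 ** e_max
    min_normal = 2.0 ** (1 - bias)        # smallest normal: 2^(1-b)
    return max_normal, min_normal
```

For example, `ffp8_stats(5, 2, 15)` reproduces the E5M2 row of the table.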
The value represented by an FFP8-encoded word is decoded as:
- Normalized ($E \neq 0$): $x = (-1)^S \cdot 2^{E-b} \cdot (1 + F/2^m)$,
- Subnormal ($E = 0$): $x = (-1)^S \cdot 2^{1-b} \cdot (F/2^m)$, where $S$ is the sign bit, $E$ the unsigned exponent field, and $F$ the fraction field (Micikevicius et al., 2022, Wang et al., 2018).
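The two decoding rules can be sketched directly (a minimal illustration with a hypothetical function name; it assumes one sign bit and ignores Inf/NaN codes):

```python
def decode_ffp8(byte, e_bits, m_bits, bias):
    """Decode an 8-bit FFP8 word with layout [S | E | F] (MSB to LSB)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    E = (byte >> m_bits) & ((1 << e_bits) - 1)   # unsigned exponent field
    F = byte & ((1 << m_bits) - 1)               # fraction field
    if E != 0:   # normalized: (-1)^S * 2^(E-b) * (1 + F/2^m)
        return sign * 2.0 ** (E - bias) * (1 + F / (1 << m_bits))
    else:        # subnormal: (-1)^S * 2^(1-b) * (F/2^m)
        return sign * 2.0 ** (1 - bias) * (F / (1 << m_bits))
```

Under E4M3 parameters, for instance, the byte `0b0_0111_000` decodes to 1.0.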
More advanced schemes, such as floating–floating-point (F2P) (Cohen et al., 2024) or tapered-precision Takum (Hunhold, 18 Mar 2025), dynamically vary the exponent/mantissa split per value or per range, achieving locally optimal SNR and minimal mean squared error over wide magnitude spans.
2. Quantization, Rounding, and Accumulation Methods
FFP8 leverages flexible quantization algorithms that map high-precision real values to the closest representable FFP8 value, possibly with per-tensor or per-layer scaling (bias) to “center” the range and reduce quantization or saturation artifacts (Noune et al., 2022, Huang et al., 2021). The conversion process entails:
- Scaling: $x' = x/s$ for a suitable per-tensor scale $s$,
- Exponent extraction and bias adjustment: $E = \lfloor \log_2 |x'| \rfloor + b$, clamped to the representable exponent range,
- Mantissa rounding: either round-to-nearest or stochastic rounding, with stochastic methods shown to significantly preserve convergence and final accuracy for accumulators and updates (Wang et al., 2018).
- Special value handling: In E5M2, all-1 exponents denote NaN/Inf; E4M3 collapses most such codes to a single NaN, or uses all codes for normals (extended dynamic range) (Micikevicius et al., 2022).
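The scaling, exponent-clamping, and rounding steps above can be combined into one minimal sketch (our own helper; it saturates rather than emitting Inf, ignores NaN codes, and assumes the all-ones exponent is reserved, so OCP E4M3's extended 448 range is not modeled):

```python
import math
import random

def quantize_ffp8(x, e_bits, m_bits, bias, stochastic=False):
    """Map a real value to a representable FFP8 value, saturating at max normal."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_max = ((1 << e_bits) - 2) - bias                 # largest normal exponent
    max_normal = (2 - 2.0 ** (-m_bits)) * 2.0 ** e_max
    if mag >= max_normal:
        return sign * max_normal                       # saturate, don't overflow
    e = max(math.floor(math.log2(mag)), 1 - bias)      # clamp into subnormal range
    step = 2.0 ** (e - m_bits)                         # code spacing in this binade
    q = mag / step
    if stochastic:                                     # round up with prob = frac part
        q = math.floor(q) + (random.random() < q - math.floor(q))
    else:
        q = round(q)                                   # nearest, ties to even
    return sign * q * step
```

With E4M3-style parameters, 0.3 rounds to the nearest representable value 0.3125; stochastic mode instead rounds up or down with probability proportional to proximity, so small contributions survive in expectation.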
For high-accuracy dot-products and accumulations in reduced precision, chunk-based accumulation splits long summations into blocks (chunks), accumulates each in a slightly higher precision (e.g., FP16), then combines the chunk results. This observably lowers swamping risks and reduces the accumulated rounding-error bound from $O(n)$ for naive sequential summation to $O(n/CL + CL)$ for chunk length $CL$, with modest chunk sizes (on the order of 64) sufficing in modern DNNs (Wang et al., 2018). Weight updates, especially in momentum SGD, benefit from FP16 accumulators with stochastic rounding to avoid losing gradient contributions below the coarse FFP8 quantization threshold.
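The swamping effect and its chunked remedy can be demonstrated with emulated FP16 accumulators (NumPy `float16` stands in for hardware FP16; the chunk length 64 is illustrative):

```python
import numpy as np

def chunked_sum(values, chunk_len=64):
    """Accumulate each chunk in FP16, then combine partial sums in FP32.
    A sketch in the spirit of chunk-based accumulation (Wang et al., 2018)."""
    total = np.float32(0.0)
    for i in range(0, len(values), chunk_len):
        acc = np.float16(0.0)
        for v in values[i:i + chunk_len]:
            acc = np.float16(acc + np.float16(v))   # low-precision inner loop
        total = np.float32(total + np.float32(acc)) # higher-precision combine
    return float(total)

def naive_fp16_sum(values):
    """Naive FP16 accumulation: once the running sum grows, small addends
    fall below its ULP and are silently dropped (swamping)."""
    acc = np.float16(0.0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return float(acc)
```

Summing 32768 copies of 0.001, the chunked version stays close to the true 32.768, while the naive FP16 sum stalls once the accumulator's ULP exceeds the addend.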
3. Hardware Implementations and Microarchitectural Integration
FFP8 is designed for seamless deployment in digital hardware, including SIMD datapaths, transprecision FPUs, and digital CIM arrays. Parameterized FFP8 units (e.g., FPnew (Mach et al., 2020), various accelerator proposals (Zhao et al., 5 Feb 2026, Zhang et al., 2023)) support:
- Dynamically or statically configurable encoding parameters,
- Efficient SIMD (e.g., 8-wide parallel lanes on 64-bit datapaths),
- Sharable hardware across E5M2/E4M3/E3M4/E2M5 modes with minimal area and power overhead; even full support for multiple subformats constitutes <5% area increment over rigid FP8-only units (Zhang et al., 2023, Zhao et al., 5 Feb 2026),
- Subnormal support and per-lane decoding with low-latency critical paths,
- Linear or branch-free mantissa alignment in CIM or MAC arrays achieving tens-of-TFLOPS/W energy efficiency (Zhao et al., 5 Feb 2026).
The FPnew architecture and similar proposals validate that FP8 units present 2–4× energy and area reduction compared to FP16, with silicon-measured efficiency up to $2.95$ TFLOPS/W at 14× FP64 throughput (Mach et al., 2020). Pointer-based shift units and decomposed MAC arrays further minimize latency/energy penalties in digital CIM, supporting dynamic per-group bitwidth assignment (Zhao et al., 5 Feb 2026).
4. Impact on Deep Learning and Quantized Inference
FFP8 provides full-precision accuracy for training and inference across a range of DNN workloads, including CNNs, RNNs, and Transformer-based LLMs, typically with zero or negligible degradation relative to 16- or 32-bit floating-point baselines. Several empirical results demonstrate:
- For ImageNet, ResNet-18/50, and transformer models, mixed FFP8 (e.g., E4M3 for activations/weights, E5M2 for gradients) achieves statistical equivalence to FP32/FP16 references (Noune et al., 2022, Micikevicius et al., 2022),
- Flexible per-layer or per-tensor tuning of the exponent/mantissa split and bias minimizes quantization error and maximizes represented dynamic range for each tensor, with no retraining required (Huang et al., 2021, Zhang et al., 2023),
- For post-training quantization of LLMs, FP8 activation quantization (especially E4M3 or flexible variants) consistently outperforms INT8, and FP4→FP8 quantized weights via simple exponent bit-shifting maintain accuracy with negligible loss (Wu et al., 2023),
- Use of advanced FFP8 schemes, such as variable-split F2P or Takum tapered-precision, further reduces error on outlier-prone or dynamic workloads (federated learning, telemetry) compared to any single fixed 8-bit float (Cohen et al., 2024, Hunhold, 18 Mar 2025).
Summary accuracy and hardware outcomes are as follows:
| Workload | FFP8 Config | ΔAccuracy / PPL vs FP32 | Hardware Efficiency |
|---|---|---|---|
| ResNet-50/ImageNet | E4M3 act, E5M2 grad | <0.1% | $0.80$ pJ/FOP (FP8 SIMD FMA) (Mach et al., 2020) |
| Transformer LLM | E4M3 act, E2M1→E5M2 wt | <0.5 PPL | 2× FP16 throughput (NVIDIA H100) (Wu et al., 2023) |
| MobileNet | Best per-layer FFP8 search | <0.5% | 12% lower runtime, 27% lower BW (Tagliavini et al., 2017) |
| Llama-7b | DSBP FFP8 in CIM | <0.5% | 33.7 TFLOPS/W (Zhao et al., 5 Feb 2026) |
FFP8's advantage over INT8/fixed-point arises from adaptive exponent granularity, allowing post-training quantization to preserve information in outlier-prone or high-dynamic range layers.
5. Mixed-Precision and Flexible Frameworks
Flexible FFP8 underpins a range of mixed-precision frameworks, where each layer, activation, or weight tensor can be assigned the most suitable split or format, automatically selected by data-driven search or heuristics:
- Layer-wise optimization matches per-tensor exponent/mantissa splits to observed data statistics (Huang et al., 2021, Zhang et al., 2023),
- Resolution-aware search, using average squared resolution per candidate, accelerates format selection by 5–10× while maintaining near-optimal accuracy (Zhang et al., 2023),
- Dynamic precision selection in interactive LLM serving is realized by, e.g., NestedFP, which enables seamless toggling between FP8 and FP16 arithmetic within a unified memory representation, with hybrid mode switching based on service-level objectives (Lee et al., 29 May 2025).
This approach delivers robust accuracy even on highly quantized hardware, while maximizing compute and bandwidth efficiency.
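As an illustration of such data-driven format selection, the sketch below runs a generic per-tensor MSE search over candidate $(e, m)$ splits; this is our own simplification, not the resolution-aware algorithm of Zhang et al.:

```python
import numpy as np

def best_ffp8_split(tensor, candidates=((5, 2), (4, 3), (3, 4), (2, 5))):
    """Pick the (e, m) split minimizing quantization MSE for a tensor,
    after per-tensor scaling to the format's max normal value."""
    best_split, best_mse = None, None
    for e, m in candidates:
        bias = (1 << (e - 1)) - 1                     # standard IEEE-style bias
        e_max = ((1 << e) - 2) - bias                 # all-ones exponent reserved
        max_normal = (2 - 2.0 ** (-m)) * 2.0 ** e_max
        scale = np.max(np.abs(tensor)) / max_normal   # per-tensor scale
        x = tensor / scale
        # Quantize: clamp magnitude, clamp exponent, round mantissa.
        mag = np.clip(np.abs(x), 2.0 ** (1 - bias - m), max_normal)
        exp = np.clip(np.floor(np.log2(mag)), 1 - bias, None)
        step = 2.0 ** (exp - m)
        q = np.sign(x) * np.round(mag / step) * step * scale
        mse = float(np.mean((q - tensor) ** 2))
        if best_mse is None or mse < best_mse:
            best_split, best_mse = (e, m), mse
    return best_split
```

Data spanning many binades (e.g., powers of two from $2^{-10}$ to $2^{10}$) drives the search toward exponent-heavy splits, whereas narrow-range data favors mantissa-heavy ones.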
6. Extensions: Variable-Split and Tapered Precision FFP8
Recent extensions of FFP8 propose runtime-variable exponent/mantissa splits or "tapered-precision" formats such as Takum and floating–floating-point (F2P) (Hunhold, 18 Mar 2025, Cohen et al., 2024). These designs:
- Allow the bitwidth allocated to the exponent to change per value or range, allocating more mantissa bits near the numerical center (for high-precision), and more exponent bits for large/small-magnitude outliers,
- Exploit regime fields or hyper-exponent bits to control the field boundary,
- Outperform rigid E4M3/E5M2 IEEE subtypes in both mean and worst-case error for workloads with variable magnitude distributions,
- Integrate into major SIMD ISA extensions (e.g., AVX10.2) as a configurable mode, requiring minimal extra hardware logic (Hunhold, 18 Mar 2025),
- Empirical results across SuiteSparse matrices and deep networks show tapered-precision FFP8 eliminates catastrophic overflow and narrows error histograms by an order of magnitude relative to standard float8.
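A toy illustration of the per-range split idea is shown below, with thresholds chosen to match the largest normal exponent of each standard sub-format; this is a hypothetical heuristic of ours, not the actual F2P or Takum encoding:

```python
import math

def choose_split(group_abs_max):
    """Pick an (e, m) split for a value group from its magnitude span:
    mantissa-heavy near unity, exponent-heavy for far outliers."""
    if group_abs_max == 0:
        return (2, 5)
    octaves = abs(math.log2(group_abs_max))  # distance from 1.0 in binades
    if octaves <= 1:
        return (2, 5)    # E2M5: normal exponents only up to 1
    if octaves <= 3:
        return (3, 4)    # E3M4: exponents up to 3
    if octaves <= 7:
        return (4, 3)    # E4M3-style range
    return (5, 2)        # E5M2: maximal dynamic range
```

Values near unity thus keep five mantissa bits, while a group containing a $10^5$-scale outlier falls back to the wide-range E5M2-style split.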
This class of FFP8 designs provides a path toward further generalization and adoption as the de facto low-precision number system in both ML and general-purpose high-throughput computation.
7. Design Trade-offs and Best Practices
Selecting FFP8 sub-formats involves:
- Balancing dynamic range (needed for gradients, outliers, or deep models) against quantization resolution (important for activations/weights),
- Exploiting per-tensor or per-layer scaling and mixed-format assignment (e.g., E4M3/E3M4 for most tensors, E5M2 for exceptional ranges),
- Prioritizing hardware simplicity for deployment: unified decoding, chunked or pointer-based accumulation, subnormal support,
- Enabling stochastic rounding for accumulations and weight updates, especially if accumulations are performed at reduced width.
In most deployments, per-tensor scaling and standard nearest-even rounding suffice to achieve full-precision parity. Hardware and software co-design, including compiler integration for format/tensor mapping, supports efficient deployment pipelines (Zhang et al., 2023, Noune et al., 2022).
References
- (Wang et al., 2018) Training Deep Neural Networks with 8-bit Floating Point Numbers
- (Noune et al., 2022) 8-bit Numerical Formats for Deep Neural Networks
- (Zhang et al., 2023) Exploring the Potential of Flexible 8-bit Format: Design and Algorithm
- (Mach et al., 2020) FPnew: An Open-Source Multi-Format Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing
- (Huang et al., 2021) All-You-Can-Fit 8-Bit Flexible Floating-Point Format for Accurate and Memory-Efficient Inference of Deep Neural Networks
- (Tagliavini et al., 2017) A Transprecision Floating-Point Platform for Ultra-Low Power Computing
- (Kuzmin et al., 2022) FP8 Quantization: The Power of the Exponent
- (Zhao et al., 5 Feb 2026) Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction
- (Cohen et al., 2024) Floating-floating point: a highly accurate number representation with flexible Counting ranges
- (Hunhold, 18 Mar 2025) Streamlining SIMD ISA Extensions with Takum Arithmetic: A Case Study on Intel AVX10.2
- (Lee et al., 29 May 2025) NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs
- (Micikevicius et al., 2022) FP8 Formats for Deep Learning
- (Wu et al., 2023) ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats