INT4+FP8 Mixed-Precision Quantization
- The INT4+FP8 variant is a mixed-precision quantization scheme that uses per-group symmetric INT4 weight quantization with FP8 activations to balance storage reduction and dynamic range robustness.
- It employs scale alignment, outlier smoothing, and low-rank compensation techniques to manage quantization error, achieving performance metrics close to unquantized FP16 models.
- Hardware-software co-design with fused dot-products and optimized tensor cores delivers 3–5× speedups in attention mechanisms and significant memory bandwidth savings.
The INT4+FP8 variant refers to a mixed-precision quantization scheme for deep neural network inference and, in some cases, training, in which some tensors (typically weights) are quantized to signed 4-bit integers (INT4) and others (typically activations or specific intermediate results) to 8-bit floating-point (FP8) formats. This approach exploits the memory and compute efficiency of ultra-low precision representations while using floating-point's robustness to outliers and range for sensitive components. INT4+FP8 has emerged as a leading configuration for LLMs and attention accelerators on modern hardware with native FP8 and INT4 tensor-core support.
1. FP8 and INT4 Number Formats
FP8 and INT4 represent distinct quantization paradigms:
- FP8: In practice, the E4M3 format is most frequently deployed. Each FP8 value is encoded as one sign bit, four exponent bits (bias 7), and three mantissa bits. The dynamic range is approximately $\pm 448$, with a smallest subnormal of $2^{-9}$; subnormals are supported, while NaN/inf encodings are omitted to minimize hardware footprint. Normal values are decoded as $(-1)^s \cdot 2^{e-7} \cdot (1 + m/8)$, and subnormals ($e=0$) as $(-1)^s \cdot 2^{-6} \cdot (m/8)$.
- INT4: Signed 4-bit two's-complement: range $[-8, 7]$. Quantization is symmetric, parameterized by a per-group or per-channel scale $s$, with an optional zero-point in generalized usage. Quantization proceeds as $q = \mathrm{clip}(\mathrm{round}(x/s), -8, 7)$, and dequantization as $\hat{x} = s \cdot q$.
FP8 excels at representing wide dynamic range and outlier values; INT4 offers maximal storage and bandwidth reduction for tensors with narrow value distributions (Zhang et al., 2023).
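The quantize/dequantize rules above can be emulated in a few lines. The following NumPy sketch is a software emulation of round-to-nearest onto the INT4 and E4M3 grids, not a hardware-exact implementation:

```python
import numpy as np

def quant_int4(x, scale):
    """Symmetric INT4: q = clip(round(x/s), -8, 7)."""
    return np.clip(np.round(x / scale), -8, 7).astype(np.int8)

def dequant_int4(q, scale):
    """Dequantization: x_hat = s * q."""
    return scale * q.astype(np.float32)

def quant_fp8_e4m3(x):
    """Round to the nearest E4M3-representable value
    (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits)."""
    x = np.asarray(x, dtype=np.float64)
    sign, a = np.sign(x), np.minimum(np.abs(x), 448.0)  # clamp to E4M3 max
    # per-value exponent, clamped at the normal/subnormal boundary (e >= -6)
    e = np.clip(np.floor(np.log2(np.where(a > 0, a, 1.0))), -6, 8)
    step = 2.0 ** (e - 3)  # 3 mantissa bits -> 8 quantization steps per binade
    return sign * np.round(a / step) * step
```

Note how the FP8 step size grows with the exponent (constant relative error), while the INT4 step is fixed by the scale (constant absolute error) — this is the formal basis of the trade-off stated above.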
2. Mixed-Precision Quantization Pipeline
Mixed-precision INT4+FP8 quantization schemes treat weights and activations asymmetrically to optimize both storage and dynamic range management:
- INT4 Weight Quantization: Imposes per-group symmetric quantization (group sizes 32–128), computing $s_g = \max_{w \in g} \lvert w \rvert / 7$. All weights within a group are quantized against this $s_g$.
- FP8 Activation Quantization: Per-token or per-tensor scaling is applied. For a vector $x$, compute $s = \max_i \lvert x_i \rvert / 448$, and quantize $x_i / s$ to E4M3.
- Scale-Alignment: For high-throughput hardware (NVIDIA Hopper, etc.), the scales for INT4 and FP8 operands must align (typically to powers of two) to enable dequantization in fused pipelines (Wu et al., 2023, 2505.20839).
- Outlier Smoothing and Block Centering: Outlier-prone channels (e.g., of the key matrix $K$) are centered by subtracting their mean before quantization; the removed mean terms are restored in the output via a single GEMV (Zhang et al., 2024).
- Low-Rank Compensation (LoRC): For small and medium models, the quantization error is corrected post-facto by storing a truncated SVD $U_r V_r$ of the quantization residual $W - \hat{W}$ (Wu et al., 2023).
A representative forward pass for a Transformer block may structure its quantization assignments as follows:
| Component | Format | Quantization Granularity | Scale |
|---|---|---|---|
| Linear Weights | INT4 | Per-group (g=32–128) | $s_g = \max_{w \in g} \lvert w \rvert / 7$ |
| Linear Activations | FP8 E4M3 | Per-token or block | $s = \max_i \lvert x_i \rvert / 448$ |
| Attention Q, K | INT4 | Per-warp (thread-level) | per-warp $\max \lvert \cdot \rvert / 7$ |
| Attention P, V | FP8 | Block or channel | see above |
This layout is leveraged in state-of-the-art LLM quantization, attention kernels, and accelerator designs (Wu et al., 2023, 2505.20839, Zhang et al., 2024).
3. Hardware Implementations and Kernel Co-Design
Optimized INT4+FP8 support requires explicit architectural features in both software kernels and hardware datapath design:
- FPGA/RTL Fusion: The “Configurable Mixed-Precision Fused Dot Product Unit” (Rout et al., 19 Nov 2025) fuses integer and floating-point multipliers, with the active format pair selected via a mode bus. The INT4+FP8 implementation utilizes a single Wallace-tree multiplier shared across pipeline stages, supporting both formats and fused accumulation.
- Tensor Cores: On NVIDIA GPUs (INT4 tensor-core paths from Ampere, native FP8 from Hopper onward), quantized operands are loaded from shared memory, dequantization is performed via fused LUTs or direct conversion instructions, and GEMM operations are dispatched with FP32 or reduced-precision (FP22) accumulators. Conversion between quantized and dequantized domains is performed in-register.
- Specialized Attention Kernels: SageAttention2's per-thread INT4 QK and FP8 PV matmuls (utilizing mma.f32.f8.f8.f32 on Hopper) achieve peak throughputs of 288–485 TOPS, with memory bandwidth savings of 3× over baseline FP16 kernels (Zhang et al., 2024).
Notably, all designs avoid mixing INT and FP operands within a single dot-product; instead, ops are pipeline-clean, with kernel dispatch determined by operand tags (Zhang et al., 2023).
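The pipeline separation can be illustrated with a small emulation of the two-stage attention dataflow: the $QK^\top$ matmul runs entirely in the integer domain and the $PV$ matmul entirely in the floating-point domain, with scales applied outside each matmul. This is a NumPy sketch of the dataflow only — per-row scales stand in for the per-thread/per-block kernel logic, and E4M3 rounding of $P$ and $V$ is elided:

```python
import numpy as np

def int4_quant(x):
    """Per-row symmetric INT4 quantization (scale = row max / 7)."""
    s = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    s = np.where(s == 0, 1.0, s)
    return np.clip(np.round(x / s), -8, 7), s

def mixed_precision_attention(Q, K, V):
    """Pipeline-separated attention: Q.K^T in the integer path,
    softmax in FP32, then P.V in the floating-point path."""
    qQ, sQ = int4_quant(Q)
    qK, sK = int4_quant(K)
    # integer pipeline: INT4 x INT4 accumulate, rescaled outside the matmul
    S = (qQ @ qK.T) * (sQ * sK.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    # floating-point pipeline: P and V scaled into the E4M3 range
    sP, sV = P.max() / 448.0, np.abs(V).max() / 448.0
    return ((P / sP) @ (V / sV)) * (sP * sV)
```

Note that no dot-product ever mixes an integer operand with a floating-point one: rescaling happens between the two stages, matching the operand-tag dispatch described above.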
4. Quantization Error Analysis and Mitigation
Achieving high accuracy at INT4+FP8 requires careful selection of quantization parameters and supplemental error-mitigation techniques:
- Quantization Error Bounds: INT4 uniform quantization error does not exceed $s/2$ per element; worst-case relative error for FP8 E4M3 is $2^{-4}$ ($\approx 6.25\%$) for normalized values.
- Dynamic Range vs. Granularity: FP8's dynamic exponent outperforms INT8 in handling activation outliers. INT4's narrow value range demands block selection based on tensor dynamic range; where histogram or peak-to-mean ratio is high, switching to FP8 or INT8 is preferable (Zhang et al., 2023).
- Scale Alignment: Scale factors are rounded or aligned to the nearest power of two, which allows shifts to replace expensive multiplies at inference time without significant accuracy penalty; the max-based group-shifting variant (Method M2) preserves perplexity close to that of unaligned scales (Wu et al., 2023).
- LoRC and Smoothing: Rank-4–8 truncated SVD low-rank compensation, outlier mean-centering, and per-warp normalization all dramatically reduce quantization-induced accuracy loss without incurring significant compute or memory overhead (Wu et al., 2023, Zhang et al., 2024).
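The LoRC step can be sketched as follows (NumPy; the rank and group size are illustrative, and the quantizer is the per-group symmetric INT4 scheme defined earlier). The residual $E = W - \hat{W}$ is approximated by its best rank-$r$ factorization via truncated SVD:

```python
import numpy as np

def lorc_compensate(W, rank=8, group_size=64):
    """Quantize W to per-group INT4, then store a rank-`rank` SVD of the
    residual E = W - W_hat so that W ~= W_hat + Ur @ Vr at inference."""
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    s = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0
    s = np.where(s == 0, 1.0, s)
    W_hat = (np.clip(np.round(Wg / s), -8, 7) * s).reshape(W.shape)
    U, sv, Vt = np.linalg.svd(W - W_hat, full_matrices=False)
    Ur = U[:, :rank] * sv[:rank]  # fold singular values into the left factor
    return W_hat, Ur, Vt[:rank]
```

By the Eckart–Young theorem, adding the rank-$r$ term can only reduce the Frobenius-norm quantization error; the stored factors cost $r(m+n)$ extra parameters per $m \times n$ weight matrix.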
5. Empirical Results and Performance Benchmarks
Benchmarks demonstrate superior efficiency–accuracy trade-offs for INT4+FP8 across multiple domains:
- LLMs: For LLaMA-7B, INT4/FP8 W4A8 outperforms pure INT4/INT8, reducing average PPL by 0.14 and staying close to unquantized FP16 perplexity up to 30B parameters (Wu et al., 2023). On Llama3-8B, FireQ reports feedforward and prefill-phase speedups at batch size 16, with accuracy drop remaining within 1–2 points of state-of-the-art (QServe) (2505.20839).
- Attention Mechanism: SageAttention2 demonstrates kernel speedups of 3× and above over FlashAttention2 and xformers on RTX 4090 and Hopper GPUs, incurring negligible end-to-end metric drop (WikiText/Lambda/MMLU/CLIPSIM) on models ranging from 2B to 9B parameters (Zhang et al., 2024).
- FPGA Accelerator: On Xilinx Alveo U55C, the INT4+FP8 fused dot-product pipeline sustains 9.6 GFLOPS at 298 MHz, using 16% fewer LUTs and FFs than comparable INT8+FP8 units (Rout et al., 19 Nov 2025).
These results indicate that INT4+FP8 schemes approach FP16 or mixed INT8/FP8 accuracy on a wide variety of tasks, while offering substantial gains in model size, compute, and memory bandwidth.
6. Design Guidelines and Limitations
Empirically validated best practices include:
- Subnormal Preservation: Always retain subnormals in FP8; omitting them impairs accuracy for small weights/activations (Zhang et al., 2023).
- Pipeline Separation: Avoid fusing INT and FP operands in a single MAC; use distinct pipelines or kernel dispatch (Zhang et al., 2023).
- Granularity Selection: Deploy INT4 on tensors with restricted dynamic range or quantile spread; default to FP8 (E3M4/E4M3) otherwise. Per-layer or per-block selection can be automated using MSE or “resolution-aware” surrogate objectives (Zhang et al., 2023).
- Adaptive Precision Fallback: Allow automatic upgrade (e.g., layer/timestep “INT8+FP8” fallback) for high-error components; this closes the empirical accuracy gap at minimal cost (Zhang et al., 2024).
- Control-Path Simplicity: Hardware FSMs and mode-buses can be extended to add new low-precision format pairs, with negligible RTL modifications (Rout et al., 19 Nov 2025).
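The granularity-selection guideline can be automated with a simple per-tensor check. The following is a heuristic NumPy sketch, not the exact MSE or "resolution-aware" surrogate objectives of the cited work; the tolerance `tol` is an illustrative hyperparameter:

```python
import numpy as np

def pick_format(x, tol=0.01):
    """Prefer INT4 (maximal bandwidth saving) when its per-tensor relative
    reconstruction MSE stays under `tol`; otherwise fall back to FP8, whose
    dynamic exponent tolerates outliers and wide value ranges."""
    s = np.abs(x).max() / 7.0
    s = s if s > 0 else 1.0
    x_hat = s * np.clip(np.round(x / s), -8, 7)  # symmetric INT4 round-trip
    rel_mse = np.mean((x - x_hat) ** 2) / max(np.mean(x ** 2), 1e-12)
    return "INT4" if rel_mse < tol else "FP8"
```

A narrow, outlier-free distribution passes the INT4 check, while a tensor with a single large outlier inflates the INT4 scale and fails it — matching the peak-to-mean-ratio heuristic described above.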
Limitations observed:
- Not all hardware supports both INT4 and FP8 tensor-core paths; for example, native FP8 tensor cores are unavailable on NVIDIA GPUs before Hopper.
- Calibration and smoothing overheads, while low, can become non-negligible for extremely short sequence lengths or very narrow matrix blocks.
- For highly skewed or dynamic tensors, maintaining accuracy may require leaving some weights/activations at 8+ bits or using LoRC/post-correction (2505.20839).
7. Research Landscape and Extensions
The INT4+FP8 paradigm generalizes beyond LLMs and attention to vision models, object detection, and segmentation tasks using the flexible assignment algorithms proposed in (Zhang et al., 2023). The core kernel and hardware recipes scale to INT2, BFLOAT4, and other reduced-precision “microscaling” representations with minimal engineering overhead (Rout et al., 19 Nov 2025). The method has rapidly become standard in post-training quantization pipelines and LLM serving infrastructure, reflecting its strong efficiency–accuracy trade-off and direct hardware support.
A plausible implication is that as further hardware platforms expose fused low-precision pipelines and software stacks expose more granular mixed-precision APIs, INT4+FP8 will continue to displace legacy INT8/FP16 modes for bandwidth- and latency-bound inference deployments in large-scale deep learning workloads.