Blockwise Quantization & Microscaling
- Blockwise quantization and microscaling are data representation techniques that partition tensors into small blocks with a shared dynamic range for efficient compression.
- They enable fine-grained adaptation to local distributional statistics, supporting robust inference and sub-8-bit training for large-scale AI models.
- Advanced methods like asymmetric scaling, rotation transforms, and metadata augmentation further reduce quantization error while leveraging hardware acceleration.
Blockwise Quantization and Microscaling
Blockwise quantization and microscaling define a family of data representations and algorithms for compressing neural network weights, activations, and gradients by partitioning tensors into blocks and sharing a scale per block. Each element within a block is quantized, typically using low-bit-width integer or floating-point representations, while the block’s dynamic range is captured by an associated shared scale. Microscaling (MX), a central concept, denotes configurations where block sizes are small—often 16–32 elements—enabling fine-grained adaptation to local distributional statistics. These approaches have emerged as a critical technology in large-scale AI model deployment and training, driven by hardware advances and the need for memory- and bandwidth-efficient inference and training of LLMs (Rouhani et al., 2023, Fasoli et al., 26 Jan 2026, Sharify et al., 2024, Koo et al., 2024, Hu et al., 27 Jan 2026, Su et al., 25 Jun 2025, Lee et al., 2024, Egiazarian et al., 27 Sep 2025, Chen et al., 30 Nov 2025, Elangovan et al., 7 Feb 2025).
1. Formal Definition and Data Formats
Fundamentally, blockwise quantization partitions a tensor into non-overlapping blocks of size $k$. Within each block $B = \{x_1, \dots, x_k\}$, the maximum absolute value is used to compute a shared scale:

$$s_B = 2^{\lfloor \log_2 \max_i |x_i| \rfloor \,-\, e_{\max}}$$

where $e_{\max}$ is the exponent of the largest magnitude representable in the chosen element format. The per-block quantization maps each $x_i$ to a quantized code $q_i$:

$$q_i = \mathrm{Q}\!\left(x_i / s_B\right)$$

where $\mathrm{Q}$ rounds to the nearest representable element value, and dequantization reconstructs $\hat{x}_i = s_B\, q_i$.
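As a concrete instance of this scheme, a minimal sketch assuming the FP4 (E2M1) element format, whose representable magnitudes are {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a largest exponent of 2:

```python
import math

# Representable magnitudes of FP4 E2M1; its largest exponent is 2.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2

def quantize_block(block):
    """Shared power-of-two scale from the block max, then round each
    scaled element to the nearest E2M1 magnitude (sign handled separately)."""
    amax = max(abs(v) for v in block)
    e_shared = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** e_shared
    codes = [math.copysign(min(E2M1_GRID, key=lambda g: abs(g - abs(v) / scale)), v)
             for v in block]
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * q for q in codes]

scale, codes = quantize_block([0.11, -0.30, 0.02, 0.74])  # one block, k = 4
print(scale)                 # 0.125  (= 2 ** (floor(log2 0.74) - 2))
print(codes)                 # [1.0, -2.0, 0.0, 6.0]
print(dequantize_block(scale, codes))
```

Note how the shared scale is chosen so that the block maximum lands near the top of the element format's range, leaving the finest granularity for the remaining elements.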
MX formats typically integrate:
- Scale storage: an 8-bit shared scale per block (e.g., a power-of-two exponent as in E8M0, or FP8 E4M3/E5M2).
- Block size: commonly 16 or 32, aligned with hardware (e.g., NVIDIA Blackwell).
- Element representation: Signed INT4/6/8, FP4 (E2M1), FP6 (E2M3 / E3M2), FP8 (E4M3 / E5M2), with trade-offs between dynamic range and quantization error.
Several derivative formats augment base MX schemes, including asymmetric scaling (AMXFP4 (Lee et al., 2024)), hybrid codebooks with block clustering (LO-BCQ (Elangovan et al., 7 Feb 2025)), and metadata-augmented encoding (M²XFP (Hu et al., 27 Jan 2026)).
2. Algorithmic Workflows for Inference and Training
The practical use of blockwise quantization/microscaling involves multi-stage pipelines:
Conversion and Inference Flows (Rouhani et al., 2023, Sharify et al., 2024):
- Direct-cast inference: All weights/activations are quantized on-the-fly with per-block scales. GEMM kernels operate directly on (scale, code) tuples, accumulating outputs in FP16/32. Non-dot product ops execute in higher precision.
- Error-diffusion PTQ: Scale and quantization error are calibrated on a small dataset, propagating residuals across blocks.
- Quant-aware finetuning: float2mx quantization is inserted into forward passes during finetuning; backward is done in FP32.
- Training flow: FP32 master weights are held; all GEMMs quantize weights and activations; dot-products accumulate in FP32; gradients are quantized before the next operation; optimizer and learning rate remain unchanged.
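The direct-cast flow can be sketched end to end in NumPy. This is a minimal illustration, not a kernel: it uses INT8 symmetric codes and float per-block scales (rather than the power-of-two shared exponents of the MX formats) purely for brevity:

```python
import numpy as np

def quantize_int8_blocks(x, block=32):
    """Per-block symmetric INT8 quantization along the last axis."""
    xb = x.reshape(*x.shape[:-1], -1, block)
    amax = np.abs(xb).max(axis=-1, keepdims=True)
    scale = np.where(amax == 0, 1.0, amax / 127.0)
    codes = np.clip(np.rint(xb / scale), -127, 127).astype(np.int8)
    return codes, scale

def blockwise_gemm(a, b, block=32):
    """C = A @ B with both operands quantized blockwise along the shared
    (reduction) dimension; per-block integer dot products are rescaled by
    the two shared scales and accumulated in float, as in direct-cast."""
    qa, sa = quantize_int8_blocks(a, block)        # (M, K/b, b), (M, K/b, 1)
    qb, sb = quantize_int8_blocks(b.T, block)      # (N, K/b, b), (N, K/b, 1)
    partial = np.einsum('mkb,nkb->mnk', qa.astype(np.int32), qb.astype(np.int32))
    scales = sa[:, None, :, 0] * sb[None, :, :, 0]  # (M, N, K/b)
    return (partial * scales).sum(axis=-1)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 64)), rng.standard_normal((64, 4))
err = np.abs(blockwise_gemm(A, B) - A @ B).max()
print(err)
```

The key structural point is that both operands share block boundaries along the reduction dimension, so each block-level integer dot product needs only one scale per operand before float accumulation.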
Extensions:
- OPAL's Outlier-Preserved Quantization: Top-n outliers in each block are stored in higher precision (Koo et al., 2024).
- Mixed-precision layers: More sensitive layers may employ higher bit-widths, and robust layers may use lower (Koo et al., 2024).
- Rotation and transform preconditioning: Use blockwise Hadamard or WUSH transforms prior to quantization to redistribute outlier effects and minimize quantization error (Chen et al., 30 Nov 2025, Egiazarian et al., 27 Sep 2025).
- Block clustering with codebooks: Assign blocks to clusters with custom codebooks (BCQ, LO-BCQ), updating assignments and codebooks to minimize total quantization error (Elangovan et al., 7 Feb 2025).
Pseudocode for float2mx (basic microscaling quantization) (Rouhani et al., 2023):
```python
from math import floor, log2

# exponent_max and quantize_to_element_format are format-specific helpers
# (e.g., emax = 2 and round-to-nearest onto the E2M1 grid for FP4).
def float2mx(V, element_format):
    emax_elem = exponent_max(element_format)
    # Shared power-of-two scale from the block's absolute maximum.
    e_shared = floor(log2(max(abs(v) for v in V))) - emax_elem
    s_b = 2 ** e_shared
    # Per-element codes in the target element format.
    q = [quantize_to_element_format(v / s_b) for v in V]
    return s_b, q
```
3. Accuracy, Compression, and Error Characteristics
Empirical results show distinct trade-offs among bit-width, block size, and quantization format. Representative results (Rouhani et al., 2023, Sharify et al., 2024, Egiazarian et al., 27 Sep 2025, Lee et al., 2024, Hu et al., 27 Jan 2026, Chen et al., 30 Nov 2025, Elangovan et al., 7 Feb 2025):
- MXINT8 and MXFP6 achieve sub-0.5% top-1 accuracy drop on ImageNet and negligible BLEU/WER/AUC drops on translation, speech, and recommendation.
- Generative Inference: MXINT8 and MXFP6 match FP32 on GPT3-175B and LLaMA-7B within statistical error; MXFP6's drop in accuracy is < 0.01 absolute.
- Sub-8-bit training: Mixed MXFP4/MXFP6 or pure MXFP6 achieve <0.5% loss increase on generative LMs up to 1.5B parameters.
- FP4 (MXFP4, NVFP4): With naive power-of-two scales, quantization incurs significant error: e.g., MXFP4-PoT PPL ≈10.1 vs. FP16 baseline ≈6.0; MR-GPTQ or asymmetric scaling closes the PPL gap to ≈0.5 (Lee et al., 2024, Egiazarian et al., 27 Sep 2025).
- Block size: Smaller blocks typically yield lower error due to tighter scaling, up to a threshold—at very small block sizes, scale quantization granularity or distributional effects can increase error (see Section 4).
- Outlier preservation: Directly storing a handful of outliers per block in BF16 keeps accuracy loss below 1% at modest storage overhead and moderates the otherwise severe impact of extreme elements (Koo et al., 2024).
A summary table:
| Format | Block Size | Reported Accuracy Loss | Notable Method | Source |
|---|---|---|---|---|
| MXINT8/MXFP6 | 32 | <0.5% top-1 | Direct-cast, PTQ, QAT | (Rouhani et al., 2023) |
| MXFP4-PoT | 32 | ~5%–50% task dep. | Naive, no calibration | (Rouhani et al., 2023) |
| AMXFP4-FP8 | 32 | <0.5 PPL, +3% task | Asymmetric, FP8 scale | (Lee et al., 2024) |
| MR-GPTQ (FP4) | 16/32 | ~1–2% | Rotated, GPTQ optimized | (Egiazarian et al., 27 Sep 2025) |
| M²XFP | 32 | ~1.6% (LLaMA 7B/8B) | Metadata-augmented | (Hu et al., 27 Jan 2026) |
| LO-BCQ | 8/32/64 | <0.2 PPL | Block-clustered codebooks | (Elangovan et al., 7 Feb 2025) |
4. Failure Modes, Anomalies, and Theoretical Limits
4.1 Scale Quantization Anomalies
Empirical and theoretical analyses reveal that decreasing block size below a model- and distribution-specific threshold can increase error when scale quantization is coarse (Fasoli et al., 26 Jan 2026). For example, block sizes below 16 with FP8 E4M3 scales induce a “perplexity inversion” in which PPL rises rather than falls as block size shrinks. The cause is the interplay between the spread of tensor distributions (specifically, low-variance/narrow tensors) and the available dynamic range of quantized scales. When a block is too narrow, quantizing its maximum to a low-precision scale can either map the entire block to zero (if the true max falls below the smallest representable scale) or leave the error on the maximum dominating the overall MSE.
Theoretical modeling attributes this to three contributors:
- Non-maximum element error (amplified by scale quantization granularity)
- Error from quantizing the block maximum itself
- All-zero block error (entire block mapped to quantized zero under some conditions)
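The all-zero contributor is easy to reproduce with a toy power-of-two scale quantizer whose exponent range is capped (the cap of −6 below loosely mirrors the smallest normal exponent of an E4M3 scale; the integer element grid is an assumption):

```python
import math

def quantize_scale_pot(scale, min_exp=-6):
    """Power-of-two scale quantizer with a limited exponent range; min_exp
    is an assumed stand-in for the smallest exponent a low-precision scale
    format can represent. Underflowing scales flush to zero."""
    if scale <= 0.0:
        return 0.0
    e = math.floor(math.log2(scale))
    return 0.0 if e < min_exp else 2.0 ** e

def dequant_block(block, elem_max=7, min_exp=-6):
    """Quantize/dequantize one block on an integer grid of +/- elem_max."""
    amax = max(abs(v) for v in block)
    s = quantize_scale_pot(amax / elem_max, min_exp)
    if s == 0.0:
        return [0.0] * len(block)     # the all-zero block failure mode
    return [s * max(-elem_max, min(elem_max, round(v / s))) for v in block]

wide   = [0.5, -0.25, 0.125, 0.9]       # typical block: reconstructed fine
narrow = [0.004, -0.003, 0.002, 0.001]  # low-variance block: scale underflows
print(dequant_block(wide))
print(dequant_block(narrow))            # [0.0, 0.0, 0.0, 0.0]
```

Smaller blocks make such low-magnitude blocks more frequent, which is why shrinking block size can raise rather than lower error once the scale format's range is exhausted.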
4.2 Asymmetry and Clamping
Microscaling suppresses outliers but induces block-level asymmetry: when mean values in small blocks drift from zero, symmetric quantizer grids waste range coverage, resulting in increased rounding error (Lee et al., 2024). Solutions include:
- Asymmetric shared scaling: Use separate scales for positive and negative subblocks (AMXFP4), substantially lowering empirical MSE and improving accuracy.
- Rotation or transform-based preconditioning: Blockwise Hadamard or optimal WUSH transforms can equalize distribution and improve quantization robustness (Chen et al., 30 Nov 2025, Egiazarian et al., 27 Sep 2025).
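The benefit of asymmetric shared scaling can be seen on a hand-picked block whose mean drifts from zero (a sketch only: a uniform signed integer grid stands in for the AMXFP4 element format):

```python
import numpy as np

def rtn(x, scale, levels=7):
    """Round-to-nearest onto a symmetric integer grid of +/- `levels` steps."""
    if scale == 0 or x.size == 0:
        return np.zeros_like(x)
    return np.clip(np.rint(x / scale), -levels, levels) * scale

def symmetric_recon(block, levels=7):
    return rtn(block, np.abs(block).max() / levels, levels)

def asymmetric_recon(block, levels=7):
    """AMXFP4-flavored sketch: independent shared scales for the positive
    and the negative elements of a block."""
    rec = np.zeros_like(block)
    pos, neg = block >= 0, block < 0
    if pos.any():
        rec[pos] = rtn(block[pos], block[pos].max() / levels, levels)
    if neg.any():
        rec[neg] = rtn(block[neg], -block[neg].min() / levels, levels)
    return rec

# A block whose mean drifts from zero: the symmetric grid wastes most of
# its negative range, so the lone negative element is rounded coarsely.
block = np.array([0.9, 0.7, 0.5, -0.1])
mse_sym = np.mean((block - symmetric_recon(block)) ** 2)
mse_asym = np.mean((block - asymmetric_recon(block)) ** 2)
print(mse_sym, mse_asym)   # asymmetric scaling yields the lower MSE here
```

Here the negative sub-scale shrinks to fit the single negative element, reconstructing it almost exactly, while the symmetric quantizer must round it with the step size dictated by the large positive maximum.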
5. Advanced Techniques: Rotation, Metadata, Outlier Handling
5.1 Micro-Rotated Quantization and Transforms
MR-GPTQ achieves significant FP4/NVFP4 accuracy gains through blockwise Hadamard transforms and format-specific scale optimization (Egiazarian et al., 27 Sep 2025). WUSH derives the provably optimal blockwise linear transform for round-to-nearest absmax quantizers, further minimizing loss (Chen et al., 30 Nov 2025).
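The mechanism can be illustrated with a Sylvester-constructed blockwise Hadamard transform (a sketch: a 4-bit integer grid stands in for FP4, and the outlier block is synthetic):

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two),
    normalized so that H is orthonormal and H.T is its inverse."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant_rtn(x, levels=7):
    """Absmax round-to-nearest onto a symmetric integer grid."""
    s = np.abs(x).max() / levels
    return np.rint(x / s) * s if s else x

def rotated_block_quant(block, H):
    """Rotation preconditioning: transform, quantize, transform back."""
    return H.T @ quant_rtn(H @ block)

# A block with a single large outlier: rotation spreads its energy across
# the block, so the absmax scale is no longer dominated by one element.
block = np.full(16, 0.3)
block[0] = 8.0
H = hadamard(16)
plain = np.mean((block - quant_rtn(block)) ** 2)
rotated = np.mean((block - rotated_block_quant(block, H)) ** 2)
print(plain, rotated)
```

Because the normalized transform is orthonormal, MSE in the transformed domain equals MSE after the inverse transform; spreading the outlier's energy shrinks the absmax scale that the other fifteen elements must share.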
5.2 Metadata-Augmented Formats
M²XFP introduces minimal block or subgroup metadata to locally refine quantized values. Subgroup-level mantissa (Sg-EM) metadata is used for weights, and top-1 element correction (Elem-EM) for activations, reducing average accuracy loss by >70% compared to MXFP4 at ~0.25 bits/element overhead (Hu et al., 27 Jan 2026).
5.3 Outlier Preservation
OPAL reserves a higher-precision representation for a small number of outliers per block (the top 4 within blocks of k = 128), with the majority quantized to 3–5 bits (Koo et al., 2024). This yields <1 PPL increase alongside area/power savings of 2.4–3.1x.
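A minimal sketch of the idea (the top-4-per-128 split follows the description above, but the grid, bit allocation, and rounding here are assumptions):

```python
import numpy as np

def outlier_preserved_quant(block, n_outliers=4, bits=4):
    """OPAL-flavored sketch: store the top-n magnitudes of a block in full
    precision and quantize the rest on an absmax integer grid."""
    idx = np.argsort(np.abs(block))[-n_outliers:]   # outlier positions
    rest = np.ones(block.size, dtype=bool)
    rest[idx] = False
    levels = 2 ** (bits - 1) - 1
    s = np.abs(block[rest]).max() / levels          # scale excludes outliers
    recon = block.copy()                            # outliers kept exactly
    recon[rest] = np.rint(block[rest] / s) * s
    return recon

rng = np.random.default_rng(0)
block = rng.standard_normal(128)
block[[3, 40, 77, 100]] = [25.0, -30.0, 18.0, 22.0]   # extreme elements
s_naive = np.abs(block).max() / 7
naive = np.rint(block / s_naive) * s_naive            # outliers set the scale
mse_naive = np.mean((block - naive) ** 2)
mse_opal = np.mean((block - outlier_preserved_quant(block)) ** 2)
print(mse_naive, mse_opal)
```

With the outliers removed from scale selection, the absmax grid tightens around the bulk of the distribution instead of being stretched by a few extreme elements.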
5.4 Clustered Codebooks (LO-BCQ)
LO-BCQ iteratively assigns blocks to clusters and updates each cluster's custom codebook with per-cluster scaling, reaching as little as 0.2 PPL loss on LLMs in the W4A4 regime (Elangovan et al., 7 Feb 2025).
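A toy version of the assignment/update loop (scalar codebooks, absmax block normalization, and a plain Lloyd refresh are stand-ins for the paper's actual codebook design):

```python
import numpy as np

def nearest(vals, book):
    """Index of the nearest codebook entry for every value."""
    return np.abs(vals[..., None] - book).argmin(axis=-1)

def block_clustered_quant(blocks, books, n_iter=5):
    """Alternate (1) assigning each absmax-normalized block to the codebook
    with least squared error, and (2) refreshing each codebook entry as the
    mean of the values currently mapped to it (scalar Lloyd update)."""
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    norm = blocks / scales
    for _ in range(n_iter):
        errs = np.stack([((norm - b[nearest(norm, b)]) ** 2).sum(axis=1)
                         for b in books])            # (clusters, n_blocks)
        assign = errs.argmin(axis=0)
        for c, book in enumerate(books):             # per-cluster update
            vals = norm[assign == c].ravel()
            if vals.size == 0:
                continue
            idx = nearest(vals, book)
            for j in range(book.size):
                if (idx == j).any():
                    book[j] = vals[idx == j].mean()
    recon = np.stack([books[assign[i]][nearest(norm[i], books[assign[i]])]
                      for i in range(len(blocks))]) * scales
    return assign, recon

rng = np.random.default_rng(1)
blocks = rng.standard_normal((64, 8))
books = [np.linspace(-1.0, 1.0, 16) * s for s in (0.25, 0.5, 0.75, 1.0)]
assign, recon = block_clustered_quant(blocks, books)
print(np.mean((blocks - recon) ** 2))
```

The per-cluster scale factors in the initial codebooks let narrow and wide blocks land in differently spaced grids, which is the core of the clustering benefit.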
6. Hardware and Software Integration
Efficient deployment of blockwise quantization/microscaling depends on tight hardware/software co-design:
- Tensor-Core Extension: Hardware must support blockwise scale fetches and mixed exponent/mantissa alignment. MX block formats align with OCP and NVIDIA Blackwell architectures (Rouhani et al., 2023, Su et al., 25 Jun 2025).
- Metadata/Hybrid Handling: M²XFP’s extra metadata processing and OPAL’s outlier buffer additions add minimal (≤10%) area and operate off the critical path (Hu et al., 27 Jan 2026, Koo et al., 2024).
- Vectorized Kernels: All leading libraries (MX-Lib, QuTLASS) provide fused block quantization, scale application, and matrix multiply kernels for NVIDIA/AMD GPUs (Sharify et al., 2024, Egiazarian et al., 27 Sep 2025).
- Energy and Throughput: Typical observed gains include 1.5–2x throughput vs. FP16/32 (INT8/FP6/FP4), with energy reduction up to 2.2x and area savings up to 3.1x (Koo et al., 2024, Hu et al., 27 Jan 2026).
7. Training Stability, Mitigation Strategies, and Practical Guidelines
Full-duration sub-8-bit training in MX formats exhibits a propensity for stochastic instability in gradient updates and irrecoverable divergences, particularly with increasing model and compute scale (Su et al., 25 Jun 2025). This is traced to multiplicative bias from quantized gradients in LayerNorm and activations. Stability is restored by:
- Using higher-precision activations/LayerNorm and quantizing only the weights in the backward pass.
- Employing forward-only quantization or switching to higher-precision mid-training as soon as error metrics (e.g., estimated operator-norm of gradient noise) breach thresholds.
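A sketch of the fall-back trigger, using relative L2 gradient quantization error as an assumed stand-in for the operator-norm noise estimate (the threshold and grids are illustrative):

```python
import numpy as np

def blockwise_rtn(g, block=32, levels=7):
    """Blockwise absmax round-to-nearest onto an integer grid
    (levels=7 mimics a 4-bit signed grid, levels=127 an 8-bit one)."""
    gb = g.reshape(-1, block)
    s = np.abs(gb).max(axis=1, keepdims=True) / levels
    s[s == 0] = 1.0
    return (np.rint(gb / s) * s).reshape(g.shape)

def grad_quant_error(grad, levels):
    """Relative L2 error introduced by quantizing the gradient."""
    return (np.linalg.norm(grad - blockwise_rtn(grad, levels=levels))
            / np.linalg.norm(grad))

THRESHOLD = 0.05          # assumed tolerance; tune per workload
rng = np.random.default_rng(0)
grad = rng.standard_normal(4096)
for levels, name in [(7, "4-bit"), (127, "8-bit")]:
    err = grad_quant_error(grad, levels)
    print(name, round(err, 4), "fall back" if err > THRESHOLD else "ok")
```

In a training loop, this check would run periodically on sampled gradient tensors, switching the affected operators to higher precision once the error metric breaches the threshold.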
Practical guidelines include:
- Use block size 32 for hardware alignment (NVIDIA Blackwell).
- For 8-bit, MXINT8 or MXFP8 with direct PTQ is sufficient.
- For 4–6 bit, combine SmoothQuant, GPTQ, and/or MR-GPTQ, especially for FP4 formats.
- Avoid shrinking block size below 16 without appropriate scale representation (use FP8 UE5M3, not just E4M3, for FP4) (Fasoli et al., 26 Jan 2026).
- For critical workload stability in training, maintain BF16 or FP32 for activations and LayerNorm where possible. Monitor gradient error and adapt quantization schedules accordingly.
References:
(Rouhani et al., 2023, Fasoli et al., 26 Jan 2026, Sharify et al., 2024, Koo et al., 2024, Hu et al., 27 Jan 2026, Su et al., 25 Jun 2025, Lee et al., 2024, Egiazarian et al., 27 Sep 2025, Chen et al., 30 Nov 2025, Elangovan et al., 7 Feb 2025).