
Is Finer Better? The Limits of Microscaling Formats in Large Language Models

Published 26 Jan 2026 in cs.LG, cs.AR, and cs.CL (arXiv:2601.19026v1)

Abstract: Microscaling data formats leverage per-block tensor quantization to enable aggressive model compression with limited loss in accuracy. Unlocking their potential for efficient training and inference necessitates hardware-friendly implementations that handle matrix multiplications in a native format and adopt efficient error-mitigation strategies. Herein, we report the emergence of a surprising behavior associated with microscaling quantization, whereby the output of a quantized model degrades as block size is decreased below a given threshold. This behavior clashes with the expectation that a smaller block size should allow for a better representation of the tensor elements. We investigate this phenomenon both experimentally and theoretically, decoupling the sources of quantization error behind it. Experimentally, we analyze the distributions of several LLMs and identify the conditions driving the anomalous behavior. Theoretically, we lay down a framework showing remarkable agreement with experimental data from pretrained model distributions and ideal ones. Overall, we show that the anomaly is driven by the interplay between narrow tensor distributions and the limited dynamic range of the quantized scales. Based on these insights, we propose the use of FP8 unsigned E5M3 (UE5M3) as a novel hardware-friendly format for the scales in FP4 microscaling data types. We demonstrate that UE5M3 achieves comparable performance to the conventional FP8 unsigned E4M3 scales while obviating the need for global scaling operations on weights and activations.

Summary

  • The paper demonstrates that reducing block sizes below a certain threshold can increase quantization error, challenging conventional assumptions.
  • It develops a probabilistic framework that decouples per-block and global errors, validated by experiments on various LLM benchmarks.
  • The study proposes a UE5M3 format that repurposes the unused sign bit in FP8 scaling to extend dynamic range and restore error monotonicity.

Limits and Error Dynamics of Finer-Grained Microscaling Quantization in LLMs

Problem Statement and Motivation

The paper "Is Finer Better? The Limits of Microscaling Formats in LLMs" (2601.19026) systematically investigates the error behavior and practical limitations of microscaling quantization in LLMs, as the block size of quantization is reduced and scale precision is lowered. By decoupling the impact of low-bit quantization on model quality into per-block and global effects, the authors identify a counter-intuitive regime: block-wise quantization error increases when block sizes are reduced below a model- and format-dependent threshold. This challenges the prevailing assumption that smaller blocks decrease quantization error due to finer locality.

Microscaling formats, widely used in AI hardware for their efficiency, are expected to benefit from smaller block sizes, which should enable lower error at reduced precision. Commercial accelerators have recently advanced aggressive quantization strategies, supporting domain-specific low-precision floating point (e.g., FP4, FP8) and block-based scaling factors. However, as the distributions of model weights (and activations) become increasingly narrow, the fundamental error tradeoffs between block size, scaling-factor precision, and the distributional properties of tensors enter uncharted territory. The paper's main results directly address this gap, providing experimental and theoretical analysis of the observed anomalies.
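As a concrete reference point, the block-wise scheme can be sketched in a few lines: each block of N elements shares one scale derived from the block maximum, and each element is rounded onto a low-precision grid. The E2M1 magnitude grid below mirrors the FP4 element format common in MX-style formats, but the exact grid and the use of an unquantized scale are simplifying assumptions for illustration, not the paper's precise pipeline.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 element (the grid used by
# MX-style FP4 formats; treating it as the element format is an assumption).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, grid=FP4_GRID):
    """Quantize one block: share a scale set by the block max, then
    round each magnitude to the nearest grid point."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x)
    scale = amax / grid[-1]          # exact (unquantized) scale, for simplicity
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - grid[None, :]).argmin(axis=1)
    return np.sign(x) * grid[idx] * scale

def quantize_microscaled(x, block_size=16):
    """Quantize a 1-D tensor block by block."""
    blocks = x.reshape(-1, block_size)
    return np.vstack([quantize_block(b) for b in blocks]).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
mse = float(np.mean((w - quantize_microscaled(w)) ** 2))
```

With an exact scale, shrinking the block size only tightens how well the scale fits its block; the anomaly studied in the paper appears once the scale itself must be quantized.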

Empirical Observation and Quantization Anomaly

In empirical evaluations across LLMs (including several 8–9B parameter transformer models and variants with distinct distributional properties), the expected decrease in quantization error and perplexity with reduced block size is validated for high-precision (BF16) block scales. However, when employing hardware-targeted FP8 scaling (notably unsigned E4M3, or UE4M3), a reversal is observed: decreasing the block size below a threshold increases the perplexity gap, sometimes termed "perplexity inversion".

Figure 1: FP4 microscaling quantization with BF16 scales shows monotonic perplexity improvement as block size decreases, while the UE4M3 scale induces non-monotonicity with substantial inversion at small block sizes.

Detailed per-block MSE analysis shows that blocks within certain weight tensors (those with lower σ) exhibit higher errors at block size 8 versus 16, not explained by classical quantization error models. This non-monotonicity and model-dependent variation is absent when high-precision (unquantized, e.g., BF16) scales are utilized—it is fundamentally a product of scale quantization.

Figure 2: (a) Per-block MSE comparison for block sizes 8 vs 16 shows a prevalence of higher error with smaller blocks. (b, c) Per-tensor MSE versus weight-tensor standard deviation σ: FP8 UE4M3 scales exhibit the error crossover, while BF16 scales do not.

Theoretical Framework and Error Source Decoupling

The authors develop a probabilistic framework for quantization error in block-wise formats, covering both floating-point and integer element types, with exact and quantized scaling factors. The framework rests on evaluating the MSE as a function of:

  • Distributional width of tensor blocks (σ)
  • Block size N
  • Scaling-factor quantization (precision and dynamic range)
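A Monte Carlo sketch of this framework makes the interplay of σ, N, and scale precision concrete. It assumes Gaussian blocks, an FP4 E2M1 element grid, and an E4M3-style unsigned scale grid (ignoring the encodings the real OCP format reserves for NaN, which slightly overstates its maximum); it is an illustrative estimator, not the paper's exact derivation.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes (assumed)

def fp8_scale_grid(exp_bits=4, man_bits=3):
    """All nonnegative values of an unsigned FP8-like format.
    Simplification: no encodings reserved for NaN/Inf."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = [m / 2 ** man_bits * 2.0 ** (1 - bias) for m in range(2 ** man_bits)]  # subnormals, incl. 0
    vals += [(1 + m / 2 ** man_bits) * 2.0 ** (e - bias)
             for e in range(1, 2 ** exp_bits) for m in range(2 ** man_bits)]
    return np.unique(vals)

def block_mse(sigma, block_size, scale_grid=None, n_blocks=5000, seed=0):
    """MSE of FP4 microscaling over Gaussian blocks, with either
    exact scales (scale_grid=None) or scales rounded onto a grid."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(n_blocks, block_size))
    scale = np.abs(x).max(axis=1) / FP4_GRID[-1]
    if scale_grid is not None:
        idx = np.abs(scale_grid[None, :] - scale[:, None]).argmin(axis=1)
        scale = scale_grid[idx]                 # may round to 0: block is lost
    safe = np.where(scale == 0.0, 1.0, scale)   # avoid division by zero
    mags = np.abs(x) / safe[:, None]
    xq = np.sign(x) * FP4_GRID[np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)]
    xq *= scale[:, None]
    return float(np.mean((x - xq) ** 2))

grid = fp8_scale_grid()               # UE4M3-like scale grid
mse_exact = block_mse(2e-3, 8)        # unquantized scales
mse_ue4m3 = block_mse(2e-3, 8, grid)  # quantized scales
```

Sweeping σ and the block size with this estimator reproduces the qualitative picture the paper reports: for wide distributions the quantized-scale error tracks the exact-scale error, while for narrow distributions it detaches and can grow as blocks shrink.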

Crucially, the analysis distinguishes between:

(1) error for non-max elements (x_i ≠ x_max);

(2) error for the scale-setting element (x_i = x_max), which is zero only with an infinite-precision scale;

(3) error when the maximum is so small that all elements round to zero due to the limited dynamic range of the scale.

Across models and synthetic data, the error behavior is dominated by (1) for wide distributions, but as σ shrinks, (2) and (3) become significant, especially at smaller block sizes, where the probability of a small x_max increases. The result is a crossover point where reducing the block size actually increases error for narrow distributions.
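The block-size dependence of source (3) follows directly from order statistics: for a block of N i.i.d. zero-mean Gaussian elements, the probability that the block maximum magnitude stays below a threshold t is (2Φ(t/σ) − 1)^N, which grows as N shrinks. A small stdlib sketch (the threshold value is illustrative, standing in for the smallest block maximum that still yields a nonzero quantized scale):

```python
import math

def p_small_max(sigma, block_size, threshold):
    """P(max_i |x_i| < threshold) for block_size i.i.d. N(0, sigma^2) draws."""
    # P(|x| < t) = 2*Phi(t/sigma) - 1; the block max is below t
    # iff every element is, hence the power of block_size.
    phi = 0.5 * (1.0 + math.erf(threshold / (sigma * math.sqrt(2.0))))
    return (2.0 * phi - 1.0) ** block_size

p8 = p_small_max(sigma=1.0, block_size=8, threshold=2.0)
p16 = p_small_max(sigma=1.0, block_size=16, threshold=2.0)
# Smaller blocks are more likely to draw an unluckily small maximum,
# so the round-to-zero failure mode hits them more often.
```

Halving the block size takes the square root of this probability, which is why the anomaly sharpens so quickly at small N for narrow distributions.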

Figure 3: (a) MSE-σ dependency of pretrained model weights matches theoretical predictions based on the normal distribution. (c) Decomposition shows which error sources dominate across σ.

Figure 4: Comparison of block-wise MSE for block sizes 8 vs 16; majority of points above diagonal reflects higher error for smaller blocks.

Implications for Quantization Strategy and Formats

The identification of non-monotonic scaling error has both theoretical and hardware implications:

  • There exists a precision/dynamic range regime for which block size reduction is not universally beneficial, and the choice of scaling format is pivotal.
  • Models with weight and activation tensors exhibiting narrow distributions are inherently more vulnerable to this anomaly. Thus, distribution-aware quantization policies (possibly combining block structure with global information) are essential.
  • Per-tensor scaling can restore the monotonic error decrease with block size, but at the cost of hardware complexity and susceptibility to outliers.

    Figure 5: Across LLMs, the proposed UE5M3 scaling achieves perplexity on par with per-tensor scaled UE4M3, without the hardware or runtime overhead.

Hardware-Friendly Solution: Extended Scale Dynamic Range

Based on the theoretical analysis, the authors propose to repurpose the unused sign bit in FP8 UE4M3 scaling to create an unsigned E5M3 (UE5M3) format, widening the exponent field from four to five bits and thus drastically increasing the representable dynamic range. This modification allows blocks, including those with very small x_max, to select finer scales, mitigating both the rounding-to-zero and the non-max element error sources.
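To see why a single extra exponent bit matters so much, compare the representable range of the two scale formats under a generic IEEE-like layout (bias 2^(e−1)−1, with subnormals and no encodings reserved for specials, a simplification that slightly overstates the E4M3 maximum relative to the common 448 convention):

```python
def unsigned_fp_range(exp_bits, man_bits):
    """(max normal, min subnormal) magnitude of an unsigned float format
    with IEEE-like bias and subnormals; no special encodings reserved."""
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -man_bits) * 2.0 ** (2 ** exp_bits - 1 - bias)
    min_subnormal = 2.0 ** -man_bits * 2.0 ** (1 - bias)
    return max_normal, min_subnormal

ue4m3_max, ue4m3_min = unsigned_fp_range(4, 3)  # E4M3-style scale
ue5m3_max, ue5m3_min = unsigned_fp_range(5, 3)  # sign bit repurposed as exponent bit
# The max/min dynamic range widens by a factor of 2**16 = 65536:
# the extra exponent bit extends the reachable exponents at both ends.
```

It is this extra headroom at the bottom of the range that lets blocks with tiny maxima keep a nonzero scale instead of rounding to zero.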

UE5M3 can be implemented with minimal hardware changes (increased exponent adder width, no added mantissa complexity), as demonstrated through synthesis of a PE array at advanced process nodes. Compared to per-tensor scaling, UE5M3 preserves error monotonicity without the need for global scaling, avoids outlier vulnerability, and yields strong perplexity/accuracy parity with the BF16 baseline across several benchmarks.

Figure 6: (a) Hardware block diagram for integrating UE5M3 scaling. (b) Perplexity vs block size, showing elimination of inversion with UE5M3.

Future Directions

The results suggest multiple avenues for further research:

  • Generalization to other ultra-low precision formats, including INT4, FP6, and dynamic/learned scaling variants.
  • Adaptive block size or block partitioning based on distributional statistics at quantization time.
  • Layer- or tensor-type specific quantization strategies, considering their empirical tendency to exhibit narrow or heavy-tailed distributions.
  • Exploration of stochastically or deterministically mixed scaling policies, blending UE5M3 with classic per-channel or per-tensor scaling in deployment.

Conclusion

This paper provides a rigorous explanation for the observed quantization anomaly in block-wise microscaling, precisely connecting the phenomenon to scale quantization and tensor distribution width. The theoretical framework presented closely matches empirical model behavior and enables principled decisions about format and block size in next-generation LLM deployment. The introduction and hardware proof-of-concept of UE5M3 scaling offer a practical solution, reconciling both efficiency and accuracy requirements for sub-8-bit quantization in large-scale inference and training accelerators.
