
Dual-Smoothed Fine-Grained Quantization

Updated 30 January 2026
  • Dual-Smoothed Fine-Grained Quantization (DSFQ) is a PTQ framework that applies global smoothing and local channel equalization to mitigate heavy-tail effects in large neural models.
  • It reduces quantization error by employing Hadamard rotations or percentile clipping to disperse extreme values, enabling efficient inference on commodity hardware.
  • Empirical evaluations on VGGTs and LLMs demonstrate that DSFQ achieves near full-precision accuracy with significant improvements in memory usage and speed.

Dual-Smoothed Fine-Grained Quantization (DSFQ) is an advanced post-training quantization (PTQ) framework designed to bridge the gap between accurate low-bit quantization and practical hardware-efficient inference in extremely large neural architectures, notably billion-parameter Visual Geometry Grounded Transformers (VGGTs) and multi-billion-parameter LLMs. The essential insight of DSFQ is to address the heavy-tailed activation and weight distributions that impede conventional symmetric quantization, through a two-stage combination of global distribution smoothing and local channel equalization. This enables both high fidelity and efficient deployment on commodity hardware without invasive model modifications (Feng et al., 25 Sep 2025, Zhang et al., 2023).

1. Motivation and Scope

DSFQ targets architectural and data-specific quantization obstacles in high-capacity models. In VGGTs, salient special tokens (camera/registration) generate orders-of-magnitude outlier activations, producing heavy-tailed statistics which, under standard quantization, force nearly all lesser values to collapse into quantization bins dictated by these outliers. Multi-view calibration further exacerbates inter-sample dynamic range variance. Similarly, transformer-based LLMs suffer from activation and weight channels that sporadically exhibit extreme values, penalizing coarse quantization methods.

Empirical analysis demonstrates that:

  • Heavy-tailed activations cause global quantization scales to be dictated by a minority of values, degrading most of the tensor’s representation (Feng et al., 25 Sep 2025).
  • Channel-wise and token-wise dynamic range vary after even global preprocessing, invalidating naive post-rotation quantization (Feng et al., 25 Sep 2025).
  • Fine-grained quantization schemes offer superior accuracy but break efficient inference (e.g., integer GEMM) due to incompatible scale decomposition for hardware execution (Zhang et al., 2023).

DSFQ addresses these by combining global smoothing operations (Hadamard rotation or percentile clipping) with local channel (or group) rescaling, systematically removing the bottleneck imposed by distribution outliers.

2. Global Smoothing Transformations

DSFQ’s first-stage smoothing aims to "Gaussian-ize" the input statistics, distributing outlier magnitude across all coordinates and mitigating heavy tails. For VGGTs, this is accomplished by a pre-global Hadamard rotation:

  • Let $X \in \mathbb{R}^{n \times d_{in}}$ and $W \in \mathbb{R}^{d_{out} \times d_{in}}$ be the activations and weights.
  • Apply a Hadamard matrix $H \in \{\pm 1\}^{d_{in} \times d_{in}}$, giving $X' = XH$ and $W' = WH$.
  • This transformation, justified via the Central Limit Theorem, pushes the marginal distributions toward Gaussianity, decreasing kurtosis and dispersing isolated outlier coordinates across the full channel dimension.
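As a quick numerical check of this effect, the sketch below (NumPy; the Sylvester construction is an illustrative stand-in for whichever Hadamard generator an implementation uses) rotates a heavy-tailed activation matrix and shows that both its peak magnitude and its kurtosis shrink:

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
d = 256
# Heavy-tailed activations: mostly small values plus a few extreme outlier channels
X = rng.normal(0, 1, size=(512, d))
X[:, :4] *= 100.0                      # outlier channels dominate the global max

H = hadamard(d) / np.sqrt(d)           # scaled to be orthonormal, so norms are preserved
X_rot = X @ H

def excess_kurtosis(a):
    a = a.ravel()
    return ((a - a.mean())**4).mean() / a.var()**2 - 3.0

print("max |x| before:", np.abs(X).max(), " after:", np.abs(X_rot).max())
print("kurtosis before:", excess_kurtosis(X), " after:", excess_kurtosis(X_rot))
```

Because the rotation is orthonormal, it is exactly invertible (`X_rot @ H.T` recovers `X`), which is what allows it to be fused into adjacent layers at inference.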

For LLMs, group-wise percentile clipping performs a similar role:

  • For activation channel $j$, compute $z_j = \max_i |X_{j,i}|$.
  • Set the clipping threshold $T$ as the 99.5-th percentile, and scale by $k_j = \max(1, z_j/T)$.
  • Smooth the data as $\tilde{X}_{j,:} = X_{j,:} / k_j$, suppressing extreme outlier channels to tighten the overall dynamic range.

These mechanisms suppress the hardware- and quantizer-dominating effects of rare, extreme values.
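A minimal sketch of the clipping rule above (NumPy, channels-first layout; it assumes $T$ is taken as the 99.5-th percentile of the per-channel peaks $z_j$, and in a real pipeline the factors $k_j$ would be compensated in the following layer's weights):

```python
import numpy as np

def percentile_clip_smooth(X: np.ndarray, pct: float = 99.5):
    """Percentile clipping: rescale channels whose peak magnitude exceeds
    the pct-th percentile of all channel peaks.

    X is laid out channels-first, so X[j, i] is channel j, token i, matching
    the z_j = max_i |X_{j,i}| convention in the text."""
    z = np.abs(X).max(axis=1)            # per-channel peak magnitude z_j
    T = np.percentile(z, pct)            # clipping threshold T (assumed convention)
    k = np.maximum(1.0, z / T)           # k_j = max(1, z_j / T)
    return X / k[:, None], k             # smoothed X̃_{j,:} = X_{j,:} / k_j

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(1024, 128))   # 1024 channels, 128 tokens
X[:3] *= 50.0                            # three extreme outlier channels
X_s, k = percentile_clip_smooth(X)
print("dynamic range before:", np.abs(X).max(), " after:", np.abs(X_s).max())
```

After smoothing, the global dynamic range collapses from the outlier scale down to roughly the threshold $T$, so the quantization grid is no longer dictated by a handful of channels.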

3. Local Channel or Group Smoothing

After global smoothing, residual channel or group dynamic ranges are equalized by channel-wise (VGGT) or group-wise (LLM) scaling:

  • For VGGTs, per-channel scaling factors are computed as $\hat{c}_i = A_i^{\alpha} B_i^{1-\alpha}$ with $A_i = \max_j |X'_{j,i}|$, $B_i = \max_k |W'_{k,i}|$, and $\alpha$ typically set to $0.5$ (Feng et al., 25 Sep 2025).
  • The rescaled tensors are $X'' = X' \operatorname{Diag}(\hat{c})^{-1}$ and $W'' = \operatorname{Diag}(\hat{c})\, W'$.
  • For LLMs, scales are computed per group and per output channel in a two-phase grid search, minimizing reconstruction error and enforcing hardware-imposed bounds (Zhang et al., 2023).

This localized normalization ensures every channel/group is quantized with suitably matched scale parameters, minimizing the worst-case quantization error and enabling fine-grained quantization.
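The per-channel rescaling can be sketched as follows (NumPy; written for the $XW^{\top}$ layer convention, under which the weight scaling lands on the input dimension as $W'' = W'\operatorname{Diag}(\hat{c})$ rather than on the left; the key property either way is that the layer output is unchanged):

```python
import numpy as np

def equalize_channels(Xp, Wp, alpha=0.5):
    """Per-channel smoothing after the global rotation: migrate dynamic range
    between activations X' (n x d_in) and weights W' (d_out x d_in)."""
    A = np.abs(Xp).max(axis=0)           # A_i = max_j |X'_{j,i}|
    B = np.abs(Wp).max(axis=0)           # B_i = max_k |W'_{k,i}|
    c = A**alpha * B**(1.0 - alpha)      # ĉ_i = A_i^α · B_i^{1-α}
    # X'' = X' diag(ĉ)^{-1};  W'' scales the input dimension by diag(ĉ)
    return Xp / c, Wp * c, c

rng = np.random.default_rng(2)
Xp = rng.normal(0, 1, (64, 32)) * rng.uniform(0.1, 10, 32)  # uneven channel ranges
Wp = rng.normal(0, 1, (16, 32))
Xs, Ws, c = equalize_channels(Xp, Wp)
spread = lambda M: np.abs(M).max(axis=0).max() / np.abs(M).max(axis=0).min()
print("activation channel-range spread before:", spread(Xp), " after:", spread(Xs))
```

Since $X'' W''^{\top} = X' W'^{\top}$ exactly, the rescaling is free at inference once folded into the weights, while the per-channel dynamic ranges are much better matched to a shared quantization grid.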

4. Integrated DSFQ Quantization Pipelines

VGGT (Visual Geometry Grounded Transformer) Flow

Pseudocode for each layer:

Input: full-precision weights W ∈ ℝ^{d_out×d_in}, calibration activations {X^ℓ}_{ℓ ∈ calib}
Hyper-params: α ∈ [0, 1]
1. Draw/fix Hadamard matrix H ∈ {±1}^{d_in×d_in}
2. W ← W H
3. For each X^ℓ: X^ℓ ← X^ℓ H
4. For i in 1…d_in:
    A_i ← max_{ℓ,j} |X^ℓ_{j,i}|
    B_i ← max_{k} |W_{k,i}|
    ĉ_i ← A_i^α · B_i^{1−α}
5. X^ℓ ← X^ℓ · diag(ĉ)^{-1}
   W  ← diag(ĉ) · W
6. Quantize W per output channel; quantize X per token
7. Fuse H and diag(ĉ) with adjacent layers at inference
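Step 6 can be sketched with a plain symmetric round-to-nearest quantizer (NumPy; illustrative only, not the paper's exact calibrated quantizer). Per-output-channel weight quantization reduces over the input dimension; per-token activation quantization reduces over the channel dimension:

```python
import numpy as np

def quantize_symmetric(T, axis, bits=8):
    """Symmetric per-axis quantization: one scale per slice along `axis`.
    The int8 container assumes bits <= 8."""
    qmax = 2**(bits - 1) - 1
    scale = np.abs(T).max(axis=axis, keepdims=True) / qmax   # one scale per slice
    scale = np.where(scale == 0, 1.0, scale)                 # guard all-zero slices
    q = np.clip(np.round(T / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(3)
W = rng.normal(0, 1, (16, 64))
Wq, s = quantize_symmetric(W, axis=1)     # one scale per output channel
W_hat = Wq.astype(np.float32) * s         # dequantized reconstruction
print("max abs reconstruction error:", np.abs(W - W_hat).max())
```

With round-to-nearest, the elementwise error is bounded by half a quantization step of the relevant channel's scale, which is exactly the quantity the preceding smoothing stages shrink.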

LLM (Dual-Grained Quantization) Flow

Key steps:

  • Fine-grained INT4 quantization: group input channels, minimize error for scale/zero-point via grid search.
  • Dequantize to INT8 at group level using per-group scales, enabling hardware-efficient INT8 GEMM.
  • Activation smoothing by percentile clipping and channel-wise scale division.
  • Two-phase grid search for group- and channel-wise scales, parallelized across groups/channels.

Both pipelines allow hardware-friendly integration, matching fine-grained accuracy to GEMM-friendly coarse representation.
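The interplay between fine INT4 groups and a coarse GEMM-friendly INT8 representation can be illustrated for a single weight row (NumPy). This is a simplified sketch under assumed conventions: per-group scales are coerced onto a shared per-row grid via small integer multipliers, so the row re-expands into one INT8 vector with a single floating-point scale. The actual DGQ scale decomposition and its two-phase grid search differ in detail:

```python
import numpy as np

def dual_grained_row(w, group=32):
    """Sketch: fine INT4 group quantization that dequantizes into a single
    INT8 vector with one per-row FP scale (so a plain INT8 GEMM applies)."""
    g = w.reshape(-1, group)
    s_fine = np.abs(g).max(axis=1, keepdims=True) / 7.0   # per-group INT4 scale
    s_fine = np.where(s_fine == 0, 1.0, s_fine)
    q4 = np.clip(np.round(g / s_fine), -8, 7)             # INT4 codes
    s_coarse = s_fine.max() / 15.0                        # shared per-row scale
    m = np.clip(np.round(s_fine / s_coarse), 1, 15)       # integer group multipliers
    q8 = (q4 * m).astype(np.int8)                         # |q4*m| <= 8*15 = 120 < 128
    return q8.ravel(), s_coarse

rng = np.random.default_rng(4)
w = rng.normal(0, 1, 256)
q8, s = dual_grained_row(w)
w_hat = q8.astype(np.float32) * s
print("max relative error:", np.abs(w - w_hat).max() / np.abs(w).max())
```

The design point this sketch captures is the trade-off named in the text: fine-grained (per-group) scales set the accuracy, while the coarse single-scale INT8 form is what existing integer GEMM kernels can consume.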

5. Theoretical Analysis of Quantization Error

DSFQ provides a provable reduction in quantization error via the $\mu$-coherence bound. For a vector $x \in \mathbb{R}^g$,

$$\max_j |x_j| \leq \mu\,\frac{\|x\|_2}{\sqrt{g}}, \qquad \|x - \hat{x}\|_2 = O\!\left(\mu\,\frac{\|x\|_2}{\sqrt{g}}\right).$$

Global Hadamard rotation reduces $\mu$ by a factor of $\sqrt{d_{in}}$; local channel smoothing yields roughly equal $\max |x^{(i)}|$ across channels. Thus DSFQ achieves lower error bounds than naive PTQ: the smoothed distribution makes quantization robust to large but rare outliers and to cross-channel variance (Feng et al., 25 Sep 2025).
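To make the $\sqrt{d_{in}}$ claim concrete, define $\mu(x) = \sqrt{g}\,\max_j |x_j| / \|x\|_2$ and consider the worst case of a 1-sparse vector (a standard illustration, not taken from the cited papers):

```latex
% All mass in one coordinate: maximal incoherence
x = \bigl(\|x\|_2,\, 0,\, \dots,\, 0\bigr)^{\top}
  \quad\Longrightarrow\quad \mu(x) = \sqrt{g}.
% After an orthonormal Hadamard rotation, every entry has equal magnitude
\Bigl(\tfrac{1}{\sqrt{g}}\,H x\Bigr)_j = \pm\,\tfrac{\|x\|_2}{\sqrt{g}}
  \quad\Longrightarrow\quad \mu\!\left(\tfrac{1}{\sqrt{g}}\,H x\right) = 1.
```

The coherence factor, and with it the error bound above, therefore drops by exactly $\sqrt{g}$, which is $\sqrt{d_{in}}$ when the rotation spans the full channel dimension.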

A plausible implication is that similar theoretical guarantees could generalize to any architecture where activation or weight distributions suffer from heavy-tailed statistics or channel/group variance.

6. Implementation Characteristics and Hardware Integration

DSFQ’s practical design includes:

  • Selection of $\alpha = 0.5$ for scale blending (Feng et al., 25 Sep 2025).
  • Calibration of scales and quantization step sizes via gradient descent on a small sample set (≈40 samples for VGGT, 1000 tokens for LLMs).
  • Block-wise quantization for deep networks, with calibration error minimization.
  • Outlier-resistant calibration sets provided by Noise-Filtered Diverse Sampling (for VGGTs), enhancing cross-view stability.
  • In LLMs, conversion of all group-wise INT4 quantized weights into a single INT8 matrix allows use of existing optimized kernels such as CUTLASS INT8 GEMM (Zhang et al., 2023).

The smoothing and scale factors are fused at runtime, incurring essentially zero inference overhead and ensuring rapid inference speeds commensurate with hardware expectations. DSFQ thus overcomes the hardware inefficiency typical of prior fine-grained quantization deployments.

7. Empirical Evaluation and Comparative Ablations

DSFQ has been extensively benchmarked:

VGGTs (Camera Pose, Point-Map)

| Setting       | Metric             | Value |
|---------------|--------------------|-------|
| 16/16 FP      | AUC@30 (Co3Dv2/20) | 90.0  |
| 8/8 Naive PTQ | AUC@30             | ≈89.1 |
| 8/8 DSFQ      | AUC@30             | 89.6  |
| 4/4 Naive PTQ | AUC@30             | ≈81.6 |
| 4/4 DSFQ      | AUC@30             | 88.2  |

On DTU point map estimation:

  • DSFQ: Accuracy = 1.282 vs. 1.185 for full precision (a distance-based metric, so lower is better).

Memory and speed gains:

  • VGGT-1B, 8-bit: ≈1.93× memory reduction, 2.17× speedup.
  • VGGT-1B, 4-bit: ≈3.65× memory reduction, 2.49× speedup.

Ablation studies:

  • Without smoothing: AUC@30 = 76.9
  • Hadamard only: 83.6
  • Scaling only: 81.9
  • Both (DSFQ): 86.9
  • Dynamic token-wise granularity: 86.9 (highest).

LLMs (OPT, LLaMA; 125M–176B)

| Setting          | Metric                 | Value      |
|------------------|------------------------|------------|
| A8W4 DGQ         | PPL increase over FP16 | +0.17–0.25 |
| A8W4 SmoothQuant | Relative accuracy      | lower      |
| A8W4 vs. A16W4   | End-to-end speedup     | 3.24×      |
| A8W4 vs. A16W4   | Memory reduction       | 1.12×      |

Dynamic DSFQ is within 0.5% of FP16 zero-shot accuracy; static DSFQ within 1–2%.

In summary, DSFQ achieves near-full-precision accuracy at extreme compression levels (W4A4, A8W4), with negligible overhead, surpassing prior generic quantization methods in both predictive fidelity and hardware efficiency (Feng et al., 25 Sep 2025, Zhang et al., 2023).
