Dual-Smoothed Fine-Grained Quantization
- Dual-Smoothed Fine-Grained Quantization (DSFQ) is a PTQ framework that applies global smoothing and local channel equalization to mitigate heavy-tail effects in large neural models.
- It reduces quantization error by employing Hadamard rotations or percentile clipping to disperse extreme values, enabling efficient inference on commodity hardware.
- Empirical evaluations on VGGTs and LLMs demonstrate that DSFQ achieves near full-precision accuracy with significant improvements in memory usage and speed.
Dual-Smoothed Fine-Grained Quantization (DSFQ) is an advanced post-training quantization (PTQ) framework designed to bridge the gap between accurate low-bit quantization and practical hardware-efficient inference in extremely large neural architectures, notably billion-parameter Visual Geometry Grounded Transformers (VGGTs) and multi-billion-parameter LLMs. The essential insight of DSFQ is to address the heavy-tailed activation and weight distributions that impede conventional symmetric quantization, through a two-stage combination of global distribution smoothing and local channel equalization. This enables both high fidelity and efficient deployment on commodity hardware without invasive model modifications (Feng et al., 25 Sep 2025, Zhang et al., 2023).
1. Motivation and Scope
DSFQ targets architectural and data-specific quantization obstacles in high-capacity models. In VGGTs, salient special tokens (camera/registration) generate orders-of-magnitude outlier activations, producing heavy-tailed statistics which, under standard quantization, force nearly all lesser values to collapse into quantization bins dictated by these outliers. Multi-view calibration further exacerbates inter-sample dynamic range variance. Similarly, transformer-based LLMs suffer from activation and weight channels that sporadically exhibit extreme values, penalizing coarse quantization methods.
Empirical analysis demonstrates that:
- Heavy-tailed activations cause global quantization scales to be dictated by a minority of values, degrading most of the tensor’s representation (Feng et al., 25 Sep 2025).
- Channel-wise and token-wise dynamic range vary after even global preprocessing, invalidating naive post-rotation quantization (Feng et al., 25 Sep 2025).
- Fine-grained quantization schemes offer superior accuracy but break efficient inference (e.g., integer GEMM) due to incompatible scale decomposition for hardware execution (Zhang et al., 2023).
DSFQ addresses these by combining global smoothing operations (Hadamard rotation or percentile clipping) with local channel (or group) rescaling, systematically removing the bottleneck imposed by distribution outliers.
2. Global Smoothing Transformations
DSFQ’s first-stage smoothing aims to "Gaussian-ize" the input statistics, distributing outlier magnitude across all coordinates and mitigating heavy tails. For VGGTs, this is accomplished by a pre-global Hadamard rotation:
- Let $X \in \mathbb{R}^{n \times d_{\text{in}}}$ be activations and $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ be weights.
- Apply a Hadamard matrix $H \in \{\pm 1\}^{d_{\text{in}} \times d_{\text{in}}}$ (normalized by $1/\sqrt{d_{\text{in}}}$ so that it is orthogonal), giving $X' = XH$ and $W' = WH$.
- This transformation, justified via the Central Limit Theorem, drives the marginal distributions toward Gaussianity, decreasing kurtosis and dispersing singular outlier coordinates across the full channel dimension.
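A minimal numpy sketch illustrates the dispersal effect, assuming a Sylvester-constructed Hadamard matrix (which requires a power-of-two dimension): rotating a vector with one extreme coordinate preserves its L2 norm but sharply reduces its infinity norm.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal, so L2 norms are preserved

rng = np.random.default_rng(0)
d = 256
x = rng.normal(size=d)
x[7] = 80.0                  # inject one extreme outlier coordinate

x_rot = hadamard(d) @ x      # global smoothing step

# the outlier's magnitude is now spread across all coordinates,
# so a global quantization scale is no longer dominated by it
print(np.max(np.abs(x)), np.max(np.abs(x_rot)))
```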
For LLMs, group-wise percentile clipping performs a similar role:
- For activation channel $i$, compute the per-channel maximum $m_i = \max_j |X_{j,i}|$.
- Set the clipping threshold $\tau$ at the 99.5-th percentile of the activation magnitudes, and define the scale $s_i = \max(m_i / \tau,\, 1)$.
- Data is smoothed by $\tilde X_{:,i} = X_{:,i} / s_i$, suppressing extreme outlier channels to tighten the overall dynamic range.
These mechanisms suppress the hardware- and quantizer-dominating effects of rare, extreme values.
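The clipping scheme can be sketched in numpy as follows; taking the 99.5-th percentile over all activation magnitudes is an assumption of this illustration, and the deployed method may place the percentile differently.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(512, 64))   # (tokens, channels)
X[:, 3] *= 40.0                  # one heavy-tailed outlier channel

m = np.max(np.abs(X), axis=0)            # per-channel absolute maxima
tau = np.percentile(np.abs(X), 99.5)     # 99.5-th percentile threshold
s = np.maximum(m / tau, 1.0)             # only outlier channels are scaled down
X_smooth = X / s

# the global dynamic range collapses toward tau
print(np.max(np.abs(X)), tau, np.max(np.abs(X_smooth)))
```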
3. Local Channel or Group Smoothing
After global smoothing, residual channel or group dynamic ranges are equalized by channel-wise (VGGT) or group-wise (LLM) scaling:
- For VGGTs, per-channel scaling factors are computed as $\hat c_i = A_i^{\alpha} \cdot B_i^{1-\alpha}$, with $A_i = \max_{\ell,j} |X'^{\ell}_{j,i}|$, $B_i = \max_k |W'_{k,i}|$, and $\alpha$ typically set to $0.5$ (Feng et al., 25 Sep 2025).
- The rescaled versions are $X'' = X' \cdot \operatorname{diag}(\hat c)^{-1}$ and $W'' = W' \cdot \operatorname{diag}(\hat c)$.
- For LLMs, scales are computed per group and per output channel in a two-phase grid search, minimizing reconstruction error and enforcing hardware-imposed bounds (Zhang et al., 2023).
This localized normalization ensures every channel/group is quantized with suitably matched scale parameters, minimizing the worst-case quantization error and enabling fine-grained quantization.
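A short numpy sketch of the per-channel rescaling; the key invariant is that the layer's mathematical output is unchanged, since the activation scaling is exactly cancelled by the inverse scaling absorbed into the weights.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_in, d_out = 256, 64, 32
# activations with badly mismatched per-channel dynamic ranges
X = rng.normal(size=(n, d_in)) * rng.uniform(0.1, 10.0, size=d_in)
W = rng.normal(size=(d_out, d_in))

alpha = 0.5
A = np.max(np.abs(X), axis=0)        # per-channel activation range
B = np.max(np.abs(W), axis=0)        # per-channel weight range
c = A**alpha * B**(1.0 - alpha)      # blended scale, as in the formula above

X2 = X / c                           # X'' = X' diag(c)^{-1}
W2 = W * c                           # weights absorb the inverse scaling

# the product is preserved, so the transform can be folded in offline
assert np.allclose(X @ W.T, X2 @ W2.T)
```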
4. Integrated DSFQ Quantization Pipelines
VGGT (Visual Geometry Grounded Transformer) Flow
Pseudocode for each layer:
```
Input:  full-precision weights W ∈ ℝ^{d_out × d_in},
        calibration activations {X^ℓ}, ℓ ∈ calib
Hyper-parameters: α ∈ [0, 1]

1. Draw/fix a Hadamard matrix H ∈ {±1}^{d_in × d_in}
2. W′ ← W H
3. For each X^ℓ:  X′^ℓ ← X^ℓ H
4. For i in 1…d_in:
       A_i ← max_{ℓ,j} |X′^ℓ_{j,i}|
       B_i ← max_k |W′_{k,i}|
       ĉ_i ← A_i^α · B_i^{1−α}
5. X″^ℓ ← X′^ℓ · diag(ĉ)^{-1}
   W″   ← W′ · diag(ĉ)
6. Quantize W″ per output channel; quantize X″ per token
7. Fuse H and diag(ĉ) into adjacent layers at inference
```
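The flow above can be rendered as runnable numpy. This is a sketch using fake quantization (quantize, then immediately dequantize) for illustration; `dsfq_layer` and `quantize_sym` are hypothetical names, not the authors' API.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_sym(t, bits, axis):
    # symmetric fake-quantization with one scale per slice along `axis`
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

def dsfq_layer(W, X, alpha=0.5, bits=8):
    # hypothetical rendering of the per-layer flow above
    H = hadamard(W.shape[1])
    W1, X1 = W @ H, X @ H                    # steps 2-3: global rotation
    A = np.max(np.abs(X1), axis=0)           # step 4: per-channel ranges
    B = np.max(np.abs(W1), axis=0)
    c = A**alpha * B**(1.0 - alpha)
    X2, W2 = X1 / c, W1 * c                  # step 5: local rescaling
    Wq = quantize_sym(W2, bits, axis=1)      # step 6: per output channel
    Xq = quantize_sym(X2, bits, axis=1)      #         per token
    return Xq @ Wq.T                         # step 7's fusion is implicit here

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(40, 64))
X[:, 5] += 50.0                              # salient-token-style outlier channel

ref = X @ W.T
out = dsfq_layer(W, X)
rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
```

Because the rotation is orthogonal and the rescaling cancels across activations and weights, any residual error in `rel_err` is pure quantization noise.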
LLM (Dual-Grained Quantization) Flow
Key steps:
- Fine-grained INT4 quantization: group input channels, minimize error for scale/zero-point via grid search.
- Dequantize to INT8 at group level using per-group scales, enabling hardware-efficient INT8 GEMM.
- Activation smoothing by percentile clipping and channel-wise scale division.
- Two-phase grid search for group- and channel-wise scales, parallelized across groups/channels.
Both pipelines allow hardware-friendly integration, matching fine-grained accuracy to GEMM-friendly coarse representation.
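A toy numpy sketch of the dual-grained idea: INT4 values with per-group scales are re-expressed as a single INT8 matrix with one floating-point scale per output channel, which is what makes an ordinary INT8 GEMM applicable. The specific scale decomposition below (integer group multiplier × per-channel float scale) is a simplification for illustration, not DGQ's actual two-phase grid search.

```python
import numpy as np

rng = np.random.default_rng(3)
d_out, d_in, G = 8, 128, 32                 # group size G along input channels
W = rng.normal(size=(d_out, d_in))
Wg = W.reshape(d_out, d_in // G, G)

# fine grain: symmetric INT4 per group
s4 = np.max(np.abs(Wg), axis=2, keepdims=True) / 7.0
q4 = np.clip(np.round(Wg / s4), -8, 7)

# coarse grain: re-express each group scale as (small integer) x
# (one fp scale per output channel), so q4 * integer still fits in INT8
# -- a hypothetical decomposition, not the paper's exact procedure
s_chan = np.max(s4, axis=1, keepdims=True) / 8.0
mult = np.clip(np.round(s4 / s_chan), 1, 8)
q8 = q4 * mult                               # values lie within [-64, 56] ⊂ INT8
W_hat = (q8 * s_chan).reshape(d_out, d_in)   # dequantize with one scale per channel

rel_err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)
```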
5. Theoretical Analysis of Quantization Error
DSFQ provides a provable reduction in quantization error via a $\mu$-coherence bound. For a vector $x \in \mathbb{R}^d$, the coherence $\mu(x) = \sqrt{d}\,\|x\|_\infty / \|x\|_2 \in [1, \sqrt{d}]$ controls the worst-case round-to-nearest error at $b$ bits: $\|x - Q(x)\|_2 \le \mu(x)\,\|x\|_2 / (2^b - 1)$.
Global Hadamard rotation reduces $\mu(x)$ (for a randomized rotation, to $O(\sqrt{\log d})$ with high probability); local channel smoothing yields roughly equal $\mu$ across channels. Thus, DSFQ achieves lower error bounds than naive PTQ: the smoothed distribution makes quantization robust to large but rare outliers and to cross-channel variance (Feng et al., 25 Sep 2025).
A plausible implication is that similar theoretical guarantees could generalize to any architecture where activation or weight distributions suffer from heavy-tailed statistics or channel/group variance.
6. Implementation Characteristics and Hardware Integration
DSFQ’s practical design includes:
- Selection of $\alpha \in [0, 1]$ for scale blending (Feng et al., 25 Sep 2025).
- Calibration of scales and quant steps via gradient descent using a small set of samples (≈40 for VGGT, 1000 tokens for LLM).
- Block-wise quantization for deep networks, with calibration error minimization.
- Outlier-resistant calibration sets provided by Noise-Filtered Diverse Sampling (for VGGTs), enhancing cross-view stability.
- In LLMs, conversion of all group-wise INT4 quantized weights into a single INT8 matrix allows use of existing optimized kernels such as CUTLASS INT8 GEMM (Zhang et al., 2023).
The smoothing and scale factors are fused at runtime, incurring essentially zero inference overhead and ensuring rapid inference speeds commensurate with hardware expectations. DSFQ thus overcomes the hardware inefficiency typical of prior fine-grained quantization deployments.
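The zero-overhead claim rests on folding $H$ and $\operatorname{diag}(\hat c)$ into neighboring weight matrices offline. A numpy sketch for a two-layer linear chain (no nonlinearity between the layers, which is an assumption of this toy example): the producer emits already-smoothed activations, the consumer absorbs the inverse transform, and the end-to-end output is unchanged.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(4)
d = 64
x = rng.normal(size=(16, d))
W_prev = rng.normal(size=(d, d))    # layer producing the activations
W_next = rng.normal(size=(d, d))    # layer consuming them
c = rng.uniform(0.5, 2.0, size=d)   # smoothing scales from calibration
H = hadamard(d)

# reference: the original full-precision chain
y = x @ W_prev.T
z_ref = y @ W_next.T

# offline fusion: fold H diag(1/c) into the producer's weights and
# its inverse, H diag(c), into the consumer's weights
Wp_f = (1.0 / c)[:, None] * (H.T @ W_prev)   # diag(1/c) Hᵀ W_prev
Wn_f = (W_next @ H) * c                      # W_next H diag(c)

y_s = x @ Wp_f.T        # smoothed activations, ready to quantize
z = y_s @ Wn_f.T        # output identical to the unsmoothed chain
assert np.allclose(z, z_ref)
```

No extra matrix multiply happens at inference time: the rotation and rescaling live entirely inside the pre-transformed weights.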
7. Empirical Evaluation and Comparative Ablations
DSFQ has been extensively benchmarked:
VGGTs (Camera Pose, Point-Map)
| Setting | Metric | Value |
|---|---|---|
| 16/16 FP | AUC@30 (Co3Dv2/20) | 90.0 |
| 8/8 Naive PTQ | AUC@30 | ≈89.1 |
| 8/8 DSFQ | AUC@30 | 89.6 |
| 4/4 Naive PTQ | AUC@30 | ≈81.6 |
| 4/4 DSFQ | AUC@30 | 88.2 |
On DTU point map estimation:
- DSFQ: Accuracy = 1.282 vs. FP = 1.185 (mean distance; lower is better).
Memory and speed gains:
- VGGT-1B, 8-bit: ≈1.93× memory reduction, ≈2.17× speedup.
- VGGT-1B, 4-bit: ≈3.65× memory reduction, ≈2.49× speedup.
Ablation studies:
- Without smoothing: AUC@30 = 76.9
- Hadamard only: 83.6
- Scaling only: 81.9
- Both (DSFQ): 86.9
- Dynamic token-wise granularity: 86.9 (highest).
LLMs (OPT and LLaMA Families, 125M–176B)
| Setting | Metric | Value |
|---|---|---|
| A8W4 DGQ | PPL loss over FP16 | +0.17–0.25 |
| A8W4 SmoothQuant | Relative accuracy | lower |
| Speed (A8W4 vs A16W4) | End-to-end inference | 3.24× |
| Memory (A8W4 vs A16W4) | Total reduction | 1.12× |
Dynamic DSFQ is within 0.5% of FP16 zero-shot accuracy; static DSFQ within 1–2%.
In summary, DSFQ achieves near-full-precision accuracy at extreme compression levels (W4A4, A8W4), with negligible overhead, surpassing prior generic quantization methods in both predictive fidelity and hardware efficiency (Feng et al., 25 Sep 2025, Zhang et al., 2023).