
SmoothQuant: Efficient PTQ for Large Models

Updated 5 February 2026
  • SmoothQuant is a training-free post-training quantization technique that transfers activation outlier effects to weights, enabling efficient low-bit compression.
  • It leverages a channel-wise re-parameterization of linear layers using calibrated scaling factors to balance quantization difficulty, ensuring near-lossless accuracy at 8-bit and 4-bit precisions.
  • Extensions like SmoothQuant+, MXINT formats, and ViT adaptations enhance inference speed and reduce memory usage while maintaining state-of-the-art model performance.

SmoothQuant is a mathematically exact, training-free post-training quantization (PTQ) technique for large neural models. Its core principle is to migrate activation outlier-induced quantization error into the model’s weights, enabling efficient and accurate quantization with minimal architectural or inference-time changes. SmoothQuant has been widely validated for LLMs, vision transformers (ViTs), and hybrid quantization pipelines, achieving near-lossless compression at 8-bit and 4-bit precision for activations and weights, respectively (Xiao et al., 2022, Pan et al., 2023, Tai et al., 2024, Sharify et al., 2024).

1. Quantization Challenges and Motivation

Quantization reduces model precision to accelerate inference and lower memory requirements, but transformer architectures commonly present two technical hurdles:

  • Weight Quantization: Transformer weights typically have relatively uniform per-channel dynamic ranges, so low-bit quantization (e.g., INT8) induces little accuracy loss.
  • Activation Quantization: Activations frequently exhibit long-tailed outlier channels, with values 10–100× larger than median channels. Standard per-tensor quantization is dominated by these outliers: nearly all non-outlier channels collapse to a few quantization bins, resulting in substantial representational error.
  • Performance Constraint: High-throughput inference kernels (e.g., INT8 GEMM) require a uniform per-tensor scaling factor; per-channel or per-token activation scaling cannot be fused into the matmul on most hardware, precluding finer-grained dynamic quantization.

SmoothQuant directly addresses the contrast between the “easy to quantize” weights and the “outlier-prone” activations by mathematically transferring quantization difficulty from activations to weights (Xiao et al., 2022, Sharify et al., 2024).
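The outlier failure mode is easy to reproduce. The following NumPy sketch uses synthetic activations (the channel index and 100× magnitude are illustrative assumptions, not real model statistics) to show how a single outlier channel dominates a per-tensor scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic activations: 7 well-behaved channels plus one outlier channel
# roughly 100x larger (illustrative magnitudes, not real model statistics).
X = rng.normal(0.0, 1.0, size=(64, 8))
X[:, 3] *= 100.0  # the outlier channel

def quantize_per_tensor(x, bits=8):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

err = np.abs(X - quantize_per_tensor(X)).mean(axis=0)  # per-channel MAE
rel_err = err / np.abs(X).mean(axis=0)                 # error relative to magnitude
# The outlier channel sets the scale, so the non-outlier channels collapse
# into a few quantization bins and suffer large *relative* error.
print(np.round(rel_err, 3))
```

The outlier channel itself quantizes well (its relative error is tiny), while every other channel loses most of its resolution, which is exactly the imbalance SmoothQuant exploits.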

2. Mathematical Formulation and Smoothing Principle

SmoothQuant operates on individual linear layers. For an input activation matrix $X \in \mathbb{R}^{N \times C_{\text{in}}}$ and weight matrix $W \in \mathbb{R}^{C_{\text{in}} \times C_{\text{out}}}$, the canonical forward pass is $Y = XW$. The re-parameterization is:

$Y = XW = (X\,\mathrm{diag}(s)^{-1})\,(\mathrm{diag}(s)\,W) \equiv \hat{X}\hat{W}$

where $s \in \mathbb{R}^{C_{\text{in}}}$ is a positive, channel-wise scaling vector. The choice of $s$ balances the quantization difficulty according to:

$s_j = \dfrac{a_j^{\alpha}}{w_j^{1-\alpha}}$

where $a_j = \max |X_{:,j}|$ (activation range over calibration data), $w_j = \max |W_{j,:}|$ (weight row range), and $\alpha \in [0,1]$ is a tuning hyperparameter (the “migration strength”). $\alpha = 0$ migrates all difficulty to the activations, $\alpha = 1$ to the weights. $\alpha \approx 0.5$ typically balances both, minimizing overall quantization loss (Xiao et al., 2022, Sharify et al., 2024, Pan et al., 2023).

The smoothing operation preserves mathematical equivalence; it is an invertible re-parameterization of inference-time computations with no training or bias correction needed.
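The exactness and the range-balancing effect are both easy to verify numerically. The NumPy sketch below (synthetic data; the outlier channel and its 50× magnitude are illustrative assumptions) applies the $s_j$ formula and checks that $\hat{X}\hat{W}$ reproduces $XW$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in, C_out = 32, 16, 8
X = rng.normal(size=(N, C_in))
X[:, 5] *= 50.0                       # one synthetic outlier channel
W = rng.normal(size=(C_in, C_out))

alpha = 0.5                           # migration strength
a = np.abs(X).max(axis=0)             # a_j: per-channel activation maxima
w = np.abs(W).max(axis=1)             # w_j: per-row weight maxima
s = a**alpha / w**(1.0 - alpha)       # channel-wise smoothing factors

X_hat = X / s                         # X diag(s)^{-1}
W_hat = W * s[:, None]                # diag(s) W

exact = np.allclose(X_hat @ W_hat, X @ W)   # re-parameterization is lossless
spread_before = a.max() / a.min()
a_hat = np.abs(X_hat).max(axis=0)
spread_after = a_hat.max() / a_hat.min()
print(exact, spread_before, spread_after)   # channel ranges become far more uniform
```

After smoothing, the per-channel activation maxima become $(a_j w_j)^{1/2}$ (for $\alpha = 0.5$), so the dynamic-range spread across channels shrinks dramatically while the layer output is bit-for-bit the same up to floating-point error.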

3. Algorithmic Workflow and Integration Steps

SmoothQuant consists of the following sequential steps:

  1. Calibration: Run a small, unlabeled calibration sample through the model to capture the per-channel maxima $a_j$ and $w_j$ for each linear layer.
  2. Smoothing Factor Calculation: Compute $s_j$ per channel using the chosen $\alpha$:

$s_j = a_j^{\alpha} / w_j^{1-\alpha}$

  3. Re-parameterization:
    • Replace $W \to \mathrm{diag}(s)\,W$ for each linear layer.
    • Fuse $X \to X\,\mathrm{diag}(s)^{-1}$ upstream into previous operations (e.g., LayerNorm or prior linear transforms) to avoid any additional inference cost.
  4. Quantization: Apply standard symmetric per-tensor quantization (INT8 or another desired format, bit-width $b$) to $\hat{X}$ and $\hat{W}$:

$\Delta = \dfrac{\max|\hat{X}|}{2^{b-1}-1},\quad \bar{X} = \mathrm{round}(\hat{X}/\Delta)\cdot\Delta$

Analogously for $\hat{W}$.

  5. Deployment: At runtime, use standard per-tensor low-precision GEMM operators. For INT8, only a single scale per activation/weight tensor is required.
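The workflow above can be simulated end-to-end with fake quantization (quantize-dequantize). This NumPy sketch uses synthetic data and compares the matmul error of naive per-tensor INT8 against SmoothQuant-style smoothed INT8; magnitudes and shapes are illustrative assumptions:

```python
import numpy as np

def fake_quant(t, bits=8):
    """Simulated symmetric per-tensor quantization (quantize-dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    delta = np.abs(t).max() / qmax
    return np.clip(np.round(t / delta), -qmax, qmax) * delta

def smooth(X, W, alpha=0.5):
    """Steps 1-3: calibrate per-channel maxima and re-parameterize."""
    a = np.abs(X).max(axis=0)          # activation range per input channel
    w = np.abs(W).max(axis=1)          # weight range per input channel
    s = a**alpha / w**(1.0 - alpha)    # smoothing factors
    return X / s, W * s[:, None]       # X diag(s)^{-1}, diag(s) W

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 32)); X[:, 0] *= 80.0   # synthetic outlier channel
W = rng.normal(size=(32, 16))
Y = X @ W

# Step 4-5: naive per-tensor INT8 vs. smoothed per-tensor INT8.
err_naive = np.abs(fake_quant(X) @ fake_quant(W) - Y).mean()
X_hat, W_hat = smooth(X, W)
err_smooth = np.abs(fake_quant(X_hat) @ fake_quant(W_hat) - Y).mean()
print(err_naive, err_smooth)   # smoothing should cut the output error substantially
```

In a real deployment the division by $s$ is fused into the preceding LayerNorm and the quantized GEMM runs on integer tensors; the fake-quant version here only measures the numerical error the integer pipeline would incur.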

Hardware integration is straightforward; SmoothQuant directly supports PyTorch+Cutlass, NVIDIA FasterTransformer, Intel MKL-DT, and any platform capable of INT8 GEMM (Xiao et al., 2022, Sharify et al., 2024).

4. Extensions: 4-bit Quantization, Microscaling, and Vision Transformers

4.1. SmoothQuant+

SmoothQuant+ adapts the smoothing principle to weight-only, group-wise 4-bit quantization of LLMs. After applying channel-wise smoothing to activations and weights as above, weights are quantized group-wise (typical group size $G = 128$) using per-group scales and zero-points:

$\Delta_g = \dfrac{\max W'_g - \min W'_g}{15},\quad Z_g = \mathrm{round}(-\min W'_g / \Delta_g)$

The migration strength $\alpha$ is selected by grid search against evaluation-set error (e.g., HumanEval). SmoothQuant+ is implemented in vLLM, supporting W4A16 kernels and yielding lossless, state-of-the-art 4-bit weight quantization for Code Llama-34B (Pan et al., 2023).
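A minimal sketch of the group-wise 4-bit quantizer described by the $\Delta_g$/$Z_g$ formulas above (NumPy; the group layout along the input-channel axis and the clipping details are simplifying assumptions):

```python
import numpy as np

def quantize_groupwise_4bit(W, group_size=128):
    """Asymmetric group-wise 4-bit quantization with per-group scale/zero-point.

    W is the (already smoothed) weight matrix; here it is simply flattened
    into contiguous groups of `group_size` elements.
    """
    flat = W.reshape(-1, group_size)
    w_min = flat.min(axis=1, keepdims=True)
    w_max = flat.max(axis=1, keepdims=True)
    delta = (w_max - w_min) / 15.0                       # 2^4 - 1 levels
    zero = np.round(-w_min / delta)                      # per-group zero-point
    codes = np.clip(np.round(flat / delta) + zero, 0, 15).astype(np.uint8)
    W_deq = ((codes - zero) * delta).reshape(W.shape)    # dequantized weights
    return codes, W_deq

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 128))
codes, W_deq = quantize_groupwise_4bit(W)
print("max abs error:", np.abs(W - W_deq).max())   # bounded by ~one group step
```

Each group stores only 4-bit codes plus one scale and one zero-point, which is where the 4× memory reduction over FP16 comes from (the per-group metadata amortizes over 128 elements).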

4.2. Microscaling (MXINT) Formats

Microscaling quantization combines SmoothQuant’s smoothing with MXINT formats (e.g., MXINT4-128), which compactly store $b$ elements per microblock as $d$-bit integers with a shared exponent. After smoothing, weight tensors are partitioned into blocks, and each block is quantized using its group maximum for scaling. This enables near-baseline perplexity for LLaMA and OPT models at 4–6 bits (Sharify et al., 2024).
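A toy block quantizer in the spirit of MXINT (NumPy; the shared power-of-two exponent derived from the block maximum follows the description above, while the exact MX element format and exponent encoding are simplified assumptions):

```python
import numpy as np

def mxint_quantize(block, bits=4):
    """Toy MXINT-style block quantization: a shared power-of-two scale
    derived from the block maximum, plus low-bit signed integer mantissas."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit elements
    shared_exp = np.ceil(np.log2(np.abs(block).max() / qmax))
    scale = 2.0 ** shared_exp                       # shared power-of-two scale
    q = np.clip(np.round(block / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 128))                       # four 128-element microblocks
errors = []
for block in W:
    q, scale = mxint_quantize(block)
    errors.append(np.abs(block - q * scale).max())
print(max(errors))   # per-element error stays within half a quantization step
```

Because the scale is a power of two shared across the whole block, dequantization is a bit-shift rather than a multiply, which is what makes the format hardware-friendly.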

4.3. SmoothQuant for Vision Transformers

In ViTs, post-LayerNorm activations display both large outliers and distribution asymmetry. The SQ-b extension introduces a bias term $\mu_j = \mathrm{mean}(Y_j)$ to recenter activations, sharply reducing symmetric clamping loss. Further, the OPT-m method automatically divides post-GeLU activations into quantization regions with hardware-friendly scaling according to calibration statistics. Combined with greedy bit-width selection (Greedy MP), post-training quantization at 4/5 bits achieves up to 23% accuracy improvement over baselines on ImageNet (Tai et al., 2024).
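The benefit of SQ-b-style recentering can be illustrated on synthetic asymmetric activations (NumPy; the offset and shapes are illustrative assumptions, and at deployment the subtracted mean would be folded into the following layer's bias):

```python
import numpy as np

def fake_quant_sym(t, bits=8):
    """Symmetric per-tensor quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    delta = np.abs(t).max() / qmax
    return np.clip(np.round(t / delta), -qmax, qmax) * delta

rng = np.random.default_rng(0)
# Strongly asymmetric activations, loosely mimicking post-GeLU/LayerNorm
# statistics in a ViT (synthetic data).
Y = rng.normal(loc=5.0, scale=1.0, size=(256, 64))

mu = Y.mean(axis=0)                       # per-channel bias term mu_j
err_plain = np.abs(fake_quant_sym(Y) - Y).mean()
err_recentered = np.abs(fake_quant_sym(Y - mu) + mu - Y).mean()
print(err_plain, err_recentered)          # recentering shrinks the symmetric range
```

Subtracting $\mu_j$ roughly halves the symmetric dynamic range here, so the same bit budget yields proportionally finer quantization steps.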

5. Empirical Performance and Results

SmoothQuant demonstrates minimal loss in accuracy across a spectrum of model sizes:

| Model | FP16 Accuracy / Perplexity | SmoothQuant Accuracy / Perplexity |
|---|---|---|
| OPT-175B | 66.9% | 66.8% |
| BLOOM-176B | 68.2% | 67.4% |
| GLM-130B | 73.8% | 72.8% |
| LLaMA-2-70B | PPL = 6.17 | PPL = 6.20 |
| Falcon-40B | PPL = 5.228 | PPL = 5.255 |
| Mistral-7B | PPL = 5.253 | PPL = 5.277 |

Key efficiency findings include up to 1.6× speedup and 2× memory reduction for LLM inference (e.g., 4 GPUs instead of 8 for OPT-175B at similar or better latency), and deployment of models as large as 530B parameters within a single node (Xiao et al., 2022, Pan et al., 2023). For Code Llama-34B, SmoothQuant+ achieves equal or slightly better HumanEval pass@1 and average multilingual scores compared to FP16, with throughput gains of 1.9–4× while using only 25% of the FP16 memory footprint (Pan et al., 2023).

6. Limitations, Best Practices, and Interaction with Other Methods

  • Calibration: Performed statically; adaptive rescaling may be required for long-context or domain-shifting inputs. Per-tensor calibration is most appropriate; per-channel quantization, when feasible, may obviate the need for smoothing.
  • Scope: Optimally applied to “activation–weight” multiplications; “activation–activation” scenarios (e.g., attention scores) show negligible or negative benefit (Sharify et al., 2024).
  • Format Choices: $\alpha \approx 0.5$ is generally robust. Pushing all migration onto either activations or weights degrades performance.
  • Pipeline Integration: Efficient for deployment-only (PTQ) workflows; non-linear functions (LayerNorm, Softmax) are not quantized by SmoothQuant and remain FP16/FP32.
  • Combinatorics: At 4 bits, composes effectively with Hessian-aware quantization (GPTQ) for further improvements, particularly on small models (Sharify et al., 2024).
  • Overhead: Storage of per-channel scales is negligible even for billion-parameter models (a few MB in FP32/FP16).
  • Limitation: Does not address compression of non-linear layers and may require hybrid or dynamic strategies for models beyond ~530B parameters (Xiao et al., 2022, Pan et al., 2023).

7. Impact and Recent Directions

SmoothQuant serves as the canonical method for enabling practical, high-throughput low-bit quantization of transformers without requiring retraining. It has catalyzed several extensions:

  • SmoothQuant+ for lossless group-wise 4-bit weight quantization (Pan et al., 2023)
  • Bias-corrected SQ-b and multi-regional OPT-m methods for extreme quantization of ViTs (Tai et al., 2024)
  • Efficient pairing with MXINT formats and Hessian-aware quantization for ultra-compact LLM storage and inference (Sharify et al., 2024)

SmoothQuant’s mathematically exact smoothing strategy remains integral to state-of-the-art transformer quantization pipelines in both NLP and vision domains. Its post-training, algebraically lossless reparameterization preserves accuracy while maximizing hardware efficiency, setting the best-practice baseline for large-scale model deployment.
