
Weight-Only Quantization Techniques

Updated 6 February 2026
  • Weight-only quantization is a parameter compression technique that converts full-precision neural network weights to low-bit representations while keeping activations in high precision.
  • It leverages methods like symmetric, power-of-two, and mixed-precision schemes to reduce memory bandwidth and storage with minimal accuracy trade-offs.
  • This approach is crucial for efficient inference in large models and edge deployments, with optimization workflows and hardware-friendly kernels driving its practical success.

Weight-only quantization is a class of parameter compression techniques in deep neural networks that reduces the storage and memory-bandwidth requirements by mapping full-precision weight tensors to low-bitwidth representations, while leaving all computational activations (and often key–value caches in LLMs) in high precision. This approach targets memory- and bandwidth-bound inference regimes, especially in modern LLMs, vision transformers, and efficient edge deployments. Across deep learning, weight-only quantization enables substantial reductions in model size and inference latency with minimal accuracy loss, provided the quantizer is adapted to the complex distributions of weight tensors encountered in state-of-the-art architectures.

1. Theoretical Foundations and Problem Formulation

In canonical weight-only quantization, for a neural network with $L$ linear or convolutional layers, each with weights $W_\ell \in \mathbb{R}^{d_\mathrm{out} \times d_\mathrm{in}}$, a quantization function $Q_b(w)$ replaces full-precision weights with low-precision representations as follows: $$Q_b(w) = \mathrm{round}\!\left(\frac{w}{\Delta_b}\right)\Delta_b$$ where $b$ denotes the bit-width and $\Delta_b$ is a step size, commonly chosen per layer or per group to cover the dynamic range of the data (Lee et al., 15 Sep 2025).
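As a concrete illustration, here is a minimal round-to-nearest quantizer matching $Q_b(w)$ above, assuming a max-absolute-value rule for choosing the step size $\Delta_b$ per group (the function name and the step-size rule are illustrative, not from the cited work):

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest step quantizer: Q_b(w) = round(w / delta) * delta.

    The step size delta is chosen per group from the maximum absolute
    weight, so the grid covers the group's dynamic range symmetrically.
    """
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 positive levels for 4-bit
    delta = max(abs(w) for w in weights) / qmax
    return delta, [round(w / delta) * delta for w in weights]

# Every reconstructed weight lies within delta/2 of the original.
delta, w_q = quantize_symmetric([0.31, -0.82, 0.05, 0.47], bits=4)
```

The round-to-nearest property bounds per-weight error by $\Delta_b / 2$, which is why the step-size choice (per tensor vs. per group) dominates accuracy at low bitwidths.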

Mixed-precision quantization extends this to assign a different bit-width $b_\ell$ to each layer, solving $$\min_{\mathbf{b} \in \mathcal{B}^L} E(\mathbf{b}) \qquad \text{s.t.} \quad \sum_{\ell=1}^L \mathrm{Size}(b_\ell) \leq M_{\max},$$ with $E(\mathbf{b})$ the model quality degradation metric and $M_{\max}$ the memory budget. This is a combinatorial, NP-hard optimization due to the discrete nature of $\mathcal{B}$ and the layerwise assignment (Lee et al., 15 Sep 2025).

Rate–distortion theory offers a lower bound for Gaussianized weights: $$D(R) = \sigma^2\, 2^{-2R}$$ where $R$ is the effective bitwidth and $D$ is the minimal achievable mean squared error (Lee et al., 24 Sep 2025). Schemes that approach or exploit this bound through fractional-bit quantizers are optimal in the information-theoretic sense for normalized (rotated or Gaussianized) weight distributions.
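As a quick numerical sanity check (illustrative, not from the cited work): an ordinary uniform scalar quantizer on Gaussian samples always lands above the $D(R) = \sigma^2 2^{-2R}$ floor, and that gap is what fractional-bit and vector schemes try to close. The clipping range and sample count below are arbitrary choices:

```python
import random

def uniform_mse(xs, bits, clip=4.0, sigma=1.0):
    """Empirical MSE of a b-bit uniform scalar quantizer on [-clip*sigma, clip*sigma]."""
    delta = 2 * clip * sigma / (2 ** bits)
    err = 0.0
    for x in xs:
        q = max(-clip * sigma, min(clip * sigma, round(x / delta) * delta))
        err += (x - q) ** 2
    return err / len(xs)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(50_000)]
# Ratio of achieved MSE to the rate-distortion floor sigma^2 * 2^(-2R):
gaps = {b: uniform_mse(samples, b) / 2 ** (-2 * b) for b in (2, 3, 4)}
```

Each ratio in `gaps` exceeds 1, confirming the scalar quantizer sits strictly above the information-theoretic bound.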

2. Quantization Schemes and Algorithms

Several families of weight-only quantization algorithms have been developed:

Symmetric/Uniform Quantization: Assigns uniformly spaced grid levels to weights, typically per block, channel, or entire tensor. In the general asymmetric (affine) form, the scale and zero point are $$s = \frac{\max(W) - \min(W)}{2^b - 1}, \qquad z = \mathrm{round}\!\left(-\min(W)/s\right),$$ and quantization applies

$$Q(w) = s \cdot \left(\mathrm{clip}\bigl(\mathrm{round}(w/s) + z,\; 0,\; 2^b - 1\bigr) - z\right)$$

This scheme is widely used in "RTN" (round-to-nearest), basic QAT, and PTQ flows (Cheng et al., 2023, Lin et al., 2023, Lybrand et al., 2020).
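A minimal sketch of the affine RTN flow, using the scale and zero point defined above (function names are illustrative; real implementations pack codes into sub-byte storage):

```python
def rtn_quantize(weights, bits):
    """Asymmetric round-to-nearest: returns int codes plus (scale, zero_point)."""
    wmin, wmax = min(weights), max(weights)
    s = (wmax - wmin) / (2 ** bits - 1)          # scale
    z = round(-wmin / s)                          # integer zero point
    qmax = 2 ** bits - 1
    codes = [min(qmax, max(0, round(w / s) + z)) for w in weights]
    return codes, s, z

def rtn_dequantize(codes, s, z):
    """Reconstruct weights: Q(w) = s * (code - z)."""
    return [s * (q - z) for q in codes]

codes, s, z = rtn_quantize([-0.5, 0.0, 0.25, 1.0], bits=4)
w_hat = rtn_dequantize(codes, s, z)
```

The min and max of the tensor map to codes 0 and $2^b - 1$, so the full grid is spent on the observed range; reconstruction error per weight is at most $s/2$.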

Power-of-Two (PoT) and Additive PoT (APoT): Weights are mapped to the nearest power-of-two value (or a sum of several powers of two in APoT): $$Q_{\mathrm{PoT}}(w) = \operatorname{sign}(w) \cdot 2^{\mathrm{round}(\log_2 |w|)}$$ This approach matches the hardware requirements of bitshift-based multipliers and improves accuracy at $b \leq 4$ compared to uniform quantization (Przewlocka-Rus et al., 2022).
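A direct sketch of the PoT mapping above; the `min_exp` clipping floor for tiny weights is an added assumption, not part of the formula:

```python
import math

def pot_quantize(w, min_exp=-8):
    """Map w to the nearest power of two: sign(w) * 2^round(log2|w|).

    Exponents are clipped below at min_exp (an implementation choice here);
    exact zeros stay zero since log2(0) is undefined.
    """
    if w == 0.0:
        return 0.0
    e = max(round(math.log2(abs(w))), min_exp)
    return math.copysign(2.0 ** e, w)
```

Because every quantized value is a signed power of two, multiplying an activation by a weight reduces to a bit shift plus an optional negation in hardware.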

Activation-Aware and Outlier-Aware Quantization: Methods such as AWQ and GWQ identify and protect salient weight channels either via activation statistics or by direct gradient sensitivity analysis:

  • AWQ upweights the scale of the most activation-important input channels, preserving their representational capacity during low-bit quantization (Lin et al., 2023).
  • GWQ computes gradients $\nabla_W \mathcal{L}(W; D_c)$ on one or a few calibration samples and retains the top 1% of weights with the highest gradient magnitude in higher precision, quantizing the rest (Shao et al., 2024).
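The GWQ-style selection step can be sketched as follows; the ranking rule (top fraction by gradient magnitude) follows the description above, while the function and its signature are illustrative:

```python
def saliency_split(weights, grads, keep_frac=0.01):
    """Return the index set of salient weights to keep in high precision,
    ranked by gradient magnitude; all other indices get quantized."""
    n_keep = max(1, int(len(weights) * keep_frac))
    order = sorted(range(len(weights)), key=lambda i: abs(grads[i]), reverse=True)
    return set(order[:n_keep])

weights = [0.1, -0.4, 0.03, 0.9, -0.2]
grads = [0.01, 0.5, 0.02, -0.03, 0.004]
salient = saliency_split(weights, grads, keep_frac=0.2)  # top 20% of 5 weights -> 1 index
```

Note that saliency is driven by the gradients, not the weight magnitudes: the largest weight (0.9) is not retained here, while the weight with the largest gradient is.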

Density-Aware Quantization (DAQ): Aligns the dynamic range to the densest 95% of weights (excluding outliers) via quantile estimation, then fine-tunes the quantization parameters $(s, z)$ directly to minimize layerwise reconstruction loss with sign-SGD optimization (Luo et al., 2024).
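A sketch of the range-selection step, assuming the excluded 5% of outliers is split symmetrically between the two tails (an assumption for illustration; DAQ's actual density estimate may differ):

```python
def dense_range(weights, coverage=0.95):
    """Clip the quantization range to the densest `coverage` fraction of
    weights by dropping the extreme (1 - coverage)/2 tail on each side."""
    xs = sorted(weights)
    tail = (1.0 - coverage) / 2.0
    lo = xs[int(tail * (len(xs) - 1))]
    hi = xs[int((1.0 - tail) * (len(xs) - 1))]
    return lo, hi

# A single extreme outlier no longer inflates the range:
weights = [i / 100 for i in range(-50, 50)] + [100.0]
lo, hi = dense_range(weights)
```

Without the quantile clip, the outlier at 100.0 would stretch the grid so far that all the dense small weights collapse onto a handful of levels.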

Low-Rank and Flexible Scaling: LRQ learns a low-rank scaling matrix $S = UV^T$ to tailor individual weight scaling with dramatically fewer parameters than full-rank scaling, reducing overfitting and enhancing generalization in low-bit regimes (Lee et al., 2024).
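The low-rank factorization cuts the scale-parameter count from $d_\mathrm{out} \cdot d_\mathrm{in}$ to $r(d_\mathrm{out} + d_\mathrm{in})$. A minimal sketch of materializing $S = UV^T$ as an element-wise scaling matrix (illustrative helper, not LRQ's code):

```python
def low_rank_scale(U, V):
    """Materialize S = U V^T, i.e. S[i][j] = sum_k U[i][k] * V[j][k].

    U has shape (d_out, r) and V has shape (d_in, r); the result gives a
    per-element scale from only r * (d_out + d_in) learned parameters.
    """
    return [[sum(u_ik * v_jk for u_ik, v_jk in zip(Ui, Vj)) for Vj in V]
            for Ui in U]
```

With rank $r = 1$ on a 2x2 layer, two 2-vectors already determine all four scales, which is the source of the regularization effect in low-bit regimes.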

Blockwise, Groupwise, and Per-Channel Quantization: FineQuant adaptively determines the group size $G$ by a heuristic that balances range expansion against quantization error, storing blockwise scales per column or channel. Per-input-channel (per-IC) quantization isolates activation outliers, which is essential below 4 bits (Heo et al., 2023, Kim et al., 2023).
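Groupwise scaling can be sketched as below, with one max-abs scale per group (a simplification of FineQuant's adaptive heuristic; the function name is illustrative):

```python
def groupwise_quantize(row, bits, group_size):
    """Quantize one weight row in groups of `group_size`, each with its own
    max-abs scale; smaller groups track outliers at the cost of more scales."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for g in range(0, len(row), group_size):
        block = row[g:g + group_size]
        scale = max(abs(w) for w in block) / qmax or 1.0  # guard all-zero block
        out.extend(round(w / scale) * scale for w in block)
    return out

# The outlier 4.0 only inflates the scale of its own group of 2,
# leaving the first group's small weights accurately represented.
out = groupwise_quantize([0.3, -0.7, 4.0, 0.1], bits=4, group_size=2)
```

With a single per-row scale the outlier would dominate the whole row; with group size 2 the first two weights keep a fine grid.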

Probabilistic and Bayesian Approaches: Probabilistic Weight Fixing models each weight as a Gaussian with learned mean and uncertainty, iteratively fixing subsets of weights to cluster centers based on relative distance in Mahalanobis space. This encourages entropy-minimized, robust weight clustering with global codebook assignments (Subia-Waud et al., 2023).

Mixed-Precision and Fractional Bitwidths: AMQ (Lee et al., 15 Sep 2025) and Q-Palette (Lee et al., 24 Sep 2025) construct configurations across per-layer bitwidths and quantizer types, using genetic search (e.g., NSGA-II), surrogate predictors, and information-theoretic assignment to target the Pareto frontier in memory–accuracy and latency–accuracy spaces.

3. Optimization Workflows, Search, and Calibration

The complex search space for optimal quantization assignments in modern LLMs (e.g., $3^{224}$ configurations for Llama-2 7B) prohibits brute-force enumeration. Recent systems develop:

  • Combinatorial Search with NSGA-II: AMQ prunes inert layers via per-layer sensitivity, builds proxy quantized models ($Q^{\mathrm{HQQ}}$), uses a surrogate (RBF) quality predictor, and applies multi-objective evolutionary algorithms to sample and converge on optimal configurations (Lee et al., 15 Sep 2025).
  • Integer Programming for Mixed Scheme Selection: Q-Palette defines quantizer/bitwidth selection as a multiple-choice knapsack problem, optionally incorporating group fusion constraints for hardware kernel fusion (Lee et al., 24 Sep 2025).
  • Heuristic Group Size Selection: FineQuant iteratively shrinks the quantization group while monitoring the blow-up in dynamic range to achieve over 94% BLEU recovery compared to fixed granularity (Kim et al., 2023).
  • Sensitivity and Hessian-Based Scoring: AdaDim algorithmically allocates bits and grouping axes where quantization loss is highest, coupling with blockwise PTQ (RTN, GPTQ) (Heo et al., 2023).
  • SignSGD for Rounding and Clipping: SignRound employs a low-cost 200-step signed gradient optimization of the quantization offsets and clipping parameters, minimizing direct reconstruction loss without introducing inference overhead (Cheng et al., 2023).
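The multiple-choice knapsack formulation behind quantizer/bitwidth selection can be solved exactly by dynamic programming when memory sizes are small integers. A generic sketch (not Q-Palette's implementation; option tuples and the score metric are placeholders for per-layer quality estimates):

```python
def allocate_bits(layers, budget):
    """Multiple-choice knapsack via DP: each layer picks exactly one
    (size, score) option; maximize total score within `budget` size units.

    `layers` is a list of option lists. Returns the best total score,
    or None if no assignment fits the budget.
    """
    dp = {0: 0.0}                      # memory used -> best total score so far
    for options in layers:
        nxt = {}
        for used, score in dp.items():
            for size, gain in options:  # exactly one option per layer
                if used + size <= budget:
                    key = used + size
                    if nxt.get(key, float("-inf")) < score + gain:
                        nxt[key] = score + gain
        dp = nxt
    return max(dp.values()) if dp else None

# Two layers, each choosing 2-bit (cheap, low score) or 4-bit (costly, high score):
layers = [[(2, 1.0), (4, 3.0)], [(2, 2.0), (4, 5.0)]]
best = allocate_bits(layers, budget=6)
```

With budget 6 the optimum mixes precisions (2-bit for layer one, 4-bit for layer two); shrinking the budget to 4 forces both layers to 2 bits.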

Most methods utilize a small calibration dataset (1–100 samples) and minimize layer/block output deviation to preserve output distribution or logits (DAQ, LRQ, SignRound, GWQ), often with straight-through estimator or surrogate gradient methods to overcome quantizer non-differentiability.

4. Hardware, Inference Acceleration, and Kernel Implementation

Weight-only quantization directly reduces memory traffic and enables custom hardware acceleration schemes:

Kernel Fusion and GEMM Optimization: Methods such as TinyChat (AWQ) and FineQuant develop fused int4/int8-to-FP16 GEMM kernels that perform on-the-fly dequantization using block or per-channel scales, minimizing DRAM round-trips and matching arithmetic intensity for memory-bound LLM decoding (Lin et al., 2023, Kim et al., 2023).
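The fused-kernel idea in schematic form: weights stay as integer codes in memory and are scaled back to floating point only inside the inner loop, so no dequantized copy of $W$ is ever written out. This is a pure-Python sketch of the dataflow, not a real GPU kernel:

```python
def dequant_matvec(codes, scales, group_size, x):
    """Compute y = W x where W is stored as int codes plus per-group scales.

    Each weight is dequantized on the fly (code * scale) at the moment it is
    consumed, mimicking the memory traffic pattern of a fused dequant-GEMM.
    """
    y = []
    for row_codes, row_scales in zip(codes, scales):
        acc = 0.0
        for g, s in enumerate(row_scales):
            start = g * group_size
            for j in range(start, min(start + group_size, len(x))):
                acc += (row_codes[j] * s) * x[j]   # dequantize, then multiply
        y.append(acc)
    return y
```

In an actual kernel the codes are loaded as packed int4/int8 tiles and expanded in registers, which is what keeps decoding memory-bound LLM layers fast.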

Shift-Based Accumulation: Power-of-two and APoT quantization replace multipliers in MAC units with barrel shifters and adders, reducing area and energy by up to $6\times$ and $2\times$ respectively compared to uniform 8×8-bit multipliers (Przewlocka-Rus et al., 2022).
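In schematic form, the shift-based MAC's multiply step looks as follows (illustrative helper; a hardware barrel shifter does the same in one cycle):

```python
def pot_multiply(activation_int, weight_exp, weight_sign):
    """Multiply an integer activation by a power-of-two weight sign * 2^exp
    using only a bit shift and an optional negation, no multiplier."""
    if weight_exp >= 0:
        shifted = activation_int << weight_exp
    else:
        shifted = activation_int >> -weight_exp   # arithmetic right shift
    return -shifted if weight_sign < 0 else shifted
```

For example, multiplying by $+2^2$ is a left shift by 2, and multiplying by $-2^{-2}$ is a right shift by 2 followed by negation.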

Batch- and Fusion-Aware Quantization: Q-Palette extends QTIP-style TCQ and VQ fractional-bit quantization to batch sizes up to 16 and enables kernel-level layer fusion for fused multi-head attention and FFN GEMMs (Lee et al., 24 Sep 2025).

Deployment Metrics: AWQ with TinyChat and FineQuant achieve $3.2$–$3.9\times$ throughput increases across the Llama-2, MPT, and Falcon families, with a $4\times$ reduction in weight storage (Lin et al., 2023, Kim et al., 2023). Quantization reduces model size proportionally: a 70B-parameter model can be compressed from $\approx 280$ GB (FP32) to $\approx 17.5$ GB at 2 bits (Cheng et al., 2023).
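The storage figures follow directly from bits-per-parameter arithmetic; the sketch below ignores the scale/zero-point metadata, which adds a small overhead in practice:

```python
def model_size_gb(n_params, bits):
    """Weight storage in GB (10^9 bytes) for n_params weights at `bits` each,
    excluding quantization metadata (scales, zero points, codebooks)."""
    return n_params * bits / 8 / 1e9

fp32_size = model_size_gb(70e9, 32)   # 70B params at 32 bits
int2_size = model_size_gb(70e9, 2)    # same model at 2 bits
```

The ratio between the two is exactly the bitwidth ratio, 16×, which is why weight-only compression translates so directly into reduced memory bandwidth.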

5. Empirical Performance and Benchmarks

Comprehensive studies across architectures and tasks demonstrate that modern weight-only quantization can nearly reach full-precision accuracy at 3–4 bits, and with advanced schemes, even at 2 bits in selected settings.

| Method | Model | Bitwidth | Perplexity (WikiText2) | Zero-Shot Acc (%) | Notes |
|---|---|---|---|---|---|
| FP16 | LLaMA-2-7B | 16 | 5.46 | 70.49 | Baseline |
| AMQ | LLaMA-2-7B | 3 | – | 65.59 | +1.96 over BitStack (Lee et al., 15 Sep 2025) |
| DAQ | LLaMA-2-7B | 4 (NF4) | 5.60 | – | 19.6–22.8% lower perplexity loss vs. AWQ (Luo et al., 2024) |
| AWQ | LLaMA-2-7B | 4 | 5.60 | 70.13 | Near FP16 (Lin et al., 2023) |
| SignRound | Mistral-7B | 4 | – | 62.33 | W4G-1 (Cheng et al., 2023) |
| FineQuant | OPT-175B | 4 | – | – | 3.65× throughput, <0.5 BLEU drop (Kim et al., 2023) |
| GWQ | LLaMA-2-7B | ~4 | 5.53 | 60.58 | 1.2× speedup (Shao et al., 2024) |
| AdaDim+GPTQ | LLaMA-7B | 3 | 9.5 | 49.5 | (Heo et al., 2023) |
| LRQ | LLaMA-2-7B | 3 | 6.48 | 59.07 | State-of-the-art weight-only (Lee et al., 2024) |

Results confirm that post-training, QAT, low-rank, and mixed-precision methods can all achieve near-baseline accuracy for mainstream LLMs and vision models in the 3–4 bit regime. AMQ consistently outperforms previous any-size (BitStack, PB-LLM) and uniform baselines under matched memory constraints (Lee et al., 15 Sep 2025). Q-Palette’s mixed-scheme allocation with fractional bitwidth pushes the Pareto frontier in both memory/perplexity and latency/perplexity (Lee et al., 24 Sep 2025).

6. Best Practices, Limitations, and Open Challenges

Practical recommendations:

  • For 4-bit quantization, per-input-channel grouping and adaptive or sensitivity-driven allocation (AdaDim, GWQ) are recommended for all current LLMs (Heo et al., 2023, Shao et al., 2024).
  • At 3 or fewer bits, advanced or hybrid approaches (probabilistic, activation- or gradient-aware, low-rank scaling) are critical to avoid accuracy collapse.
  • Hardware deployment favors block-/groupwise quantization (group size 64–128), with dequantization fused to GEMM kernels (Kim et al., 2023, Lin et al., 2023).
  • For resource-constrained edge or batch-1 inference, develop and deploy kernels supporting non-integer or fractional bitwidths per layer for maximal throughput at fixed accuracy (Lee et al., 24 Sep 2025).

Limitations and open issues:

  • The combinatorial explosion of mixed-precision assignments requires advanced AutoML, genetic search, or integer programming (e.g. in AMQ, Q-Palette).
  • Sensitivity analysis is costly for very large models, motivating faster proxies (e.g. activation-independent simulated quantization).
  • Calibration sets remain a bottleneck for ultra-low bitwidths; methods that minimize data requirements (GWQ) are advancing.
  • Hardware support for truly arbitrary per-layer or per-block bitwidths and non-uniform quantization (APoT, VQ, TCQ) still lags, especially on FPGAs and custom ASICs (Przewlocka-Rus et al., 2022, Lee et al., 24 Sep 2025).

Weight-only quantization continues to be a dominant axis for achieving efficient inference and deployment in large-scale neural models. For advanced LLMs, emerging empirical and theoretical results demonstrate that with proper axis grouping, sensitivity control, and cross-layer optimization, weight-only quantization below 4 bits can match or approach full-precision performance, enabling resource-efficient deployment at massive scale.


Key references:

(Lee et al., 15 Sep 2025) (AMQ), (Luo et al., 2024) (DAQ), (Lin et al., 2023) (AWQ), (Shao et al., 2024) (GWQ), (Heo et al., 2023) (AdaDim), (Lee et al., 2024) (LRQ), (Kim et al., 2023) (FineQuant), (Cheng et al., 2023) (SignRound), (Lee et al., 24 Sep 2025) (Q-Palette), (Ye et al., 2018) (ADMM compression), (Przewlocka-Rus et al., 2022) (PoT/APoT), (Subia-Waud et al., 2023) (PWFN), (Zhang et al., 14 Nov 2025) (TaWQ).
