Activation Quantization Techniques
- Activation quantization is the process of discretizing dynamic neural network activations into few-bit representations to reduce memory, compute, and communication costs.
- Mixed-precision and adaptive quantizers allocate bits per channel or tile using techniques like learnable clipping, outlier suppression, and SVD-based decomposition to minimize error at ultra-low precisions.
- Efficient quantization methods deliver significant hardware improvements by enabling faster inference and training in systems such as LLMs and vision transformers with minimal accuracy loss.
Activation quantization is the process of discretizing the intermediate activations of a neural network to few-bit representations, usually for the purposes of reducing memory and compute overhead during training or inference. Activations, in contrast to weights, are dynamically generated at runtime; low-bit activation quantization thus directly impacts memory bandwidth, storage, communication costs in distributed setups, and the compatibility of hardware accelerators with integer arithmetic. In recent years, a diversity of techniques—ranging from fixed- and mixed-precision static quantizers, channel- and layer-wise adaptive schemes, to rotation-based outlier suppression and information bottleneck approaches—have been introduced to constrain quantization error at ultra-low precisions (3–8 bits), particularly for LLMs, vision transformers, and compute-in-memory (CIM) hardware.
1. Theoretical Foundations and Quantization Schemes
Uniform quantization forms the baseline for most activation quantization pipelines. For a real-valued activation vector $x \in \mathbb{R}^n$, a $b$-bit symmetric uniform quantizer is typically defined as
$$Q(x) = \Delta \cdot \mathrm{clip}\!\left(\mathrm{round}(x/\Delta),\; -2^{b-1},\; 2^{b-1}-1\right),$$
where $\Delta = \max_i |x_i| \,/\, (2^{b-1}-1)$ is the step size (Song et al., 7 Oct 2025).
This construction, however, is highly sensitive to outliers: a single large-magnitude entry in $x$ expands $\Delta$, causing quantization error to be concentrated on the bulk of the distribution. Consequently, myriad methods address activation quantization error by outlier suppression, non-uniform quantizer design, or mixed-precision allocation (Yang et al., 2024, Czakó et al., 11 May 2025, Maisonnave et al., 18 Apr 2025).
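The construction above, and its outlier sensitivity, can be sketched in a few lines of NumPy (function and variable names are illustrative, not from any cited method):

```python
import numpy as np

def symmetric_uniform_quantize(x, bits):
    """b-bit symmetric uniform quantizer: Q(x) = delta * clip(round(x/delta), ...)."""
    qmax = 2 ** (bits - 1) - 1
    delta = np.max(np.abs(x)) / qmax               # step size set by the largest magnitude
    q = np.clip(np.round(x / delta), -qmax - 1, qmax)
    return q * delta

rng = np.random.default_rng(0)
x = rng.normal(size=1024)                          # well-behaved bulk
x_out = x.copy()
x_out[0] = 100.0                                   # a single outlier inflates delta

err_clean = np.mean((x - symmetric_uniform_quantize(x, 4)) ** 2)
err_dirty = np.mean((x_out - symmetric_uniform_quantize(x_out, 4)) ** 2)
# with the outlier present, most bulk values round to zero and the MSE explodes
print(err_dirty / err_clean)
```

With a 4-bit budget, the single outlier stretches the step size by roughly $100/\max|x|$, so nearly all bulk entries collapse to the zero level.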
Alternative approaches include learnable clipping parameters (PACT (Choi et al., 2018)), stochastic quantization or information-bottleneck-inspired stochastic coding (Zhou et al., 2020), and sophisticated data-driven quantizer designs such as dINT for underflow control (Lee et al., 2023).
Mixed-precision and Adaptive Quantizers
Methods like Adaptive Mixed-bit Activation Quantization (AMAQ) learn per-channel or per-tile bitwidths, subject to a total bit budget. Suppose $b_{l,c}$ is the bitwidth for channel $c$ in layer $l$; one then regularizes towards a target mean $\bar{b}$ with a weighted L1 penalty:
$$\mathcal{L}_{\mathrm{bit}} = \lambda \sum_{l,c} w_{l,c} \left| b_{l,c} - \bar{b} \right|,$$
where $w_{l,c}$ reflects feature- or gradient-based channel importance (Song et al., 7 Oct 2025).
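A minimal sketch of such a bit-budget regularizer, assuming hypothetical per-channel bitwidth and importance arrays (the exact AMAQ formulation may differ in detail):

```python
import numpy as np

def bit_budget_penalty(bitwidths, importance, target_mean, lam=1.0):
    """Weighted L1 penalty pulling per-channel bitwidths toward a target mean.

    `bitwidths` and `importance` are hypothetical per-channel arrays; during
    training the (continuous) bitwidths would be learned and this penalty
    added to the task loss.
    """
    return lam * np.sum(importance * np.abs(bitwidths - target_mean))

b = np.array([2.0, 4.0, 8.0, 4.0])   # learned per-channel bitwidths
w = np.array([0.1, 0.2, 0.6, 0.1])   # e.g. variance- or gradient-based importance
print(bit_budget_penalty(b, w, target_mean=4.0))
```

Channels at the target mean contribute nothing; the penalty grows with importance-weighted deviation, so high-importance channels are allowed to drift from the budget only at a cost.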
Bit allocation can be further refined by entropy-guided or outlier-aware metrics (He et al., 2 Jun 2025), while token- or window-based importance assignments are used for structured models such as Swin Transformers (Wang et al., 25 Jul 2025) or pipeline-parallel LLMs (He et al., 2 Jun 2025).
2. Outlier Suppression and Activation Distributions
Outliers in activations, stemming from distributional heavy tails, channel- or token-specific spikes, or systematic architectural features (e.g., GLU FFNs), are the leading source of catastrophic quantization error at low bitwidths (Nrusimha et al., 2024, Yang et al., 2024, Czakó et al., 11 May 2025). These include:
- Systematic channel outliers: Channels with consistently abnormally large values, often due to training dynamics or residual connections (Nrusimha et al., 2024, Czakó et al., 11 May 2025).
- Token-specific spikes: Single tokens, such as BOS or certain punctuation, induce extreme activations for only a few modules/layers/tokens (Yang et al., 2024).
Suppression strategies comprise:
- Activation clamping: Clipping activations above a learnable threshold before quantization, as in PACT (Choi et al., 2018), outlier-clamp (Nie et al., 2022), or QAT (Nrusimha et al., 2024).
- Rotation-based transforms: Orthogonal transformations (random, Hadamard, DWT, SVD) spread outlier energy across the space, reducing the quantizer’s maximum. Hadamard and DWT achieve the optimal $\sqrt{n}$ reduction of a single outlier’s magnitude for an $n$-dimensional vector (Maisonnave et al., 18 Apr 2025, Federici et al., 30 Oct 2025, Czakó et al., 11 May 2025).
- Prefixing or module isolation: For LLMs, inserting fixed KV prefixes (CushionCache (Son et al., 2024), QFeP (Yang et al., 2024)) absorbs attention sinks; module-wise exceptions (QFeM) exclude only spike-dominated modules from quantization.
- Statistical and structure-aware codebooks: Huffman-coded shifting errors as in DQA enable lossless or near-lossless ultra-low bit coding for important channels (Hu et al., 2024).
- Noise-based methods: Additive noise (NoisyQuant) intentionally smooths activation distributions to minimize expected error under quantization (Liu et al., 2022).
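The rotation-based strategy above can be illustrated with a Sylvester-construction Hadamard transform; this is a generic sketch of the mechanism, not any cited method's kernel:

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix for n a power of two (Sylvester construction).
    The result is symmetric and orthogonal, so it is its own inverse."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

n = 256
x = np.random.default_rng(1).normal(size=n)
x[0] = 50.0                    # a single channel outlier
H = hadamard(n)
y = H @ x                      # orthogonal rotation preserves the norm ...
# ... but spreads the outlier's energy: the quantizer range max|y| is far
# smaller than max|x|, so the uniform step size shrinks accordingly.
print(np.max(np.abs(x)), np.max(np.abs(y)))
```

After quantizing `y`, applying `H` again (its own inverse here) recovers the original coordinate system, which is why such transforms can be fused into adjacent linear layers.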
3. Advanced Quantization Algorithms: Adaptive, Hybrid, and Information-Theoretic Schemes
Channel-, Token-, and Window-Adaptive Bitwidths
Mixed-precision assignment, where bits are allocated based on local importance, entropy, or sparsity, is now standard practice for collaborative, distributed, and edge-NN setups (Song et al., 7 Oct 2025, He et al., 2 Jun 2025, Wang et al., 25 Jul 2025).
- Entropy-guided allocation: Assigns more bits to activation tiles/tokens with higher entropy, based on the spread or expected contribution to compute (He et al., 2 Jun 2025).
- Feature or variance-weighted bit assignment: Channel importance weights can be defined as activation variance or gradient magnitude, yielding near-monotonic gains in task accuracy for the same average bitwidth (Song et al., 7 Oct 2025).
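A toy version of entropy-guided allocation, assuming empirical histogram entropy as the importance metric (the cited methods use their own metrics and budgeting rules):

```python
import numpy as np

def entropy_guided_bits(tiles, total_bits, b_min=2, b_max=8, n_bins=32):
    """Allocate per-tile bitwidths proportional to empirical activation entropy.

    Illustrative only: computes a histogram entropy per tile and hands each
    tile a share of the total budget, clipped to [b_min, b_max].
    """
    ents = []
    for t in tiles:
        hist, _ = np.histogram(t, bins=n_bins)
        p = hist / hist.sum()
        p = p[p > 0]
        ents.append(-np.sum(p * np.log2(p)))
    ents = np.array(ents)
    raw = ents / ents.sum() * total_bits       # proportional share of the budget
    return np.clip(np.round(raw), b_min, b_max).astype(int)

rng = np.random.default_rng(2)
tiles = [rng.normal(size=512),                 # spread-out tile: many bits
         rng.normal(scale=5.0, size=512),      # also high entropy
         np.full(512, 0.1)]                    # constant tile: minimum bits
print(entropy_guided_bits(tiles, total_bits=12))
```

The near-constant tile carries almost no information and is pinned to the minimum bitwidth, freeing budget for the high-entropy tiles.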
Outlier-Aware Decomposition
QUAD (Quantization with Activation Decomposition) uses SVD over a calibration set to construct a lifting transform that isolates outlier singular vectors into a full-precision subspace while quantizing the remaining components at low bitwidth (Hu et al., 25 Mar 2025). This achieves 94–96% of full-precision accuracy under W4A4 quantization, and up to 98% when adding low-dimensional adapters.
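The underlying decomposition idea can be sketched with a plain SVD projection; this is a simplified illustration, not QUAD's actual calibration pipeline or transform:

```python
import numpy as np

def svd_outlier_split(X, k):
    """Split activations into a rank-k 'outlier' subspace (kept full precision)
    and a residual intended for low-bit quantization. Sketch of the general
    decomposition idea; QUAD's lifting transform differs in detail."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k].T                   # top-k right singular vectors (from calibration)
    X_fp = X @ P @ P.T             # projection onto outlier subspace: FP path
    X_res = X - X_fp               # residual: low-bit integer path
    return X_fp, X_res

rng = np.random.default_rng(3)
X = rng.normal(size=(64, 32))
X[:, 0] *= 30.0                    # one heavy channel dominates the spectrum
X_fp, X_res = svd_outlier_split(X, k=1)
# the residual has a far smaller dynamic range, so 4-bit quantization is benign
print(np.max(np.abs(X)), np.max(np.abs(X_res)))
```

Because the split is exact (`X == X_fp + X_res`), accuracy loss comes only from quantizing the well-conditioned residual.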
Bitwise Information Bottleneck
BIB schemes formulate the optimal selection of quantization bits by directly minimizing rate–distortion tradeoffs per layer, with sparsity-inducing penalties to select informative bits (Zhou et al., 2020). This approach adapts the code-rate to the intrinsic information content of each layer.
4. Empirical Impact and Hardware Considerations
Generation and Classification Accuracy
Recent activation quantization methods, including AMAQ, QFeM/QFeP, QUAD, STaMP, and DQA, consistently recover most of the task accuracy lost under uniform fixed-precision quantization:
- AMAQ yields up to 2.5% higher generation accuracy and 1.3% better classification accuracy for modern LLMs under matched bit-budgets relative to fixed-precision QAT (Song et al., 7 Oct 2025).
- QFeM/QFeP and CushionCache close near the entire perplexity and accuracy gap induced by INT8 baseline quantization for GLU and causal LLMs (Yang et al., 2024, Son et al., 2024).
- STaMP sequence transforms, when combined with mixed-precision per-token quantization, yield >1 dB SQNR improvement and restore baseline perplexity for both LLM and LVM blocks under 4-bit quantization (Federici et al., 30 Oct 2025).
- DQA achieves up to 29% accuracy gains over direct quantization, equalling or surpassing prior art such as NoisyQuant for <6-bit activation coding on both classification and segmentation (Hu et al., 2024).
Computational and Memory Efficiency
Efficient quantization directly benefits hardware by minimizing data bandwidth, storage, and arithmetic complexity:
- BWMA analysis reveals 4-bit activation quantization as the sweet spot on compute-in-memory accelerators, achieving near-floating-point accuracy with only a 15% hardware penalty relative to 3 bits (Zhou et al., 29 Aug 2025).
- On-device speedups for integer-multiplication hardware reach 2.5× on edge CPUs for LLMs under 4-bit quantization with proper activation-aware pruning (Agile-Quant) (Shen et al., 2023).
- ActNN demonstrates that even compressed 2-bit stochastic quantization of activations during training reduces activation memory by 12× and allows 6–14× larger batch sizes with <0.5% accuracy loss (Chen et al., 2021).
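Unbiased stochastic rounding is the core primitive behind such compressed-activation training; a minimal sketch (omitting the per-group scaling and bit-packing a real implementation like ActNN uses):

```python
import numpy as np

def stochastic_quantize(x, bits, rng):
    """Unbiased stochastic rounding to 2^bits levels: each value rounds up with
    probability equal to its fractional position, so E[Q(x)] = x exactly."""
    qmax = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax
    u = (x - lo) / scale                               # map to [0, qmax]
    floor = np.floor(u)
    q = floor + (rng.random(x.shape) < (u - floor))    # round up w.p. frac(u)
    return q * scale + lo

rng = np.random.default_rng(4)
x = rng.normal(size=4096)
# averaging many independent 2-bit quantizations recovers x, confirming the
# estimator is unbiased -- the property that keeps SGD convergent.
est = np.mean([stochastic_quantize(x, 2, rng) for _ in range(2000)], axis=0)
print(np.max(np.abs(est - x)))
```

Unbiasedness means the quantization noise averages out over training steps, which is why activations stored this way can back a gradient estimate despite only 2 bits per value.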
Communication-Aware Collaborative Training
Mixed-precision activation quantization is critical in bandwidth-limited distributed and pipeline-parallel training. AMAQ and TAH-Quant operate with 3–4 bits/activation, achieving speed-ups in pipeline-parallel LLM pretraining and fine-tuning, while maintaining convergence and accuracy comparable to full-precision baselines; metadata and extra bit-distribution cost is negligible (Song et al., 7 Oct 2025, He et al., 2 Jun 2025).
5. Limitations, Trade-offs, and Practical Guidelines
Despite these advances, practical constraints remain:
- Communication overhead: Adaptive/mixed-precision techniques incur minor extra communication (≤9% batch size in AMAQ for distributed learning), which is amortized by gains in accuracy or speed (Song et al., 7 Oct 2025).
- Latency and kernel complexity: Advanced transforms (Hadamard, DWT, SVD) add moderate latency (≤10%), but can be fused for negligible runtime overhead (Federici et al., 30 Oct 2025, Maisonnave et al., 18 Apr 2025, Hu et al., 25 Mar 2025).
- Calibration cost and transferability: Many methods (e.g., QUAD, DQA) require one-time offline calibration, which may mismodel out-of-distribution data. Per-layer or token/channel importance may drift during distribution shift or extensive fine-tuning.
- Limitations of hardware support: Effective INT4 or INT3 matmuls may be unavailable on some accelerator generations; underflow/overflow-resilient coding (dINT, DQA, bit bottleneck) is especially relevant for low-bit deployment (Lee et al., 2023, Hu et al., 2024).
Summary Table: Key Features of Contemporary Activation Quantization Methods
| Method | Outlier Control | Bit Allocation | Hardware Impact | Key Accuracy Result |
|---|---|---|---|---|
| AMAQ (Song et al., 7 Oct 2025) | Feature-wise, regularized | Per-channel, adaptive | 4b, mixed-bit; +9% comms | +2.5% gen, +1.3% cls (LLaMA3, Qwen2.5) |
| QFeM/QFeP (Yang et al., 2024) | Spike isolation | Per-module, prefix | INT8/FP16 fallback | +16ppt zero-shot acc. (LLaMA2-13B, W8A8) |
| CushionCache (Son et al., 2024) | Learned prefix sink | Static, per-tensor | 0 overhead, static W8A8 | PPL: 9759→7.4 (LLaMA3-8B), +31.99ppt accuracy |
| QUAD (Hu et al., 25 Mar 2025) | SVD outlier split | Bulk 4b, residual FP16 | 65–70% INT4, rest FP16 | 94–96% W4A4, 98% with PEFT |
| DQA (Hu et al., 2024) | Imp. channel shifting/Huffman | Per-channel, mask | 3–5b (INT), fast decode | +29.3% (ResNet-32, 3b), matches NoisyQuant |
| STaMP (Federici et al., 30 Oct 2025) | Sequence DWT, energy comp. | Mixed-precision, token | Pure integer, no retrain | +1–1.5 dB SQNR, recovers PPL baseline |
| BWMA (Zhou et al., 29 Aug 2025) | Closed-form error opt. | 4b acts, bin weights | CIM optimal: 4b acts | +5.46% CIFAR, +5.37% ImNet over baselines |
6. Methodological Trends and Future Directions
Recent research trends emphasize:
- Ultra-low bit quantization (<4b): Techniques robust to bit underflow, denormal encoding (dINT), shifting/Huffman coding, and binary-activation architectures are enabling deep quantization without catastrophic accuracy loss (Lee et al., 2023, Hu et al., 2024, Song et al., 7 Apr 2025).
- Outlier-adaptive and hybrid allocation: Universal frameworks for dynamic per-channel/adaptive per-window assignment deliver robustness in the face of wide activation heterogeneity (Wang et al., 25 Jul 2025, He et al., 2 Jun 2025).
- Non-uniform and learnable quantizers: Information bottleneck and feedback adjustment of per-bit scaling offer principled, theoretically-grounded approaches to rate–distortion trade-off optimization (Zhou et al., 2020, Song et al., 7 Apr 2025).
- Joint weight/activation quantization and fine-tuning: Fine-tuning of small full-precision or adapter subspaces while maintaining aggressive activation quantization enables parameter-efficient downstream adaptation (Hu et al., 25 Mar 2025).
Broader challenges include modeling activation distribution shift under long-range autoregressive generation, scaling calibration to foundation models, and extending efficient quantization beyond transformers to other architectures and tasks.
References
- Adaptive mixed-bit quantization: (Song et al., 7 Oct 2025)
- Outlier spike isolation in GLU-FFN LLMs: (Yang et al., 2024)
- Sequence and prefix sink activation regularization: (Son et al., 2024)
- SVD-based outlier decomposition: (Hu et al., 25 Mar 2025)
- Dynamic token/tile adaptive quantization: (He et al., 2 Jun 2025)
- Information bottleneck for activations: (Zhou et al., 2020)
- Activation regularization (QAT + kurtosis): (Nrusimha et al., 2024)
- Stochastic quantization for compressed training: (Chen et al., 2021)
- Sub-6bit shifting/Huffman scheme: (Hu et al., 2024)
- Sequence DWT mixed-precision: (Federici et al., 30 Oct 2025)
- Hardware implications (BWMA): (Zhou et al., 29 Aug 2025)
- Hadamard DWT rotation theory: (Maisonnave et al., 18 Apr 2025)
- Clipping-based quantization (PACT): (Choi et al., 2018)
- Noisy bias for post-training quantization: (Liu et al., 2022)
- Window-based mixed-precision (MixA-Q): (Wang et al., 25 Jul 2025)