Additive Quantization for Language Models (AQLM)

  • The paper presents AQLM, which generalizes classic additive quantization to LLMs, achieving Pareto-optimal trade-offs at 2–3 bits per parameter.
  • It employs input-adaptive code assignment and joint block-wise optimization via an EM-style process to minimize output distortion during calibration.
  • The method enables up to 8× model size reduction with notable speedups on both GPU and CPU, facilitating efficient on-device inference.

Additive Quantization for Language Models (AQLM) is a post-training compression technique for extreme quantization of transformer-based LLMs. AQLM generalizes the classic Additive Quantization (AQ) approach, traditionally used for vector compression in information retrieval, to quantize LLM weight matrices at exceptionally low bit widths, specifically targeting the 2 to 3 bits-per-parameter regime. By integrating input-adaptive code assignment and joint block-wise optimization of quantization parameters, AQLM achieves Pareto-optimal trade-offs between accuracy and model size, making it practical for deployment on resource-constrained devices (Egiazarian et al., 2024).

1. Formal Problem Statement

AQLM addresses the problem of compressing pretrained transformer LLMs by replacing the floating-point weight matrices $W \in \mathbb{R}^{d_{out} \times d_{in}}$ with quantized approximations $\hat{W}$ using only $B$ bits per parameter, with the principal focus on $B \approx 2\dots3$. This compression yields up to an $8\times$ reduction in model size compared to FP16 baselines.

Classic AQ encodes groups of model weights as sums of $M$ codebook vectors (centroids) chosen from learned codebooks $\{C^{(m)}\}_{m=1}^M$, with assignments governed by one-hot vectors $b^{(m)}$. Row $w$ is approximated as $w \approx \sum_{m=1}^M C^{(m)} b^{(m)}$, and the total bit cost is determined by codebook size and group granularity. The AQ layer-level reconstruction objective is:

$$E_Q(C, b) = \sum_{i=1}^{d_{out}} \Big\| w_i - \sum_{m=1}^M C^{(m)} b_i^{(m)} \Big\|_2^2.$$

AQLM reframes the objective to preserve layer outputs on a calibration set:

$$\| W X - \hat{W} X \|_F^2 = \Big\| \Big( W - \sum_{m=1}^M C^{(m)} b^{(m)} \Big) X \Big\|_F^2,$$

where $X$ is a matrix of calibration inputs.
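
To make the notation concrete, the following NumPy sketch implements the additive decoding $w \approx \sum_{m=1}^M C^{(m)} b^{(m)}$ (with the one-hot products realized as table lookups) and evaluates both the weight-level AQ error and AQLM's output-matching objective. All dimensions, the group layout, and the random codes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Minimal sketch of the additive-quantization forward model and AQLM's
# output-matching objective. Shapes and sizes are toy assumptions.

d_out, d_in, g = 4, 16, 8            # toy layer dimensions; d_in divisible by g
M, K = 2, 256                        # M additive codebooks, K centroids each
n_groups = d_in // g

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))   # original floating-point weights
C = rng.normal(size=(M, K, g))       # learned codebooks C^(m)
codes = rng.integers(0, K, size=(d_out, n_groups, M))  # code assignments b

def dequantize(C, codes):
    """Each weight group is reconstructed as the sum of its M chosen centroids."""
    d_out, n_groups, M = codes.shape
    groups = np.zeros((d_out, n_groups, C.shape[-1]))
    for m in range(M):
        groups += C[m][codes[:, :, m]]           # table lookup, then accumulate
    return groups.reshape(d_out, -1)

W_hat = dequantize(C, codes)

# Weight-level AQ error vs. AQLM's calibration objective ||WX - W_hat X||_F^2,
# which measures distortion of the layer's *outputs* on calibration inputs X.
X = rng.normal(size=(d_in, 128))
weight_err = np.linalg.norm(W - W_hat) ** 2
output_err = np.linalg.norm(W @ X - W_hat @ X) ** 2
```

The contrast between `weight_err` and `output_err` is exactly the point of the reframing: AQLM selects codes and codebooks to shrink the latter.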

2. Algorithmic Innovations

AQLM advances AQ via two central mechanisms:

  • Input-adaptive quantization: Code assignments $b$ are data-aware, chosen to minimize output distortion for a specific set of calibration inputs $X$ rather than purely weight-level reconstruction error.
  • Joint block-wise codebook optimization: Quantization errors from multiple linear layers in a transformer block are addressed collectively by fine-tuning the codebooks, scaling parameters $s$, and remaining small parameters $\theta$ to minimize output mismatch at the block level.

The loss for block-level optimization is:

$$L_{block} = \| F_{block}(X) - \hat{F}_{block}(X; C, b, s) \|_F^2$$

and for the full model,

$$L_{AQLM} = \sum_{\ell} \big\| F^{(\ell)}(X^{(\ell)}) - \hat{F}^{(\ell)}(X^{(\ell)}; C^{(\ell)}, b^{(\ell)}, s^{(\ell)}) \big\|_F^2 + \lambda \sum_{\ell} \Big\| W^{(\ell)} - \sum_{m=1}^M C^{(\ell,m)} b^{(\ell,m)} \Big\|_F^2,$$

where $\lambda$ is typically small or zero.

Optimization proceeds via an EM-style process:

  • E-step: Updates code assignments $b$ via beam search in a Markov Random Field (MRF) formulation, leveraging precomputed Gram matrices (a simplified sketch follows this list).
  • M-step: Refines the codebooks $C$, scales $s$, and small non-quantized parameters $\theta$ using the Adam optimizer.
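
To illustrate the alternating structure, here is a simplified E-step sketch that greedily reassigns the codes of one codebook at a time against plain weight-reconstruction error. This deliberately substitutes greedy coordinate descent for the paper's beam search over the output-aware MRF objective (which scores candidates via the Gram matrix $G = XX^\top$); all shapes are assumptions.

```python
import numpy as np

# Greedy stand-in for AQLM's E-step: update the codes of one codebook at a
# time while the others stay fixed. The real E-step is a beam search that
# scores candidates by output distortion using the precomputed Gram matrix.

def e_step_greedy(W_groups, C, codes, n_sweeps=2):
    """W_groups: (n, g) weight groups; C: (M, K, g) codebooks; codes: (n, M)."""
    M = C.shape[0]
    for _ in range(n_sweeps):
        for m in range(M):
            # Residual after subtracting the other codebooks' contributions.
            others = sum(C[j][codes[:, j]] for j in range(M) if j != m)
            residual = W_groups - others                        # (n, g)
            # Assign every group to the nearest centroid of codebook m.
            d2 = ((residual[:, None, :] - C[m][None, :, :]) ** 2).sum(-1)
            codes[:, m] = d2.argmin(axis=1)                     # (n,)
    return codes

# Tiny usage example with random data (shapes are assumptions).
rng = np.random.default_rng(0)
n, g, M, K = 64, 8, 2, 16
W_groups = rng.normal(size=(n, g))
C = rng.normal(size=(M, K, g))
codes = rng.integers(0, K, size=(n, M))
codes = e_step_greedy(W_groups, C, codes)
```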

The per-block calibration procedure is summarized below:

| Step | Description |
| --- | --- |
| Codebook initialization | Residual K-means on weight matrix rows |
| Gram matrix | Precompute $G = X X^\top$ |
| E-step | Beam-search code assignment per output unit/group |
| M-step | Adam updates on $C$, $s$, $\theta$ to minimize block loss |
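
The M-step row can be sketched in PyTorch as follows: with the code assignments frozen, Adam tunes the codebooks $C$, per-group scales $s$, and a small residual parameter standing in for $\theta$ against the block output mismatch $L_{block}$. A single linear map plays the role of the transformer block; all shapes and hyperparameters are illustrative assumptions.

```python
import torch

# M-step sketch: codes b are frozen; Adam tunes codebooks C, scales s, and a
# small non-quantized parameter (a bias standing in for theta) against the
# block output mismatch. A single linear map plays the role of the block.

d, g, M, K = 16, 8, 2, 64
n_groups = d // g
torch.manual_seed(0)

X = torch.randn(256, d)                        # calibration activations
W = torch.randn(d, d)                          # original "block" weight
codes = torch.randint(0, K, (d, n_groups, M))  # fixed by the preceding E-step

C = torch.randn(M, K, g, requires_grad=True)         # codebooks
s = torch.ones(d, n_groups, 1, requires_grad=True)   # per-group scales
bias = torch.zeros(d, requires_grad=True)            # residual parameters

opt = torch.optim.Adam([C, s, bias], lr=1e-2)
for step in range(200):
    groups = sum(C[m][codes[:, :, m]] for m in range(M))   # (d, n_groups, g)
    W_hat = (s * groups).reshape(d, d)                     # scaled reconstruction
    loss = ((X @ W.T - (X @ W_hat.T + bias)) ** 2).mean()  # block output mismatch
    opt.zero_grad()
    loss.backward()
    opt.step()
```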

3. Theoretical Analysis

  • Reconstruction Bounds: For $M$ codebooks of size $K$, with assignments minimizing MSE,

$$\mathbb{E}\Big[\big\|w - \sum_m C^{(m)} b^{(m)}\big\|_2^2\Big] \leq c(M,K)\, \mathbb{E}\big[\|w - w'\|_2^2\big],$$

where $w'$ is the closest of $MK$ prototypes and $c(M,K) \to 0$ as $M, K$ grow (cf. Babenko & Lempitsky, 2014). Empirically, sub-percent layer-level error arises with $B = 2\dots3$, $M \approx 2$.

  • Pareto Frontier: For LLaMA 2-7B on WikiText2:
    • FP16 baseline: 5.12 PPL @ 16 bits/param
    • QuIP# (2-bit): 8.22 PPL @ 2.02 bits/param
    • AQLM (2-bit): 6.64 PPL @ 2.02 bits/param

AQLM is strictly Pareto-optimal relative to all prior 2-bit methods in the perplexity-versus-model-size trade-off, and it outperforms some higher-bit baselines (e.g., 4-bit GPTQ) on smaller models.

4. Implementation and Empirical Performance

AQLM supports high-throughput inference on both GPU and CPU via codebook lookup tables:

  • GPU kernel: Precomputes $M \times K$ lookup tables for each group, requiring $O(M)$ additions per group; achieves $\sim 1.2\times$ FP16 speed on an RTX 3090 (LLaMA 2-70B).
  • CPU kernel: Splits each 16-bit codebook into 8-bit sub-codebooks so the lookup tables reside in L1/L2 cache, yielding up to a $4\times$ FP32 speedup on a 16-core Intel i9 (see the layout sketch below).
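
The cache-friendly CPU layout can be illustrated with a toy decode. This is an assumption-laden sketch of the data layout only; the real kernels are hand-optimized native code.

```python
import numpy as np

# Toy illustration of the "2x8-bit" layout: two 256-entry sub-codebooks are
# summed instead of indexing one 65,536-entry codebook, keeping every lookup
# table small enough to stay cache-resident. All shapes are assumptions.

g = 8                                                  # weights per group
rng = np.random.default_rng(0)
sub = rng.normal(size=(2, 256, g)).astype(np.float32)  # two 8-bit sub-codebooks
codes = rng.integers(0, 256, size=(1024, 2))           # one byte per sub-codebook

# Decoding a group costs two small-table lookups plus one vector add.
decoded = sub[0][codes[:, 0]] + sub[1][codes[:, 1]]    # (1024, g)
```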

Model footprint is reduced by $8\times$ at 2 bits/parameter relative to FP16, while inference speed remains competitive on GPU and improves substantially on CPU.

Summary of token generation rates:

| Device | FP16 | AQLM (2-bit) | AQLM (2×8-bit) |
| --- | --- | --- | --- |
| RTX 3090, LLaMA-2 7B | 41.5 tok/s | 32.2 tok/s | 32.6 tok/s |
| Intel i9, LLaMA-2 7B | 3.1 tok/s | 7.0 tok/s | 6.8 tok/s |

5. Compression–Accuracy Trade-Offs and Calibration

Several operational variables modulate AQLM’s effectiveness:

  • Calibration set size: Gains saturate around 2,000 calibration sequences (useful range: 512–4,096).
  • Number of codebooks $M$, bits $B$, group size $g$: Increasing $M$ improves accuracy at a fixed $B \cdot M / g$ bit budget, but incurs higher E-step computational cost (a worked example follows this list).
  • Block-wise fine-tuning: 100–300 Adam steps increase calibration time by 10–30% while securing a 5–10% PPL reduction.
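
As a worked example of this budget (assuming $B$ here denotes the bits per code, i.e., $\log_2 K$, and ignoring the small overhead of scales), both configurations from the throughput table above land at 2 bits per parameter:

$$\frac{B \cdot M}{g} = \frac{16 \cdot 1}{8} = 2 \quad \text{(one 16-bit codebook, } g = 8\text{)}, \qquad \frac{8 \cdot 2}{8} = 2 \quad \text{(two 8-bit codebooks)}.$$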

AQLM thus generalizes well with modest one-shot calibration effort, especially compared to direct PTQ methods.

6. Limitations and Future Prospects

AQLM is the first post-training quantization scheme to reach Pareto optimality below 3 bits/parameter on open LLMs, yielding state-of-the-art perplexity and zero-shot accuracy in the extreme quantization regime. Noted limitations:

  • Calibration cost (beam-search E-step) exceeds direct PTQ approaches (e.g., RTN/GPTQ), but remains practical for one-shot application.
  • Homogeneous codebooks: Current versions employ fixed codebook architectures; incorporating sparsity or non-uniform (layer-dependent) bit allocation could permit further gains.
  • Activation quantization: Extending AQLM to quantized activation flows (quantization-aware inference) could push bit-efficiency below 2 bits.

These results indicate the scalable adaptation of multi-codebook quantization for extreme LLM compression, facilitating efficient on-device inference at high fidelity (Egiazarian et al., 2024).

References

  • Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv:2401.06118.
