Additive Quantization for Language Models (AQLM)

  • The paper presents AQLM, which generalizes classic additive quantization to LLMs, achieving Pareto-optimal trade-offs at 2–3 bits per parameter.
  • It employs input-adaptive code assignment and joint block-wise optimization via an EM-style process to minimize output distortion during calibration.
  • The method enables up to 8× model size reduction with notable speedups on both GPU and CPU, facilitating efficient on-device inference.

Additive Quantization for Language Models (AQLM) is a post-training compression technique for extreme quantization of transformer-based LLMs. AQLM generalizes the classic Additive Quantization (AQ) approach, traditionally used for vector compression in information retrieval, to quantize LLM weight matrices at exceptionally low bit widths, specifically targeting the 2 to 3 bits-per-parameter regime. By integrating input-adaptive code assignment and joint block-wise optimization of quantization parameters, AQLM achieves Pareto-optimal trade-offs between accuracy and model size, making it practical for deployment on resource-constrained devices (Egiazarian et al., 2024).

1. Formal Problem Statement

AQLM addresses the problem of compressing pretrained transformer LLMs by replacing the floating-point weight matrices $W \in \mathbb{R}^{d_{out} \times d_{in}}$ with quantized approximations $\hat{W}$ using only $B$ bits per parameter, with the principal focus on $B \approx 2\dots3$. This compression yields up to an $8\times$ reduction in model size compared to FP16 baselines.

Classic AQ encodes groups of model weights as sums of $M$ codebook vectors (centroids) chosen from learned codebooks $\{C^{(m)}\}_{m=1}^M$, with assignments governed by one-hot vectors $b^{(m)}$. Row $w$ is approximated as $w \approx \sum_{m=1}^M C^{(m)} b^{(m)}$, and the total bit cost is determined by codebook size and group granularity. The AQ layer-level reconstruction objective is:

$$E_Q(C, b) = \sum_{i=1}^{d_{out}} \Big\| w_i - \sum_{m=1}^M C^{(m)} b_i^{(m)} \Big\|_2^2.$$

AQLM reframes the objective to preserve layer outputs on a calibration set:

$$\| W X - \hat{W} X \|_F^2 = \Big\| \Big( W - \sum_{m=1}^M C^{(m)} b^{(m)} \Big) X \Big\|_F^2,$$

where $X$ is a matrix of calibration inputs.
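
To make the notation concrete, the following NumPy sketch implements the additive decoding $w \approx \sum_{m=1}^M C^{(m)} b^{(m)}$ (with the one-hot products realized as table lookups) and evaluates both the weight-level AQ error and AQLM's output-matching objective. All dimensions, the group layout, and the random codes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Minimal sketch of the additive-quantization forward model and AQLM's
# output-matching objective. Shapes and sizes are toy assumptions.

d_out, d_in, g = 4, 16, 8            # toy layer dimensions; d_in divisible by g
M, K = 2, 256                        # M additive codebooks, K centroids each
n_groups = d_in // g

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))   # original floating-point weights
C = rng.normal(size=(M, K, g))       # learned codebooks C^(m)
codes = rng.integers(0, K, size=(d_out, n_groups, M))  # code assignments b

def dequantize(C, codes):
    """Each weight group is reconstructed as the sum of its M chosen centroids."""
    d_out, n_groups, M = codes.shape
    groups = np.zeros((d_out, n_groups, C.shape[-1]))
    for m in range(M):
        groups += C[m][codes[:, :, m]]           # table lookup, then accumulate
    return groups.reshape(d_out, -1)

W_hat = dequantize(C, codes)

# Weight-level AQ error vs. AQLM's calibration objective ||WX - W_hat X||_F^2,
# which measures distortion of the layer's *outputs* on calibration inputs X.
X = rng.normal(size=(d_in, 128))
weight_err = np.linalg.norm(W - W_hat) ** 2
output_err = np.linalg.norm(W @ X - W_hat @ X) ** 2
```

The contrast between `weight_err` and `output_err` is exactly the point of the reframing: AQLM selects codes and codebooks to shrink the latter.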

2. Algorithmic Innovations

AQLM advances AQ via two central mechanisms:

  • Input-adaptive quantization: Code assignments $b$ are data-aware, chosen to minimize output distortion for a specific set of calibration inputs $X$ rather than purely weight-level reconstruction error.
  • Joint block-wise codebook optimization: Quantization errors from multiple linear layers in a transformer block are addressed collectively by fine-tuning the codebooks, scaling parameters $s$, and remaining small parameters $\theta$ to minimize output mismatch at the block level.

The loss for block-level optimization is:

$$L_{block} = \| F_{block}(X) - \hat{F}_{block}(X; C, b, s) \|_F^2$$

and for the full model,

$$L_{AQLM} = \sum_{\ell} \big\| F^{(\ell)}(X^{(\ell)}) - \hat{F}^{(\ell)}(X^{(\ell)}; C^{(\ell)}, b^{(\ell)}, s^{(\ell)}) \big\|_F^2 + \lambda \sum_{\ell} \Big\| W^{(\ell)} - \sum_{m=1}^M C^{(\ell,m)} b^{(\ell,m)} \Big\|_F^2,$$

where $\lambda$ is typically small or zero.

Optimization proceeds via an EM-style process:

  • E-step: Updates code assignments $b$ via beam search in a Markov Random Field (MRF) formulation, leveraging precomputed Gram matrices (a simplified sketch follows this list).
  • M-step: Refines the codebooks $C$, scales $s$, and small non-quantized parameters $\theta$ using the Adam optimizer.
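
To illustrate the alternating structure, here is a simplified E-step sketch that greedily reassigns the codes of one codebook at a time against plain weight-reconstruction error. This deliberately substitutes greedy coordinate descent for the paper's beam search over the output-aware MRF objective (which scores candidates via the Gram matrix $G = XX^\top$); all shapes are assumptions.

```python
import numpy as np

# Greedy stand-in for AQLM's E-step: update the codes of one codebook at a
# time while the others stay fixed. The real E-step is a beam search that
# scores candidates by output distortion using the precomputed Gram matrix.

def e_step_greedy(W_groups, C, codes, n_sweeps=2):
    """W_groups: (n, g) weight groups; C: (M, K, g) codebooks; codes: (n, M)."""
    M = C.shape[0]
    for _ in range(n_sweeps):
        for m in range(M):
            # Residual after subtracting the other codebooks' contributions.
            others = sum(C[j][codes[:, j]] for j in range(M) if j != m)
            residual = W_groups - others                        # (n, g)
            # Assign every group to the nearest centroid of codebook m.
            d2 = ((residual[:, None, :] - C[m][None, :, :]) ** 2).sum(-1)
            codes[:, m] = d2.argmin(axis=1)                     # (n,)
    return codes

# Tiny usage example with random data (shapes are assumptions).
rng = np.random.default_rng(0)
n, g, M, K = 64, 8, 2, 16
W_groups = rng.normal(size=(n, g))
C = rng.normal(size=(M, K, g))
codes = rng.integers(0, K, size=(n, M))
codes = e_step_greedy(W_groups, C, codes)
```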

The per-block calibration procedure is summarized below:

| Step | Description |
| --- | --- |
| Codebook initialization | Residual K-means on weight matrix rows |
| Gram matrix | Precompute $G = X X^\top$ |
| E-step | Beam-search code assignment per output unit/group |
| M-step | Adam updates on $C$, $s$, $\theta$ to minimize block loss |
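
The M-step row can be sketched in PyTorch as follows: with the code assignments frozen, Adam tunes the codebooks $C$, per-group scales $s$, and a small residual parameter standing in for $\theta$ against the block output mismatch $L_{block}$. A single linear map plays the role of the transformer block; all shapes and hyperparameters are illustrative assumptions.

```python
import torch

# M-step sketch: codes b are frozen; Adam tunes codebooks C, scales s, and a
# small non-quantized parameter (a bias standing in for theta) against the
# block output mismatch. A single linear map plays the role of the block.

d, g, M, K = 16, 8, 2, 64
n_groups = d // g
torch.manual_seed(0)

X = torch.randn(256, d)                        # calibration activations
W = torch.randn(d, d)                          # original "block" weight
codes = torch.randint(0, K, (d, n_groups, M))  # fixed by the preceding E-step

C = torch.randn(M, K, g, requires_grad=True)         # codebooks
s = torch.ones(d, n_groups, 1, requires_grad=True)   # per-group scales
bias = torch.zeros(d, requires_grad=True)            # residual parameters

opt = torch.optim.Adam([C, s, bias], lr=1e-2)
for step in range(200):
    groups = sum(C[m][codes[:, :, m]] for m in range(M))   # (d, n_groups, g)
    W_hat = (s * groups).reshape(d, d)                     # scaled reconstruction
    loss = ((X @ W.T - (X @ W_hat.T + bias)) ** 2).mean()  # block output mismatch
    opt.zero_grad()
    loss.backward()
    opt.step()
```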

3. Theoretical Analysis

  • Reconstruction Bounds: For $M$ codebooks of size $K$, with assignments minimizing MSE,

$$\mathbb{E}\Big[\big\|w - \sum_m C^{(m)} b^{(m)}\big\|_2^2\Big] \leq c(M,K)\, \mathbb{E}\big[\|w - w'\|_2^2\big],$$

where $w'$ is the closest of $MK$ prototypes and $c(M,K) \to 0$ as $M, K$ grow (cf. Babenko & Lempitsky, 2014). Empirically, sub-percent layer-level error arises with $B = 2\dots3$, $M \approx 2$.

  • Pareto Frontier: For LLaMA 2-7B on WikiText2:
    • FP16 baseline: 5.12 PPL @ 16 bits/param
    • QuIP# (2-bit): 8.22 PPL @ 2.02 bits/param
    • AQLM (2-bit): 6.64 PPL @ 2.02 bits/param

AQLM is strictly Pareto-optimal relative to all prior 2-bit methods in the perplexity-versus-model-size trade-off, and it outperforms some higher-bit baselines (e.g., 4-bit GPTQ) on smaller models.

4. Implementation and Empirical Performance

AQLM supports high-throughput inference on both GPU and CPU via codebook lookup tables:

  • GPU kernel: Precomputes $M \times K$ lookup tables for each group, requiring $O(M)$ additions per group; achieves $\sim 1.2\times$ FP16 speed on an RTX 3090 (LLaMA 2-70B).
  • CPU kernel: Splits each 16-bit codebook into 8-bit sub-codebooks so the lookup tables reside in L1/L2 cache, yielding up to a $4\times$ FP32 speedup on a 16-core Intel i9 (see the layout sketch below).
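
The cache-friendly CPU layout can be illustrated with a toy decode. This is an assumption-laden sketch of the data layout only; the real kernels are hand-optimized native code.

```python
import numpy as np

# Toy illustration of the "2x8-bit" layout: two 256-entry sub-codebooks are
# summed instead of indexing one 65,536-entry codebook, keeping every lookup
# table small enough to stay cache-resident. All shapes are assumptions.

g = 8                                                  # weights per group
rng = np.random.default_rng(0)
sub = rng.normal(size=(2, 256, g)).astype(np.float32)  # two 8-bit sub-codebooks
codes = rng.integers(0, 256, size=(1024, 2))           # one byte per sub-codebook

# Decoding a group costs two small-table lookups plus one vector add.
decoded = sub[0][codes[:, 0]] + sub[1][codes[:, 1]]    # (1024, g)
```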

Model footprint is reduced by $8\times$ at 2 bits/parameter relative to FP16, while inference speed remains competitive on GPU and improves substantially on CPU.

Summary of token generation rates:

| Device | FP16 | AQLM (2-bit) | AQLM (2×8-bit) |
| --- | --- | --- | --- |
| RTX 3090, LLaMA-2 7B | 41.5 tok/s | 32.2 tok/s | 32.6 tok/s |
| Intel i9, LLaMA-2 7B | 3.1 tok/s | 7.0 tok/s | 6.8 tok/s |

5. Compression–Accuracy Trade-Offs and Calibration

Several operational variables modulate AQLM’s effectiveness:

  • Calibration set size: Gains saturate around 2,000 calibration sequences (useful range: 512–4,096).
  • Number of codebooks $M$, bits $B$, group size $g$: Increasing $M$ improves accuracy at a fixed $B \cdot M / g$ bit budget, but incurs higher E-step computational cost (a worked example follows this list).
  • Block-wise fine-tuning: 100–300 Adam steps increase calibration time by 10–30% while securing a 5–10% PPL reduction.
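
As a worked example of this budget (assuming $B$ here denotes the bits per code, i.e., $\log_2 K$, and ignoring the small overhead of scales), both configurations from the throughput table above land at 2 bits per parameter:

$$\frac{B \cdot M}{g} = \frac{16 \cdot 1}{8} = 2 \quad \text{(one 16-bit codebook, } g = 8\text{)}, \qquad \frac{8 \cdot 2}{8} = 2 \quad \text{(two 8-bit codebooks)}.$$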

AQLM thus generalizes well with modest one-shot calibration effort, especially compared to direct PTQ methods.

6. Limitations and Future Prospects

AQLM is the first post-training quantization scheme to reach Pareto optimality below 3 bits/parameter on open LLMs, yielding state-of-the-art perplexity and zero-shot accuracy in the extreme quantization regime. Noted limitations:

  • Calibration cost (beam-search E-step) exceeds direct PTQ approaches (e.g., RTN/GPTQ), but remains practical for one-shot application.
  • Homogeneous codebooks: Current versions employ fixed codebook architectures; incorporating sparsity or non-uniform (layer-dependent) bit allocation could permit further gains.
  • Activation quantization: Extending AQLM to quantized activation flows (quantization-aware inference) could push bit-efficiency below 2 bits.

These results indicate the scalable adaptation of multi-codebook quantization for extreme LLM compression, facilitating efficient on-device inference at high fidelity (Egiazarian et al., 2024).

References

  • Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization. arXiv:2401.06118.
