
RaBitQ Quantization for LLM and ANN Compression

Updated 18 January 2026
  • RaBitQ quantization is a randomized method that applies random rotations and unbiased, coordinate-wise rounding to efficiently compress high-dimensional vectors.
  • It extends to post-training quantization of LLM weights via the RaBitQ-H variant, achieving competitive accuracy with improved computational efficiency.
  • The approach delivers rigorous error bounds and data-efficient calibration, significantly enhancing ANN search performance and deep model inference.

RaBitQ quantization is a randomized quantization methodology, originally developed for high-dimensional vector compression in approximate nearest neighbor search (ANNS), which has been extended to efficient and accurate post-training quantization (PTQ) of LLMs. Its foundational principle is the combination of random or structured orthogonal transforms with unbiased, coordinate-wise quantization and theoretically rigorous error control. Recent extensions—such as RaBitQ-H within the RaanA framework—specialize the approach for matrix (weight) quantization in neural networks, yielding competitive accuracy with notable data and computational efficiency (Yang et al., 29 Mar 2025, Gao et al., 2024, Gao et al., 2024).

1. Foundations of RaBitQ Quantization

RaBitQ encodes a $d$-dimensional vector $\mathbf{x} \in \mathbb{R}^d$ into $b$ bits per coordinate, preserving unbiased inner product estimation with quantized codes. The method begins by applying a random orthogonal rotation $P \in \mathbb{R}^{d \times d}$ (typically a Johnson–Lindenstrauss (JL) transform such as a Hadamard transform composed with random sign flips), yielding $\mathbf{u} = P\mathbf{x}$ (Yang et al., 29 Mar 2025, Gao et al., 2024).

Each coordinate $u_i$ is independently quantized to $\bar{u}_i \in \{0, \ldots, 2^b - 1\}$ by randomized rounding onto a uniform quantization grid. The quantized code consists of the vector $\bar{\mathbf{u}}$ and a real-valued scaling factor $t$. This representation enables unbiased estimation of the original inner product $\langle \mathbf{x}, \mathbf{y} \rangle$ by reconstructing quantized values and applying the same transform to queries (Yang et al., 29 Mar 2025).

The original RaBitQ codebook for $b = 1$ is the hypercube $\{\pm 1/\sqrt{d}\}^d$, generalized to uniformly spaced, normalized grids for $b > 1$ (Gao et al., 2024). The quantization leverages structure and randomness to maintain unbiasedness and $O(2^{-b}\sqrt{d})$ error in inner product estimation with high probability.

2. Algorithmic Schemes and Extensions

The RaBitQ scheme proceeds as follows:

  • Code Construction: For each input, perform rotation, coordinate-wise quantization, and store associated rescaling.
  • Decoding and Inference: Given a query $\mathbf{y}$, transform it as $\mathbf{v} = P\mathbf{y}$ (using the same $P$), and compute the unbiased inner-product estimator:

$$\widehat{\langle \mathbf{x}, \mathbf{y} \rangle} = \langle t(\bar{\mathbf{u}} - c_b \mathbf{1}),\, \mathbf{v} \rangle, \qquad c_b = \tfrac{2^b - 1}{2}$$

  • Theoretical Error: The estimator is unbiased, $\mathbb{E}[\widehat{\langle \mathbf{x}, \mathbf{y} \rangle}] = \langle \mathbf{x}, \mathbf{y} \rangle$, and with high probability

$$\left|\widehat{\langle \mathbf{x}, \mathbf{y} \rangle} - \langle \mathbf{x}, \mathbf{y} \rangle\right| \lesssim \|\mathbf{x}\|\,\|\mathbf{y}\|\, 2^{-b} \sqrt{d}$$

Optimal bit allocation achieves $\varepsilon\,\|\mathbf{x}\|\|\mathbf{y}\|$ error with $b = \Theta(\log(\sqrt{d}/\varepsilon))$ for small $\varepsilon$.

Extended RaBitQ (for $b > 1$) constructs normalized grid codebooks, applies a random rotation, quantizes by nearest-neighbor search within the grid, and admits an asymptotically optimal tradeoff between bits used and error achieved, matching the $O\!\left(\frac{2^{-b}}{\sqrt{d}}\right)$ decay observed empirically and proven theoretically (Gao et al., 2024).
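
The encode/estimate steps above can be sketched in Python. This is an illustrative toy implementation of the described scheme, not the authors' code: a dense QR-based rotation stands in for the structured transforms used in practice, and the per-vector scale is chosen to cover the rotated coordinate range symmetrically.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Dense random orthogonal matrix via QR of a Gaussian matrix; real
    # implementations use structured Hadamard-based transforms for speed.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def rabitq_encode(x, P, b):
    """Rotate x, then round each coordinate to b bits with randomized
    rounding, so that E[t * (u_bar - c_b)] equals P @ x exactly."""
    u = P @ x
    c_b = (2**b - 1) / 2
    t = np.abs(u).max() / c_b              # per-vector scale factor
    g = u / t + c_b                        # grid position in [0, 2^b - 1]
    frac = g - np.floor(g)
    up = rng.random(len(g)) < frac         # round up with prob. = frac
    u_bar = np.clip(np.floor(g) + up, 0, 2**b - 1)
    return u_bar.astype(np.int64), t

def rabitq_inner_product(code, y, P, b):
    """Unbiased estimate of <x, y> from x's code and a raw query y."""
    u_bar, t = code
    c_b = (2**b - 1) / 2
    v = P @ y                              # same rotation for the query
    return float(t * (u_bar - c_b) @ v)    # <Px, Py> = <x, y> since P is orthogonal
```

Averaging the estimator over independent encodings converges to the exact inner product, reflecting the unbiasedness property; the single-shot deviation shrinks as $b$ grows.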

3. RaBitQ-H: Adaptation to LLM Weight Quantization

RaBitQ-H is a variant designed for quantizing the weight matrices of LLMs; it replaces the generic random rotation with a Subsampled Randomized Hadamard Transform (RHT), which reduces computational cost. The RHT step samples a diagonal sign matrix $D$ whose entries are independent Rademacher variables and composes it with the Hadamard transform $H_d$, i.e., $W' = \frac{1}{\sqrt{d}} H_d (D W)$ for each weight matrix $W$ (Yang et al., 29 Mar 2025).

Quantization then follows standard RaBitQ: each column of $W'$ is quantized to $b$ bits via randomized rounding, returning $(\widehat{W}', r, D)$. Inference for an input $X$ applies the same transform, $X' = \frac{1}{\sqrt{d}} H_d (D X^\top)^\top$, after which the output is estimated as

$$Y \approx X' \widehat{W}' \operatorname{diag}(r) - z\, r^\top$$

where $z$ is a precomputed bias term depending on the quantization.

Theoretical analysis shows that each coordinate-wise error $\epsilon_{ij}$ has a sub-Gaussian tail, and layer-wise errors propagate as

$$\operatorname{Err}_k(b; x) \lesssim 2^{-b} \sqrt{\frac{\log c}{d}}\; \|J_k\|_F \,\|X\|_F \,\|W\|_F$$

where $J_k$ is the Jacobian of the downstream function, ensuring bounded degradation at a controllable rate (Yang et al., 29 Mar 2025).
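
The Hadamard-based rotation at the heart of RaBitQ-H admits a short $O(d \log d)$ implementation. The following is a minimal sketch, assuming the dimension is a power of two and omitting the padding/subsampling details a production implementation would need:

```python
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform in O(d log d);
    len(a) must be a power of two."""
    a = np.asarray(a, dtype=float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y        # butterfly: sum half
            a[i + h:i + 2 * h] = x - y  # butterfly: difference half
        h *= 2
    return a

def rht(x, signs):
    """Randomized Hadamard transform (1/sqrt(d)) * H_d @ (signs * x);
    orthogonal, so it preserves norms and inner products."""
    return fwht(signs * x) / np.sqrt(len(x))
```

Since $H_d$ is symmetric with $H_d^2 = d\,I$, the transform inverts as $x = \mathrm{signs} \cdot \mathrm{fwht}(y)/\sqrt{d}$, so only the sign vector (or its random seed) needs to be stored alongside the codes.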

4. AllocateBits: Layer-Wise Mixed-Precision Allocation

The AllocateBits module solves the problem of distributing a global bit budget $R$ optimally across $L$ layers to minimize total quantization error. For each layer $k$, an error coefficient

$$\alpha_k \approx \frac{1}{\sqrt{d_k}}\, \|J_k\|_F \,\|X^{(k)}\|_F \,\|W^{(k)}\|_F$$

is computed to model layer-wise quantization sensitivity. The bit-allocation problem is posed as the integer program:

$$\min_{b_1, \ldots, b_L \in \mathscr{B}} \; \sum_{k=1}^{L} \alpha_k\, 2^{-b_k} \quad \text{subject to} \quad \sum_{k=1}^{L} b_k m_k \leq R$$

where $m_k = d_k c_k$ is the parameter count of layer $k$. The solution uses a dynamic programming (DP) approach with a greatest-common-divisor (GCD) reduction of the budget axis to ensure computational tractability, which is effective given the standard architecture dimensions found in LLMs (Yang et al., 29 Mar 2025).

Calibration to estimate Jacobians and norms can be performed in few-shot (using only 5 real samples) or zero-shot (using one synthetic prompt) settings, making the approach extremely data-efficient relative to prior methods.
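
A toy version of this DP can be sketched as follows. This is illustrative only: the bit-width choices and layer coefficients are hypothetical, and the GCD reduction follows the formulation above rather than the authors' exact implementation.

```python
import math

def allocate_bits(alphas, m, R, choices=(2, 3, 4, 8)):
    """Minimize sum_k alpha_k * 2^{-b_k}  s.t.  sum_k b_k * m_k <= R,
    with each b_k drawn from `choices`. The budget axis is divided by
    the GCD of all per-layer costs to shrink the DP table. Assumes at
    least one assignment fits within R."""
    L = len(alphas)
    costs = [[b * mk for b in choices] for mk in m]
    g = 0
    for row in costs:
        for c in row:
            g = math.gcd(g, c)
    g = max(g, 1)
    Rg = R // g
    INF = float("inf")
    dp = [0.0] + [INF] * Rg              # dp[r]: best error at scaled cost r
    parent = [[None] * (Rg + 1) for _ in range(L)]
    for k in range(L):
        ndp = [INF] * (Rg + 1)
        for r in range(Rg + 1):
            if dp[r] == INF:
                continue
            for j, b in enumerate(choices):
                nr = r + costs[k][j] // g    # exact: g divides every cost
                val = dp[r] + alphas[k] * 2.0 ** (-b)
                if nr <= Rg and val < ndp[nr]:
                    ndp[nr] = val
                    parent[k][nr] = (r, b)
        dp = ndp
    best_r = min(range(Rg + 1), key=lambda r: dp[r])
    bits, r = [], best_r
    for k in range(L - 1, -1, -1):       # backtrack chosen bit widths
        prev_r, b = parent[k][r]
        bits.append(b)
        r = prev_r
    return bits[::-1], dp[best_r]
```

With a budget that cannot afford high precision everywhere, the DP spends bits on the layers with the largest $\alpha_k$, exactly as the objective dictates.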

5. Empirical Performance and Complexity

Evaluation on WikiText2 with LLaMA-family models and broad-scale ANN benchmarks reveals that RaBitQ-H and extended RaBitQ attain state-of-the-art performance in both accuracy (measured by perplexity for LLMs and recall for ANN) and efficiency (runtime, memory use) (Yang et al., 29 Mar 2025, Gao et al., 2024, Gao et al., 2024).

Table: LLaMA models, perplexity vs. average bits (lower is better)

| Model | GPTQ (3+) | OmniQ (3+) | Quip# (3) | RaanA (3.3) |
|---|---|---|---|---|
| LLaMA-7B | 8.81 | 6.15 | 6.29 | 6.10 |
| LLaMA-13B | 5.66 | 5.44 | 5.52 | 5.38 |
| LLaMA2-70B | 3.85 | 3.78 | 3.71 | 3.59 |
  • At 2+ bits, RaanA-2.3 (few-shot) perplexity is comparable to OmniQuant and Quip#.
  • At 3.3 bits, RaanA matches or slightly outperforms all baselines.
  • Zero-shot calibration incurs only ≈1% relative perplexity degradation.
  • RaanA quantizes LLaMA2-70B in 3293 s (few-shot, avg 2.1 bits, 4×A100+2×EPYC), which is 10–20× faster than Quip# (≈10 hr).

For ANN search, the extended RaBitQ achieves Recall@1 ≈ 0.90–0.99 (with 4–7 bits), outperforms PQ/OPQ and LVQ on both accuracy and QPS, and has a storage overhead of $B \cdot D$ bits per vector plus two floats (Gao et al., 2024, Gao et al., 2024).

6. Implementation and Optimization

RaBitQ and its descendants exploit structured orthogonal transforms for computational efficiency, employing AVX2/AVX512 SIMD instructions for inner product retrieval acceleration. Codebooks are implicitly defined using the random seed for random rotations, minimizing storage (Gao et al., 2024).

For the extended RaBitQ:

  • Code construction runs in $O(D \log D)$ per vector using Hadamard transforms, and quantization leverages fast heap-based enumeration over codebook grid points.
  • At the inference/query stage, operations are performed in $O(D)$ per candidate using SIMD, with support for two-stage estimation and hierarchical index structures (e.g., IVF).
  • Data is packed as contiguous bit strings and floats, facilitating efficient memory access.
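
The contiguous bit-string packing can be sketched with NumPy. This illustrates the $B \cdot D$-bits-per-vector layout in principle, not the papers' exact memory format:

```python
import numpy as np

def pack_codes(codes, b):
    """Pack an array of b-bit integer codes (b <= 8) into a byte array."""
    codes = np.asarray(codes, dtype=np.uint8)
    # Expand each code into its b bits, most significant bit first.
    bits = ((codes[:, None] >> np.arange(b - 1, -1, -1)) & 1).astype(np.uint8)
    return np.packbits(bits.ravel())      # MSB-first, zero-padded to bytes

def unpack_codes(packed, b, n):
    """Recover n codes of b bits each from the packed bytes."""
    bits = np.unpackbits(packed)[: n * b].reshape(n, b)
    weights = 1 << np.arange(b - 1, -1, -1)
    return (bits * weights).sum(axis=1)
```

Packing $n$ codes of $b$ bits costs $\lceil nb/8 \rceil$ bytes, matching the per-vector figure of $B \cdot D$ bits plus the two float scalars quoted for the ANN setting.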

7. Significance and Impact

RaBitQ quantization introduces theoretically guaranteed, unbiased quantization for both ANNS and deep learning model compression. Its error guarantees, combined with empirical performance and computational efficiency, represent a significant advance over product quantization approaches, which lack formal error bounds and can fail in adverse regimes (Gao et al., 2024). The adaptation to LLM weights via RaBitQ-H and the AllocateBits optimizer yields a practically viable, fast, and data-efficient PTQ pipeline, as demonstrated on leading generative models (Yang et al., 29 Mar 2025).

The underlying principles and practical implementations have enabled new capabilities in resource-constrained inference, data center ANN search, and rapid LLM deployment. A plausible implication is the increasing adoption of randomized, highly structured quantization in both systems and large-model inference pipelines, due to the method’s blend of mathematical guarantees and scalable engineering.
