
RaBitQ Quantization for LLM and ANN Compression

Updated 18 January 2026
  • RaBitQ quantization is a randomized method that applies random rotations and unbiased, coordinate-wise rounding to efficiently compress high-dimensional vectors.
  • It extends to post-training quantization of LLM weights via the RaBitQ-H variant, achieving competitive accuracy with improved computational efficiency.
  • The approach delivers rigorous error bounds and data-efficient calibration, significantly enhancing ANN search performance and deep model inference.

RaBitQ quantization is a randomized quantization methodology, originally developed for high-dimensional vector compression in approximate nearest neighbor search (ANNS), which has been extended to efficient and accurate post-training quantization (PTQ) of LLMs. Its foundational principle is the combination of random or structured orthogonal transforms with unbiased, coordinate-wise quantization and theoretically rigorous error control. Recent extensions—such as RaBitQ-H within the RaanA framework—specialize the approach for matrix (weight) quantization in neural networks, yielding competitive accuracy with notable data and computational efficiency (Yang et al., 29 Mar 2025, Gao et al., 2024, Gao et al., 2024).

1. Foundations of RaBitQ Quantization

RaBitQ encodes a $d$-dimensional vector $\mathbf{x} \in \mathbb{R}^d$ into $b$ bits per coordinate, preserving unbiased inner product estimation with quantized codes. The method begins by applying a random orthogonal rotation $P \in \mathbb{R}^{d \times d}$ (typically a Johnson–Lindenstrauss (JL) transform such as a Hadamard transform composed with random sign flips), yielding $\mathbf{u} = P\mathbf{x}$ (Yang et al., 29 Mar 2025, Gao et al., 2024).

Each coordinate $u_i$ is independently quantized to $\bar{u}_i \in \{0, \ldots, 2^b - 1\}$ by randomized rounding onto a uniform quantization grid. The quantized code consists of the vector $\bar{\mathbf{u}}$ and a real-valued scaling factor $t$. This representation enables unbiased estimation of the original inner product $\langle \mathbf{x}, \mathbf{y} \rangle$ by reconstructing quantized values and applying the same transform to queries (Yang et al., 29 Mar 2025).

The original RaBitQ codebook for $b = 1$ is the hypercube $\{\pm 1/\sqrt{d}\}^d$, generalized to uniformly spaced, normalized grids for $b > 1$ (Gao et al., 2024). The quantization leverages structure and randomness to maintain unbiasedness and $O(2^{-b}\sqrt{d})$ error in inner product estimation with high probability.

2. Algorithmic Schemes and Extensions

The RaBitQ scheme proceeds as follows:

  • Code Construction: For each input, perform rotation, coordinate-wise quantization, and store associated rescaling.
  • Decoding and Inference: Given a query $\mathbf{y}$, transform it as $\mathbf{v} = P\mathbf{y}$ (using the same $P$), and compute the unbiased inner-product estimator:

$$\widehat{\langle \mathbf{x}, \mathbf{y} \rangle} = \langle t(\bar{\mathbf{u}} - c_b \mathbf{1}),\, \mathbf{v} \rangle, \qquad c_b = \tfrac{2^b - 1}{2}$$

  • Theoretical Error: The estimator is unbiased, $\mathbb{E}[\widehat{\langle \mathbf{x}, \mathbf{y} \rangle}] = \langle \mathbf{x}, \mathbf{y} \rangle$, and with high probability

$$\left|\widehat{\langle \mathbf{x}, \mathbf{y} \rangle} - \langle \mathbf{x}, \mathbf{y} \rangle\right| \lesssim \|\mathbf{x}\|\,\|\mathbf{y}\|\, 2^{-b} \sqrt{d}$$

Optimal bit allocation achieves $\varepsilon\,\|\mathbf{x}\|\|\mathbf{y}\|$ error with $b = \Theta(\log(\sqrt{d}/\varepsilon))$ for small $\varepsilon$.

Extended RaBitQ (for $b > 1$) constructs normalized grid codebooks, applies a random rotation, quantizes by nearest-neighbor search within the grid, and admits an asymptotically optimal tradeoff between bits used and error achieved, matching the $O\!\left(\frac{2^{-b}}{\sqrt{d}}\right)$ decay observed empirically and proven theoretically (Gao et al., 2024).
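
The encode/estimate steps above can be sketched in Python. This is an illustrative toy implementation of the described scheme, not the authors' code: a dense QR-based rotation stands in for the structured transforms used in practice, and the per-vector scale is chosen to cover the rotated coordinate range symmetrically.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Dense random orthogonal matrix via QR of a Gaussian matrix; real
    # implementations use structured Hadamard-based transforms for speed.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def rabitq_encode(x, P, b):
    """Rotate x, then round each coordinate to b bits with randomized
    rounding, so that E[t * (u_bar - c_b)] equals P @ x exactly."""
    u = P @ x
    c_b = (2**b - 1) / 2
    t = np.abs(u).max() / c_b              # per-vector scale factor
    g = u / t + c_b                        # grid position in [0, 2^b - 1]
    frac = g - np.floor(g)
    up = rng.random(len(g)) < frac         # round up with prob. = frac
    u_bar = np.clip(np.floor(g) + up, 0, 2**b - 1)
    return u_bar.astype(np.int64), t

def rabitq_inner_product(code, y, P, b):
    """Unbiased estimate of <x, y> from x's code and a raw query y."""
    u_bar, t = code
    c_b = (2**b - 1) / 2
    v = P @ y                              # same rotation for the query
    return float(t * (u_bar - c_b) @ v)    # <Px, Py> = <x, y> since P is orthogonal
```

Averaging the estimator over independent encodings converges to the exact inner product, reflecting the unbiasedness property; the single-shot deviation shrinks as $b$ grows.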

3. RaBitQ-H: Adaptation to LLM Weight Quantization

RaBitQ-H is a variant designed for quantizing the weight matrices of LLMs; it replaces the generic random rotation with a Subsampled Randomized Hadamard Transform (RHT), which reduces computational cost. The RHT step samples a diagonal sign matrix $D$ whose entries are independent Rademacher variables and composes it with the Hadamard transform $H_d$, i.e., $W' = \frac{1}{\sqrt{d}} H_d (D W)$ for each weight matrix $W$ (Yang et al., 29 Mar 2025).

Quantization then follows standard RaBitQ: each column of $W'$ is quantized to $b$ bits via randomized rounding, returning $(\widehat{W}', r, D)$. Inference for an input $X$ applies the same transform, $X' = \frac{1}{\sqrt{d}} H_d (D X^\top)^\top$, after which the output is estimated as

$$Y \approx X' \widehat{W}' \operatorname{diag}(r) - z\, r^\top$$

where $z$ is a precomputed bias term depending on the quantization.

Theoretical analysis shows that each coordinate-wise error $\epsilon_{ij}$ has a sub-Gaussian tail, and layer-wise errors propagate as

$$\operatorname{Err}_k(b; x) \lesssim 2^{-b} \sqrt{\frac{\log c}{d}}\; \|J_k\|_F \,\|X\|_F \,\|W\|_F$$

where $J_k$ is the Jacobian of the downstream function, ensuring bounded degradation at a controllable rate (Yang et al., 29 Mar 2025).
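
The Hadamard-based rotation at the heart of RaBitQ-H admits a short $O(d \log d)$ implementation. The following is a minimal sketch, assuming the dimension is a power of two and omitting the padding/subsampling details a production implementation would need:

```python
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform in O(d log d);
    len(a) must be a power of two."""
    a = np.asarray(a, dtype=float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h] = x + y        # butterfly: sum half
            a[i + h:i + 2 * h] = x - y  # butterfly: difference half
        h *= 2
    return a

def rht(x, signs):
    """Randomized Hadamard transform (1/sqrt(d)) * H_d @ (signs * x);
    orthogonal, so it preserves norms and inner products."""
    return fwht(signs * x) / np.sqrt(len(x))
```

Since $H_d$ is symmetric with $H_d^2 = d\,I$, the transform inverts as $x = \mathrm{signs} \cdot \mathrm{fwht}(y)/\sqrt{d}$, so only the sign vector (or its random seed) needs to be stored alongside the codes.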

4. AllocateBits: Layer-Wise Mixed-Precision Allocation

The AllocateBits module solves the problem of distributing a global bit budget $R$ optimally across $L$ layers to minimize total quantization error. For each layer $k$, an error coefficient

$$\alpha_k \approx \frac{1}{\sqrt{d_k}}\, \|J_k\|_F \,\|X^{(k)}\|_F \,\|W^{(k)}\|_F$$

is computed to model layer-wise quantization sensitivity. The bit-allocation problem is posed as the integer program:

$$\min_{b_1, \ldots, b_L \in \mathscr{B}} \; \sum_{k=1}^{L} \alpha_k\, 2^{-b_k} \quad \text{subject to} \quad \sum_{k=1}^{L} b_k m_k \leq R$$

where $m_k = d_k c_k$ is the parameter count of layer $k$. The solution uses a dynamic programming (DP) approach with a greatest-common-divisor (GCD) reduction of the budget axis to ensure computational tractability, which is effective given the standard architecture dimensions found in LLMs (Yang et al., 29 Mar 2025).

Calibration to estimate Jacobians and norms can be performed in few-shot (using only 5 real samples) or zero-shot (using one synthetic prompt) settings, making the approach extremely data-efficient relative to prior methods.
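
A toy version of this DP can be sketched as follows. This is illustrative only: the bit-width choices and layer coefficients are hypothetical, and the GCD reduction follows the formulation above rather than the authors' exact implementation.

```python
import math

def allocate_bits(alphas, m, R, choices=(2, 3, 4, 8)):
    """Minimize sum_k alpha_k * 2^{-b_k}  s.t.  sum_k b_k * m_k <= R,
    with each b_k drawn from `choices`. The budget axis is divided by
    the GCD of all per-layer costs to shrink the DP table. Assumes at
    least one assignment fits within R."""
    L = len(alphas)
    costs = [[b * mk for b in choices] for mk in m]
    g = 0
    for row in costs:
        for c in row:
            g = math.gcd(g, c)
    g = max(g, 1)
    Rg = R // g
    INF = float("inf")
    dp = [0.0] + [INF] * Rg              # dp[r]: best error at scaled cost r
    parent = [[None] * (Rg + 1) for _ in range(L)]
    for k in range(L):
        ndp = [INF] * (Rg + 1)
        for r in range(Rg + 1):
            if dp[r] == INF:
                continue
            for j, b in enumerate(choices):
                nr = r + costs[k][j] // g    # exact: g divides every cost
                val = dp[r] + alphas[k] * 2.0 ** (-b)
                if nr <= Rg and val < ndp[nr]:
                    ndp[nr] = val
                    parent[k][nr] = (r, b)
        dp = ndp
    best_r = min(range(Rg + 1), key=lambda r: dp[r])
    bits, r = [], best_r
    for k in range(L - 1, -1, -1):       # backtrack chosen bit widths
        prev_r, b = parent[k][r]
        bits.append(b)
        r = prev_r
    return bits[::-1], dp[best_r]
```

With a budget that cannot afford high precision everywhere, the DP spends bits on the layers with the largest $\alpha_k$, exactly as the objective dictates.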

5. Empirical Performance and Complexity

Evaluation on WikiText2 with LLaMA-family models and broad-scale ANN benchmarks reveals that RaBitQ-H and extended RaBitQ attain state-of-the-art performance in both accuracy (measured by perplexity for LLMs and recall for ANN) and efficiency (runtime, memory use) (Yang et al., 29 Mar 2025, Gao et al., 2024, Gao et al., 2024).

Table: LLaMA models, perplexity vs. average bits (lower is better)

| Model | GPTQ (3+) | OmniQ (3+) | Quip# (3) | RaanA (3.3) |
|---|---|---|---|---|
| LLaMA-7B | 8.81 | 6.15 | 6.29 | 6.10 |
| LLaMA-13B | 5.66 | 5.44 | 5.52 | 5.38 |
| LLaMA2-70B | 3.85 | 3.78 | 3.71 | 3.59 |
  • At 2+ bits, RaanA-2.3 (few-shot) perplexity is comparable to OmniQuant and Quip#.
  • At 3.3 bits, RaanA matches or slightly outperforms all baselines.
  • Zero-shot calibration incurs only ≈1% relative perplexity degradation.
  • RaanA quantizes LLaMA2-70B in 3293 s (few-shot, avg 2.1 bits, 4×A100+2×EPYC), which is 10–20× faster than Quip# (≈10 hr).

For ANN search, the extended RaBitQ achieves Recall@1 ≈ 0.90–0.99 (with 4–7 bits), outperforms PQ/OPQ and LVQ on both accuracy and QPS, and has a storage overhead of $B \cdot D$ bits per vector plus two floats (Gao et al., 2024, Gao et al., 2024).

6. Implementation and Optimization

RaBitQ and its descendants exploit structured orthogonal transforms for computational efficiency, employing AVX2/AVX512 SIMD instructions for inner product retrieval acceleration. Codebooks are implicitly defined using the random seed for random rotations, minimizing storage (Gao et al., 2024).

For the extended RaBitQ:

  • Code construction runs in $O(D \log D)$ per vector using Hadamard transforms, and quantization leverages fast heap-based enumeration over codebook grid points.
  • At the inference/query stage, operations are performed in $O(D)$ per candidate using SIMD, with support for two-stage estimation and hierarchical index structures (e.g., IVF).
  • Data is packed as contiguous bit strings and floats, facilitating efficient memory access.
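
The contiguous bit-string packing can be sketched with NumPy. This illustrates the $B \cdot D$-bits-per-vector layout in principle, not the papers' exact memory format:

```python
import numpy as np

def pack_codes(codes, b):
    """Pack an array of b-bit integer codes (b <= 8) into a byte array."""
    codes = np.asarray(codes, dtype=np.uint8)
    # Expand each code into its b bits, most significant bit first.
    bits = ((codes[:, None] >> np.arange(b - 1, -1, -1)) & 1).astype(np.uint8)
    return np.packbits(bits.ravel())      # MSB-first, zero-padded to bytes

def unpack_codes(packed, b, n):
    """Recover n codes of b bits each from the packed bytes."""
    bits = np.unpackbits(packed)[: n * b].reshape(n, b)
    weights = 1 << np.arange(b - 1, -1, -1)
    return (bits * weights).sum(axis=1)
```

Packing $n$ codes of $b$ bits costs $\lceil nb/8 \rceil$ bytes, matching the per-vector figure of $B \cdot D$ bits plus the two float scalars quoted for the ANN setting.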

7. Significance and Impact

RaBitQ quantization introduces theoretically guaranteed, unbiased quantization for both ANNS and deep learning model compression. Its error guarantees, combined with empirical performance and computational efficiency, represent a significant advance over product quantization approaches, which lack formal error bounds and can fail in adverse regimes (Gao et al., 2024). The adaptation to LLM weights via RaBitQ-H and the AllocateBits optimizer yields a practically viable, fast, and data-efficient PTQ pipeline, as demonstrated on leading generative models (Yang et al., 29 Mar 2025).

The underlying principles and practical implementations have enabled new capabilities in resource-constrained inference, data center ANN search, and rapid LLM deployment. A plausible implication is the increasing adoption of randomized, highly structured quantization in both systems and large-model inference pipelines, due to the method’s blend of mathematical guarantees and scalable engineering.
