
Residual Quantization VAE (RQ-VAE)

Updated 22 January 2026
  • RQ-VAE is a hierarchical autoencoder that stacks residual quantizers to overcome standard VQ limitations and enhance latent representation.
  • It employs a unified codebook across multiple quantization stages, drastically reducing code sequence length while maintaining tractable rate-distortion performance.
  • RQ-VAE has been effectively applied in image, text, and 3D motion synthesis, achieving state-of-the-art reconstruction fidelity and sampling speed.

Residual Quantization Variational Autoencoder (RQ-VAE) is a hierarchical vector quantized autoencoder framework designed to address the limitations of standard vector quantization (VQ-VAE) in learning flexible, highly efficient discrete representations for generative modeling tasks. Unlike single-layer VQ-VAEs, RQ-VAE achieves exponentially greater latent expressivity with fixed-size codebooks by stacking residual quantizers, enabling high-fidelity compression—including dramatic reductions in code sequence length—while maintaining tractable rate-distortion and robust codebook utilization (Lee et al., 2022).

1. Mathematical Framework and Quantization Principle

The defining feature of RQ-VAE is its residual quantization of feature maps. Let $z \in \mathbb{R}^n$ denote a continuous latent vector (e.g., one cell of the encoder's spatial feature map). Given a codebook $C = \{e(k) \in \mathbb{R}^n : k = 1, \dots, K\}$ and quantization depth $D$, residual quantization proceeds as:

  • $r_0 = z$, $\hat{z}^{(0)} = 0$
  • For each stage $d = 1, \dots, D$:
    • $r_d = z - \hat{z}^{(d-1)}$
    • $k_d = \arg\min_{k \in \{1, \dots, K\}} \|r_d - e(k)\|^2$
    • $\hat{z}_d = e(k_d)$
    • $\hat{z}^{(d)} = \hat{z}^{(d-1)} + \hat{z}_d$
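Written out in code, the per-vector loop above is only a few lines. The sketch below uses NumPy with a random codebook purely for illustration; the function name `residual_quantize` and the toy sizes are ours, not from the paper:

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Greedy residual quantization of one vector z.

    Returns the code indices (k_1, ..., k_D) and the cumulative
    approximation z_hat^{(D)} = sum_d e(k_d).
    """
    z_hat = np.zeros_like(z)              # z_hat^{(0)} = 0
    codes = []
    for _ in range(depth):
        r = z - z_hat                     # r_d = z - z_hat^{(d-1)}
        dists = ((codebook - r) ** 2).sum(axis=1)
        k = int(np.argmin(dists))         # nearest codeword to the residual
        codes.append(k)
        z_hat = z_hat + codebook[k]       # z_hat^{(d)} = z_hat^{(d-1)} + e(k_d)
    return codes, z_hat

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))       # K = 64 codewords of dimension n = 8
z = rng.normal(size=8)
codes, z_hat = residual_quantize(z, codebook, depth=4)
print(codes, np.linalg.norm(z - z_hat))
```

The returned tuple of indices is exactly the per-location code stack $(k_1, \dots, k_D)$ used in the feature-map formulation below.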

At the feature-map level, for input $X \in \mathbb{R}^{H_0 \times W_0 \times 3}$ and encoder $E$, the encoded feature map $Z = E(X) \in \mathbb{R}^{H \times W \times n_z}$ (with $H = H_0/f$, $W = W_0/f$ for downsampling factor $f$) is quantized independently at each spatial location into a stack of codes $M_{h,w} = (k_1, \dots, k_D) \in \{1, \dots, K\}^D$, yielding reconstructed features $\hat{Z}_{h,w} = \sum_{d=1}^{D} e(M_{h,w,d})$ (Lee et al., 2022).

The RQ-VAE objective generalizes the classical VQ-VAE losses to depth $D$:

  • Reconstruction loss: $L_{\text{rec}} = \|X - G(\hat{Z}^{(D)})\|_2^2$
  • Commitment loss: $L_{\text{commit}} = \sum_{d=1}^{D} \|Z - \mathrm{sg}[\hat{Z}^{(d)}]\|_2^2$
  • Total loss: $L_{\text{RQ-VAE}} = L_{\text{rec}} + \beta L_{\text{commit}}$

where $\mathrm{sg}$ is the stop-gradient operator and $G$ is the decoder.
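For concreteness, the forward values of these losses can be computed as in the sketch below. The shapes are toy-sized, the input is identified with the feature map (so the decoder $G$ is an identity stand-in), and $\beta = 0.25$ follows common VQ-VAE practice rather than a value stated here; the stop-gradient only affects backpropagation, so it is a no-op in this forward-only sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, nz, D = 4, 4, 8, 3
Z = rng.normal(size=(H, W, nz))          # encoder feature map E(X)
codebook = rng.normal(size=(256, nz))    # shared codebook, K = 256

# Residual quantization over the whole feature map, keeping each
# partial sum Z_hat^{(d)} because the commitment loss sums over depths.
Z_hat = np.zeros_like(Z)
partials = []
for d in range(D):
    R = (Z - Z_hat).reshape(-1, nz)      # residuals at depth d
    idx = np.argmin(((R[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    Z_hat = Z_hat + codebook[idx].reshape(H, W, nz)
    partials.append(Z_hat.copy())

# Commitment loss: sum_d ||Z - sg[Z_hat^{(d)}]||^2 (sg is a no-op forward).
L_commit = sum(((Z - P) ** 2).sum() for P in partials)
# Reconstruction loss is ||X - G(Z_hat^{(D)})||^2; with the identity
# stand-in for G and X := Z it reduces to the final quantization error.
L_rec = ((Z - Z_hat) ** 2).sum()
L_total = L_rec + 0.25 * L_commit        # beta = 0.25 is a common choice
print(L_rec, L_commit, L_total)
```

Note that the depth-$D$ commitment term coincides with the reconstruction error in this simplified setup, so $L_{\text{commit}} \ge L_{\text{rec}}$ here by construction.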

Codebook learning is performed via exponential moving average (EMA) updates of assigned features, and a straight-through estimator enables differentiable quantization, setting $\partial \hat{z} / \partial z = 1$ across the embedding step (Lee et al., 2022). Optional adversarial and perceptual objectives (e.g., patch-GAN, deep feature losses) can further enhance output sharpness.

2. Architecture, Codebooks, and Hierarchical Structure

A typical RQ-VAE architecture consists of:

  • Encoder $E$: a stack of convolutional residual blocks with aggressive spatial downsampling ($f = 32$ for $256 \times 256$ inputs, producing $8 \times 8$ feature maps); hidden channel dimension $n_z \approx 256$.
  • Decoder $G$: a symmetric upsampling/residual-block network.
  • Codebook $C$: a single shared codebook of size $K$ (e.g., $K = 16{,}384$ in high-resolution image experiments), used jointly by all $D$ quantization stages. Only one set of embeddings need be maintained, and every codeword is eligible at every residual stage.
  • Quantization depth $D$: empirically, $D = 4$ at $8 \times 8$ resolution suffices to match the fidelity of a $16 \times 16$ single-stage quantizer; increasing $D$ (see Table 1) further reduces distortion but increases autoregressive decoding cost.
Table 1. rFID as a function of quantization depth $D$ at fixed $8 \times 8$ feature-map resolution.

| Quantization depth $D$ | Codebook size $K$ | Feature map | rFID |
|---|---|---|---|
| 1 (VQ-GAN) | 128k | $8 \times 8$ | 17.1 |
| 4 (RQ-VAE) | 16k | $8 \times 8$ | 4.7 |
| 8 (RQ-VAE) | 16k | $8 \times 8$ | 2.7 |
| 16 (RQ-VAE) | 16k | $8 \times 8$ | 1.8 |

This exponential latent space ($K^D$) enables high-rate, high-fidelity quantization even at heavily reduced spatial resolutions, supporting ultra-fast autoregressive modeling (Lee et al., 2022).

Variants such as HR-VQVAE (Adiban et al., 2022), MoSa's hierarchical RQ-VAE (Liu et al., 3 Nov 2025), and stochastic RSQ-VAE (Takida et al., 2023) generalize this principle by (i) supporting per-layer codebooks, (ii) allowing different residual scales, and (iii) employing Bayesian entropy-based regularization to mitigate codebook collapse.

3. Rate–Distortion Analysis and Codebook Utilization

The rate-distortion advantage of RQ-VAE arises from the hierarchical composition of quantizers. A single $K$-codebook VQ partitions $\mathbb{R}^n$ into $K$ cells, with a corresponding distance-minimizing error $D(K)$. In contrast, a residual-quantized model of depth $D$ can express $K^D$ code combinations, substantially increasing representational power without an exponential codebook size (Lee et al., 2022).

For image encoding:

  • VQ-VAE ($16 \times 16 \times 1$, $K = 16{,}384$): rFID $\approx 4.3$
  • RQ-VAE ($8 \times 8 \times 4$, $K = 16{,}384$): rFID $\approx 4.7$, matching that fidelity while reducing the autoregressive sequence length from 256 to 64.
  • Increasing $D$ further, to 8 or 16, pushes rFID as low as 2.7 and 1.8, respectively.

Ablations consistently demonstrate that, for fixed KK, increasing DD is vastly more effective at reducing distortion than growing the codebook size (Lee et al., 2022).
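The depth-versus-distortion trend can be illustrated with a toy experiment on random data and a random codebook (our construction, not the paper's setup). One codeword is pinned to zero so that greedy residual quantization provably never increases the error from one stage to the next, making the monotone trend explicit:

```python
import numpy as np

rng = np.random.default_rng(3)
K, D, n, N = 64, 8, 16, 512
codebook = rng.normal(size=(K, n))
codebook[0] = 0.0        # a zero codeword guarantees error never grows with depth
Z = rng.normal(size=(N, n))

errors = []              # mean squared error after each quantization stage
Z_hat = np.zeros_like(Z)
for d in range(D):
    R = Z - Z_hat        # residuals at depth d
    idx = np.argmin(((R[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    Z_hat = Z_hat + codebook[idx]
    errors.append(float(((Z - Z_hat) ** 2).mean()))

# Distortion at fixed K decreases (weakly) as the residual stack deepens.
print([round(e, 3) for e in errors])
```

Because each stage may select the zero codeword, the per-point error satisfies $\min_k \|r_d - e(k)\|^2 \le \|r_d\|^2$, so the printed sequence is non-increasing; deepening the stack buys distortion reduction that enlarging $K$ alone cannot match at the same rate.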

Later extensions provide additional codebook usage guarantees. HR-VQVAE (Adiban et al., 2022) employs hierarchical codebook conditioning and per-layer contrastive targets, while HQ-VAE (Takida et al., 2023) incorporates variational entropy penalties to maximize code utilization and eliminate code collapse even at large DD.

4. Encoding and Decoding Algorithms

The full RQ-VAE encoding process proceeds as:

  1. Encode:
    • For each spatial location $(h, w)$ in $Z = E(X)$:
      • Initialize $r \leftarrow Z_{h,w}$, $\hat{z}^{(0)} \leftarrow 0$.
      • For $d = 1, \dots, D$:
        • $k_d \leftarrow \arg\min_k \|r - e(k)\|^2$
        • $\hat{z}_d \leftarrow e(k_d)$
        • $\hat{z}^{(d)} \leftarrow \hat{z}^{(d-1)} + \hat{z}_d$
        • $r \leftarrow r - \hat{z}_d$
        • Store the code index $M_{h,w,d} = k_d$.
  2. Decode:
    • For all $(h, w)$, sum the codebook embeddings over the code stack: $\hat{Z}_{h,w} = \sum_{d=1}^{D} e(M_{h,w,d})$.
    • Reconstruct the image: $\hat{X} = G(\hat{Z})$.
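The two procedures map directly onto array code. A minimal NumPy sketch of the encode/decode pair follows; the codebook is random, and the encoder $E$ and decoder $G$ are outside the scope of this snippet (the function names `rq_encode`/`rq_decode` are ours):

```python
import numpy as np

def rq_encode(Z, codebook, D):
    """Return the integer code map M of shape (H, W, D)."""
    H, W, n = Z.shape
    M = np.zeros((H, W, D), dtype=np.int64)
    Z_hat = np.zeros_like(Z)
    for d in range(D):
        R = (Z - Z_hat).reshape(-1, n)   # residuals r at depth d, flattened
        idx = np.argmin(((R[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
        M[..., d] = idx.reshape(H, W)
        Z_hat = Z_hat + codebook[idx].reshape(H, W, n)
    return M

def rq_decode(M, codebook):
    """Sum codebook embeddings over depth: Z_hat_{h,w} = sum_d e(M[h, w, d])."""
    return codebook[M].sum(axis=2)

rng = np.random.default_rng(4)
codebook = rng.normal(size=(128, 8))     # K = 128, n_z = 8
Z = rng.normal(size=(8, 8, 8))           # stands in for E(X)
M = rq_encode(Z, codebook, D=4)
Z_hat = rq_decode(M, codebook)
print(M.shape, Z_hat.shape)
```

Decoding needs only the integer map `M` and the shared codebook, which is what makes the code stacks a compact discrete representation for downstream autoregressive models.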

Gradient flow:

  • The losses $L_{\text{rec}}$ and $L_{\text{commit}}$ propagate gradients through $G$ and $E$.
  • The straight-through estimator routes gradients from codeword assignments to the encoder outputs.
  • Codebook $C$ updates occur via EMA of encoder outputs (Lee et al., 2022).
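The EMA codebook update in the last bullet can be sketched as follows. The decay $\gamma = 0.99$ and the Laplace smoothing term are assumptions following common VQ-VAE practice, not values stated here:

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, Z_flat, idx,
                        gamma=0.99, eps=1e-5):
    """One EMA step: move each codeword toward the mean of its assigned features.

    cluster_size and embed_sum are running EMA statistics of shapes (K,)
    and (K, n); idx holds the code assignment of each row of Z_flat.
    """
    K, n = codebook.shape
    onehot = np.zeros((len(idx), K))
    onehot[np.arange(len(idx)), idx] = 1.0
    # EMA of assignment counts and of the summed assigned features
    cluster_size = gamma * cluster_size + (1 - gamma) * onehot.sum(axis=0)
    embed_sum = gamma * embed_sum + (1 - gamma) * onehot.T @ Z_flat
    # Laplace smoothing keeps rarely used codes from dividing by ~0
    total = cluster_size.sum()
    smoothed = (cluster_size + eps) / (total + K * eps) * total
    codebook = embed_sum / smoothed[:, None]
    return codebook, cluster_size, embed_sum

rng = np.random.default_rng(5)
K, n, N = 16, 4, 256
codebook = rng.normal(size=(K, n))
cluster_size = np.zeros(K)               # EMA statistics start at zero
embed_sum = np.zeros((K, n))
Z_flat = rng.normal(size=(N, n))         # a batch of encoder outputs
idx = np.argmin(((Z_flat[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
codebook, cluster_size, embed_sum = ema_codebook_update(
    codebook, cluster_size, embed_sum, Z_flat, idx)
print(codebook.shape)
```

Because the codewords are fit by these running averages rather than by gradient descent, no gradient needs to flow into the codebook itself.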

Extensions in MoSa (Liu et al., 3 Nov 2025) introduce scale-varying downsampling/upsampling at each quantization stage, with reconstructed latents formed by upsampling the dequantized embeddings back to full length prior to computing the residual. This supports arbitrarily coarse-to-fine token hierarchies, beneficial for tasks such as temporal motion synthesis.

5. Applications, Empirical Results, and Comparisons

RQ-VAE frameworks are deployed in multiple domains:

  • Image generation: RQ-VAE enables efficient AR modeling by compressing $256 \times 256$ images to $8 \times 8 \times D$ discrete codes. Coupled with an RQ-Transformer, this yields FID and IS figures superior to VQ-GAN and competing methods, with 4-7× faster sampling (Lee et al., 2022).
  • Class-conditional generation (ImageNet $256^2$): RQ-Transformer (1.4B params, no rejection sampling) achieves FID 11.6 and IS 112.4 (vs. 15.8 and 74.3 for VQ-GAN). With rejection sampling, FID 4.45 and IS 326 (state of the art among AR models).
  • Text-conditional generation: on CC-3M, RQ-Transformer (650M params) yields FID 12.3 and CLIP similarity 0.26, versus 28.9 and 0.20 for VQ-GAN (Lee et al., 2022).
  • 3D motion synthesis: MoSa's RQ-VAE compresses 64-frame motion windows into 10 hierarchical code groups at variable scales, achieving state-of-the-art FID (0.06 on Motion-X, vs. 0.20 for MoMask) with 27% lower inference time (Liu et al., 3 Nov 2025).
  • General discrete representation learning: Hierarchical and stochastic variants (HR-VQVAE (Adiban et al., 2022), HQ-VAE (Takida et al., 2023)) outperform VQ-VAE/VQ-VAE-2 on reconstruction MSE and FID across FFHQ, ImageNet, CIFAR-10, MNIST, and audio datasets.

6. Variants, Theoretical Advances, and Mitigation of Codebook Collapse

Several extensions sharpen the RQ-VAE paradigm:

  • HR-VQVAE (Adiban et al., 2022): implements multi-layer residual VQ with per-layer codebooks and hierarchical linkage. Layer $l{+}1$'s codebook eligibility is indexed by the codes chosen at layer $l$, restricting the effective code dictionary per location and avoiding collapse.
  • MoSa RQ-VAE (Liu et al., 3 Nov 2025): utilizes nonlinear scale schedules and a Multi-scale Token Preservation Strategy (MTPS) to align AR inference with the hierarchical quantization groups, allowing transformer models to sample all tokens at each scale in parallel in $Q = 10$ total AR steps.
  • HQ-VAE (RSQ-VAE instance) (Takida et al., 2023): recasts RQ-VAE as a Bayesian model, introducing stochastic (Gaussian-categorical) code assignment, entropy regularization of codebook usage, and joint learning of quantization strengths. This stochastically annealed training cures codebook and layer collapse, eliminates heuristic penalties (no $\beta$ term, EMA updates, or code resets), and converges to full codebook usage with improved RMSE/LPIPS/SSIM.

A plausible implication is that variational residual quantization (e.g., HQ-VAE) should increasingly supplant heuristic-hardened, deterministic approaches for settings demanding both deep hierarchies and stable code allocation.

7. Significance and Outlook

The RQ-VAE family provides an expressive and efficient method for learning high-rate discrete representations with tractable quantization complexity, substantially increasing downstream generative modeling capacity and speed. Its flexible architecture supports domain generalization (images, motions, audio) and is compatible with both deterministic and variational Bayesian training. By enabling discrete representations whose cardinality grows exponentially in quantization depth (without codebook inflation), RQ-VAE achieves state-of-the-art rates on challenging benchmarks and is robust to codebook collapse, a frequent failure mode in non-residual models (Lee et al., 2022, Adiban et al., 2022, Takida et al., 2023, Liu et al., 3 Nov 2025). This positions RQ-VAE and its variants as fundamental to the evolution of highly scalable discrete latent variable models for modern deep generative applications.
