
VQ-VAE-2: Hierarchical Vector Quantized Autoencoders

Updated 20 February 2026
  • VQ-VAE-2 is a hierarchical deep generative model that quantizes latent representations at multiple scales to capture both global semantics and fine details.
  • It employs hierarchical encoders with vector quantization, EMA updates, and autoregressive priors to achieve efficient, parallel reconstruction and accelerated sampling.
  • Empirical evaluations on datasets like ImageNet and FFHQ demonstrate low reconstruction error, high sample diversity, and robustness against mode collapse.

Vector Quantized Variational Autoencoders (VQ-VAE-2) are a class of deep generative models that extend the vector quantized variational autoencoder architecture to a multi-scale, hierarchical framework with powerful autoregressive priors for capturing global and local data structure. They have demonstrated high-fidelity, diverse sample generation, particularly in challenging domains such as large-scale image synthesis, and exemplify a paradigm that combines discrete latent variable modeling with efficient neural architectures and expressive nonparametric priors (Razavi et al., 2019).

1. Model Architecture

VQ-VAE-2 organizes latent representations into a hierarchical structure, typically using two or three quantization levels. The architecture consists of:

  • Hierarchical encoders:
    • The input image $x \in \mathbb{R}^{H \times W \times 3}$ passes through a top-level encoder $E_t$ that performs substantial spatial downsampling (by a factor of 8, yielding $h_t \in \mathbb{R}^{(H/8) \times (W/8) \times D}$).
    • Subsequent encoders operate at finer scales, conditioned both on the original input and on upsampled, quantized codes from higher levels. For instance, a bottom-level encoder $E_b$ produces $h_b \in \mathbb{R}^{(H/4) \times (W/4) \times D}$.
  • Vector quantization:
    • At each level $\ell$, features $h^{(\ell)}$ are quantized to the nearest embedding in a level-specific codebook $\{e_i^{(\ell)}\}_{i=1}^K$ using nearest-neighbor search under the $\ell_2$ metric:

    $$z_q(x) = e_k, \quad k = \arg\min_j \| h^{(\ell)} - e_j^{(\ell)} \|_2.$$

    • Hyperparameters commonly used: codebook size $K=512$, embedding dimensionality $D=64$, and commitment cost $\beta=0.25$.

  • Decoder:

    • A feed-forward decoder $D$ receives all quantized latent maps and applies residual blocks and strided transposed convolutions to upsample to the original resolution.
    • Unlike pixel-space autoregressive decoders, $D$ performs fast, parallel reconstruction.

This design allows VQ-VAE-2 to effectively separate global from local information, with higher layers capturing more abstract structure and lower layers refining details (Razavi et al., 2019).
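The nearest-neighbor quantization step described above can be sketched in NumPy as follows (a minimal illustration using the hyperparameters reported in the text, $K=512$ and $D=64$; the shapes and variable names are for exposition only):

```python
import numpy as np

def vector_quantize(h, codebook):
    """Map each D-dim feature vector in h to its nearest codebook entry (L2).

    h:        (..., D) encoder output for one level
    codebook: (K, D) level-specific embeddings e_i
    Returns the quantized features z_q and the chosen indices k.
    """
    flat = h.reshape(-1, h.shape[-1])                       # (N, D)
    # Squared L2 distances via ||h||^2 - 2 <h, e> + ||e||^2
    d = (flat ** 2).sum(1, keepdims=True) - 2 * flat @ codebook.T \
        + (codebook ** 2).sum(1)                            # (N, K)
    k = d.argmin(axis=1)                                    # nearest code per vector
    z_q = codebook[k].reshape(h.shape)
    return z_q, k.reshape(h.shape[:-1])

# Toy example with codebook size K=512, embedding dimension D=64
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))
h = rng.normal(size=(8, 8, 64))          # one latent feature map
z_q, idx = vector_quantize(h, codebook)
```

Expanding the squared distance as $\lVert h \rVert^2 - 2\langle h, e\rangle + \lVert e \rVert^2$ avoids materializing an $(N, K, D)$ intermediate tensor, which matters at realistic latent-map sizes.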

2. Training Objective and Vector Quantization Mechanism

The loss function for VQ-VAE-2 generalizes the basic VQ-VAE objective to multiple levels:

$$L(x) = \sum_\ell \left( L_\mathrm{rec}^{(\ell)} + \lVert \mathrm{sg}[z_e^{(\ell)}] - e^{(\ell)} \rVert_2^2 + \beta \lVert z_e^{(\ell)} - \mathrm{sg}[e^{(\ell)}] \rVert_2^2 \right)$$

where:

  • $L_\mathrm{rec}^{(\ell)}$ is the reconstruction loss (typically MSE between $x$ and $D(\{e^{(\ell)}\})$).
  • The codebook and commitment terms encourage the encoder outputs to utilize the codebook efficiently and maintain stable embedding norms.
  • The stop-gradient operator $\mathrm{sg}[\cdot]$ ensures that only specific terms are differentiated with respect to parameters.

The quantization operation is inherently non-differentiable; VQ-VAE-2 employs the straight-through estimator during backpropagation, copying gradients from the quantized output directly to the encoder output, an effective low-variance workaround for learning discrete latents (Oord et al., 2017).
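A minimal numerical sketch of the codebook and commitment terms, and of the straight-through identity, is given below. Note that in a real implementation these live inside an autodiff framework so that the stop-gradients actually shape the backward pass; here NumPy only illustrates the forward values:

```python
import numpy as np

def vq_loss_terms(z_e, z_q, beta=0.25):
    """Codebook and commitment terms of the per-level objective.

    With autodiff, sg[.] is a stop-gradient, so the first term updates only
    the codebook and the second only the encoder; the scalar values coincide,
    but the gradients do not. beta=0.25 as stated in the text.
    """
    codebook_term = np.mean((z_q - z_e) ** 2)            # || sg[z_e] - e ||^2
    commitment_term = beta * np.mean((z_e - z_q) ** 2)   # beta || z_e - sg[e] ||^2
    return codebook_term, commitment_term

def straight_through(z_e, z_q):
    """Straight-through estimator: the forward value equals z_q, but written
    as z_e + stop_gradient(z_q - z_e) in a framework, the reconstruction-loss
    gradient flows to the encoder output z_e unchanged.
    """
    return z_e + (z_q - z_e)   # numerically identical to z_q

rng = np.random.default_rng(0)
z_e = rng.normal(size=(4, 4, 8))   # encoder output (toy shape)
z_q = rng.normal(size=(4, 4, 8))   # quantized output (toy stand-in)
cb_term, commit_term = vq_loss_terms(z_e, z_q)
```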

Codebook vectors are updated via exponential moving average (EMA) in practice to enhance codebook usage and stability.
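The EMA codebook update can be sketched as follows; the decay and smoothing constants are common choices in published implementations, not values taken from this article:

```python
import numpy as np

def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, idx,
                        decay=0.99, eps=1e-5):
    """One exponential-moving-average codebook update (sketch).

    cluster_size: (K,) EMA of per-code assignment counts
    embed_sum:    (K, D) EMA of summed encoder outputs per code
    z_e:          (..., D) encoder outputs; idx: matching code assignments
    """
    K, D = codebook.shape
    flat = z_e.reshape(-1, D)
    onehot = np.zeros((flat.shape[0], K))
    onehot[np.arange(flat.shape[0]), idx.ravel()] = 1.0
    cluster_size = decay * cluster_size + (1 - decay) * onehot.sum(0)
    embed_sum = decay * embed_sum + (1 - decay) * onehot.T @ flat
    # Laplace smoothing avoids division by zero for rarely used codes
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    codebook = embed_sum / smoothed[:, None]
    return codebook, cluster_size, embed_sum

# Toy usage: assign vectors to nearest codes, then update the codebook
rng = np.random.default_rng(0)
K, D = 16, 4
codebook = rng.normal(size=(K, D))
cluster_size, embed_sum = np.zeros(K), np.zeros((K, D))
z_e = rng.normal(size=(32, D))
idx = np.argmin(((z_e[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
codebook, cluster_size, embed_sum = ema_codebook_update(
    codebook, cluster_size, embed_sum, z_e, idx)
```

Each codebook vector drifts toward the running mean of the encoder outputs assigned to it, which replaces the codebook-loss gradient term and tends to keep more codes in active use.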

3. Hierarchical Autoregressive Priors and Sampling Procedure

To model long-range dependencies and structured semantics, VQ-VAE-2 fits powerful autoregressive priors on the discrete latents extracted from each level (after training the encoder and decoder):

  • Top-level prior: $p_t(e_t)$ is modeled using a PixelSnail architecture (combining gated convolutions and self-attention).
  • Bottom-level conditional prior: $p_b(e_b \mid e_t)$ is captured by a conditional PixelCNN.

Priors are optimized post hoc by maximizing the log-likelihood of the encoded training set with respect to these models. Sampling proceeds by ancestral generation: sample $e_t$ from $p_t$, then $e_b$ from $p_b(e_b \mid e_t)$, then reconstruct via the decoder.
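The ancestral sampling procedure can be sketched as below. The three callables stand in for the trained PixelSnail prior, the conditional PixelCNN, and the decoder; the latent-map sizes (32×32 top, 64×64 bottom) follow from the downsampling factors stated earlier for a 256×256 input:

```python
import numpy as np

def ancestral_sample(sample_top, sample_bottom_given_top, decode):
    """Two-stage ancestral sampling (sketch with stand-in callables)."""
    e_t = sample_top()                      # top-level code map, e.g. (32, 32)
    e_b = sample_bottom_given_top(e_t)      # bottom codes given top, e.g. (64, 64)
    return decode(e_t, e_b)                 # fast, parallel feed-forward decode

# Toy stand-ins: random code indices and a dummy decoder
rng = np.random.default_rng(0)
img = ancestral_sample(
    sample_top=lambda: rng.integers(0, 512, size=(32, 32)),
    sample_bottom_given_top=lambda e_t: rng.integers(0, 512, size=(64, 64)),
    decode=lambda e_t, e_b: np.zeros((256, 256, 3)),
)
```

Only the two autoregressive passes are sequential, and they run over 32×32 and 64×64 grids rather than 256×256 pixels, which is the source of the latent-space sampling speedup.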

Sampling in latent space is approximately $30\times$ faster than comparable pixel-space autoregressive models due to the reduced spatial dimensionality (Razavi et al., 2019).

4. Experimental Results and Empirical Properties

VQ-VAE-2 demonstrates strong empirical performance on large-scale datasets:

  • ImageNet-256: FID $\sim 30$ (unfiltered), FID $\sim 10$ with classifier-based rejection sampling. By comparison, BigGAN-deep achieves FID $\sim$ 6–8; however, VQ-VAE-2 exhibits higher recall and sample diversity in precision-recall coverage analysis.
  • Class-conditional sample quality: VQ-VAE-2 samples, when used to train classifiers, yield higher classification accuracy scores than samples from BigGAN-deep.
  • FFHQ-1024: With a three-level hierarchy, VQ-VAE-2 generates globally coherent faces, covering rare modes and diverse attributes.
  • Reconstruction and generation: MSE on held-out data is low ($\sim 0.005$ on ImageNet-256). Sampling from the trained priors yields diverse, high-fidelity images.

Compared to VQ-VAE and other hierarchically quantized models, VQ-VAE-2 achieves lower distortion (MSE) and, especially with multiple latent levels, avoids mode collapse that plagues GAN and vanilla VAE frameworks (Razavi et al., 2019).

5. Comparison with Other Hierarchical Vector Quantization Methods

Several follow-up architectures highlight both strengths and limitations of VQ-VAE-2:

  • HQ-VAE (Takida et al., 2023): This model posits that hierarchical extensions of VQ-VAE such as VQ-VAE-2 can suffer from codebook or layer collapse, where codebooks are underutilized in upper layers, degrading reconstruction. HQ-VAE introduces a variational Bayesian objective with stochastic quantization, learned codebook noise scales, and Gumbel-softmax reparameterization, leading to greater codebook utilization and improved reconstruction metrics (e.g., on ImageNet-256, HQ-VAE achieves RMSE = 4.60, LPIPS = 0.096, SSIM = 0.855, compared to VQ-VAE-2’s RMSE = 6.07, LPIPS = 0.265, SSIM = 0.751).
  • HR-VQVAE (Adiban et al., 2022): Employs residual quantization per layer and local codebooks, overcoming codebook collapse at high capacity and providing $>10\times$ faster decoding than VQ-VAE-2. Empirical FID and MSE are accordingly reduced (e.g., FFHQ: HR-VQVAE FID = 1.26, MSE = 0.00163; VQ-VAE-2 FID = 1.92, MSE = 0.00195).

Hierarchical residual or stochastic structures can address the inefficiency of flat, independently quantized hierarchies in VQ-VAE-2, especially in deep or large-capacity regimes.
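The residual-quantization idea behind HR-VQVAE can be sketched generically (this is an illustration of residual quantization in general, not the exact HR-VQVAE architecture, which also uses local codebooks):

```python
import numpy as np

def residual_quantize(h, codebooks):
    """Multi-stage residual quantization (generic sketch).

    Each layer quantizes the residual left by the previous layer, so the
    accumulated reconstruction is refined stage by stage.
    """
    residual = h.astype(float)
    total = np.zeros_like(residual)
    indices = []
    for cb in codebooks:                                   # one (K, D) codebook per layer
        flat = residual.reshape(-1, residual.shape[-1])
        d = ((flat[:, None, :] - cb[None]) ** 2).sum(-1)   # (N, K) squared distances
        k = d.argmin(1)
        q = cb[k].reshape(residual.shape)
        total += q                                          # accumulated reconstruction
        residual = residual - q                             # pass leftover to next layer
        indices.append(k)
    return total, indices

# Toy usage: two layers of residual quantization over an (4, 4, 8) feature map
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(32, 8)) for _ in range(2)]
h = rng.normal(size=(4, 4, 8))
approx, indices = residual_quantize(h, codebooks)
```

Because each codebook only has to model what earlier layers failed to capture, deeper layers stay well utilized instead of collapsing, which is the failure mode the comparison above attributes to independently quantized hierarchies.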

6. Practical Considerations and Limitations

  • Scalability: VQ-VAE-2 supports large codebooks ($K=512$) and embedding dimensions ($D=64$), but excessive codebook size can incur computational cost and underutilization.
  • Training efficiency: Inference and sampling remain efficient due to feed-forward decoding and latent-space autoregression, despite high parameter count in autoregressive priors and codebooks.
  • Application scope: Beyond image generation and compression, VQ-VAE-2-style architectures have been deployed for unsupervised speech modeling (phoneme discovery, speaker conversion) and video modeling, leveraging the power of discrete hierarchies for both generative and representation-learning tasks (Oord et al., 2017).
  • Mode collapse: Maximum likelihood training encourages coverage of all data modes, unlike GAN objectives; VQ-VAE-2 trades off precision against recall via rejection sampling rather than truncation heuristics.
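The classifier-based rejection step referenced above can be sketched as a simple post-hoc filter; `keep_fraction` is an assumed illustrative hyperparameter, not a value from this article:

```python
import numpy as np

def rejection_filter(samples, scores, keep_fraction=0.1):
    """Classifier-based rejection sampling (sketch): keep only the samples
    that a pretrained classifier scores highest for their intended class.

    keep_fraction is a hypothetical knob trading recall for precision.
    """
    n_keep = max(1, int(len(samples) * keep_fraction))
    order = np.argsort(scores)[::-1]         # highest classifier score first
    return [samples[i] for i in order[:n_keep]]

# Toy usage: ten "samples" scored 0..9, keep the top 30%
samples = list(range(10))
scores = np.arange(10, dtype=float)
kept = rejection_filter(samples, scores, keep_fraction=0.3)
```

Lowering `keep_fraction` raises precision (sample quality) at the cost of recall (diversity), mirroring the precision-recall trade-off discussed above without discarding modes at training time.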

7. Theoretical Significance and Extensions

VQ-VAE-2 demonstrates the effectiveness of combining discrete, nonparametric vector quantization with hierarchical latent modeling and expressive non-Gaussian priors. It provides a practical solution to posterior collapse and bridges VAEs and powerful generative models such as PixelCNN and self-attention-based sequence models. Subsequent research has refined quantization mechanisms (e.g., stochastic quantization, residual codebooks, variational ELBOs), enhanced codebook regularization, and broadened application domains, illustrating the centrality of hierarchical vector quantization in contemporary deep generative modeling (Takida et al., 2023, Adiban et al., 2022).
