
Quantization-Aware Hierarchical VAEs

Updated 8 February 2026
  • Quantization-aware Hierarchical VAEs are generative models that integrate layered discrete latent representations with vector or scalar quantization for efficient compression and modeling.
  • They employ diverse quantization schemes—including deterministic, stochastic, and Bayesian methods—combined with residual coding to optimize reconstruction and prevent codebook collapse.
  • The models achieve superior rate-distortion performance and faster decoding speeds through explicit entropy regularization and hierarchical, multi-scale quantization architectures.

Quantization-aware Hierarchical Variational Autoencoders (VAEs) are a class of generative models that combine hierarchical latent variable structures with explicit vector or scalar quantization mechanisms. These models enable compact discrete representations for tasks such as image and audio compression, likelihood-based modeling, and high-fidelity generative modeling, particularly at low bitrates. Key methodological advances include probabilistic or stochastic quantization, hierarchical or residual code structures, explicit entropy regularization, and efficient codebook utilization control.

1. Architectural Principles and Latent Hierarchy

Quantization-aware hierarchical VAEs are built by composing multiple layers of quantized encoding and decoding stages. Each layer operates either on raw data (e.g., pixels) or latent representations produced by preceding layers.

  • Hierarchical Stacking: Most designs organize the latent space hierarchically, forming a Markov chain or directed graphical model of discrete latent variables. In "Hierarchical Quantized Autoencoders" (HQA), each layer ℓ receives as input the embedding z_e^{(\ell-1)} from the previous encoder, encodes it to a new embedding z_e^{(\ell)}, quantizes that to z_q^{(\ell)}, and decodes back to reconstruct z_e^{(\ell-1)} (Williams et al., 2020).
  • Residual Coding: Some variants, such as HR-VQVAE, focus on learning discrete representations of the residual between the original input and reconstructions from higher-level quantized codes, enabling each layer to encode complementary information (Adiban et al., 2022).
  • Codebook Structure: Each layer is associated with a discrete codebook. Entries of these codebooks (prototypical embedding vectors) are trained alongside the encoder and decoder weights. The codebooks are commonly of dimension K_\ell \times d_\ell, where K_\ell is the codebook size and d_\ell the embedding dimension at layer ℓ (Takida et al., 2023, Duan et al., 2022).

Architecturally, each encoder and decoder is typically a small convolutional (often ResNet-style) network with 3×3 filters and stride-two downsampling or upsampling, as appropriate (Duan et al., 2023, Duan et al., 2022). The overall hierarchical latent structure forms a chain:

x \leftrightarrow z^{(1)} \leftrightarrow z^{(2)} \leftrightarrow \dots \leftrightarrow z^{(L)}

with no skip connections unless explicitly stated.
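The layered chain can be illustrated with a toy sketch in pure Python. The per-layer encoders and decoders are omitted (each layer simply snaps the output of the previous one to its own codebook), and all function names are hypothetical:

```python
def nearest_code(z, codebook):
    """Return (index, entry) of the codebook vector nearest to z in L2 distance."""
    dists = [sum((zi - ei) ** 2 for zi, ei in zip(z, e)) for e in codebook]
    k = min(range(len(codebook)), key=dists.__getitem__)
    return k, codebook[k]

def hierarchical_quantize(z, codebooks):
    """Pass an embedding through a chain of quantization layers.

    Mimics the chain x <-> z^(1) <-> ... <-> z^(L): each layer quantizes
    the (here, un-transformed) output of the previous layer against its
    own codebook and passes the quantized vector onward.
    """
    codes = []
    for cb in codebooks:
        k, z = nearest_code(z, cb)
        codes.append(k)
    return codes, z
```

In a real model each arrow in the chain would also involve a learned encoder/decoder pair; this sketch only shows how discrete codes accumulate one per layer.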

2. Quantization Mechanisms: Deterministic, Stochastic, and Bayesian Approaches

Quantization is central to these models, converting continuous latent representations into discrete codes for efficient compression and entropy modeling.

  • Deterministic Quantization: The simplest scheme (as in standard VQ-VAE) assigns each embedding to its nearest codebook entry via \operatorname{argmin}_j \|z_e - e_j\|_2 (Williams et al., 2020).
  • Stochastic Quantization: To promote codebook usage and enable differentiability, several works replace hard assignments with a soft (probabilistic) posterior:

q(z = k \mid z_e) \propto \exp(-\|z_e - e_k\|_2^2)

Samples can be drawn using Gumbel-Softmax relaxations (Williams et al., 2020, Willetts et al., 2020, Takida et al., 2023).

  • Bayesian Vector Quantization: HQ-VAE (Takida et al., 2023) introduces a hierarchical variational Bayes approach with stochastic vector quantization, where dequantization noise is modeled as \mathcal{N}(\tilde{z}; z, s^2 I) (with s^2 learned) for continuous relaxation, and posteriors over codes are Gibbs-like:

\hat{P}_{s^2}(z = b_k \mid \tilde{z}) \propto \exp\left(-\frac{\|\tilde{z} - b_k\|^2}{2 s^2}\right)

  • Quantization via Additive Noise: For scalar quantization, as in (Duan et al., 2023, Duan et al., 2022), uniform noise \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2}) is added to the latent during training, with rounding applied at test time.

Practical quantization-aware training involves the straight-through estimator (STE) for backpropagation, or continuous relaxations that ensure low-variance gradients and smooth convergence.
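The soft posterior and a Gumbel-Softmax draw can be computed numerically as follows (a pure-Python sketch; `temperature` and `tau` are illustrative hyperparameters, not values taken from the cited papers):

```python
import math
import random

def soft_assignments(z_e, codebook, temperature=1.0):
    """q(z = k | z_e) proportional to exp(-||z_e - e_k||^2 / temperature)."""
    logits = [-sum((zi - ei) ** 2 for zi, ei in zip(z_e, e)) / temperature
              for e in codebook]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [v / s for v in exps]

def gumbel_softmax_sample(probs, tau=0.5, rng=random):
    """Draw a relaxed (soft) one-hot sample via the Gumbel-Softmax trick."""
    gumbels = [-math.log(-math.log(rng.random())) for _ in probs]
    scores = [(math.log(p) + g) / tau for p, g in zip(probs, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [v / z for v in exps]
```

As `tau` anneals toward zero the relaxed sample approaches a hard one-hot assignment, which is how these models recover discrete codes at convergence.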

3. Training Objectives and Entropy Terms

The total loss for quantization-aware hierarchical VAEs typically combines reconstruction loss, codebook/commitment losses, entropy-based regularization, and Kullback-Leibler (KL) divergence terms.

  • Reconstruction (Distortion): Negative log-likelihood or mean squared error between input x and reconstruction \hat{x}.
  • Quantization/Commitment Losses: Penalties encouraging encoder outputs to be close to selected codebook entries (and vice versa). For example:

L_{\mathrm{quant}} = -H[q(z \mid z_e)] + \mathbb{E}_{q(z \mid z_e)}\|z_e - e_z\|_2^2

in HQA (Williams et al., 2020).

  • Rate Terms: KL divergences between variational posteriors q(z_\ell) and hierarchical priors p(z_\ell \mid z_{>\ell}), providing an explicit rate-distortion trade-off (Duan et al., 2022, Duan et al., 2023, Takida et al., 2023).
  • Entropy Regularization: Negative entropy of the code-assignment posterior is included to encourage non-collapsed, high-perplexity codebook usage; this is done either explicitly (as in HQA) or emerges from the variational ELBO for discrete VAEs (Willetts et al., 2020, Takida et al., 2023).

In models with probabilistic or Bayesian quantization, temperatures or variances controlling quantization noise (s^2, \tau) are learned and typically anneal towards zero during training, resulting in nearly discrete posteriors at convergence (Takida et al., 2023).
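To make the HQA-style quantization loss concrete, the following pure-Python sketch (hypothetical helper name) evaluates the negative posterior entropy plus the expected squared distance for a single embedding:

```python
import math

def hqa_quant_loss(z_e, codebook):
    """HQA-style quantization loss for one embedding:
    -H[q(z|z_e)] + E_q ||z_e - e_z||^2, with q(z=k|z_e) a softmax over
    negative squared distances to the codebook entries."""
    d2 = [sum((zi - ei) ** 2 for zi, ei in zip(z_e, e)) for e in codebook]
    m = min(d2)                                 # shift for numerical stability
    exps = [math.exp(-(d - m)) for d in d2]
    s = sum(exps)
    q = [v / s for v in exps]
    entropy = -sum(p * math.log(p) for p in q if p > 0)
    expected_dist = sum(p * d for p, d in zip(q, d2))
    return -entropy + expected_dist
```

The negative-entropy term rewards spread-out assignments (keeping many codes in play), while the expected-distance term pulls the encoder output toward the codes it is assigned to.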

4. Codebook Utilization, Collapse Prevention, and Hierarchy Efficacy

Efficient codebook utilization is critical. Underutilization ("collapse") limits effective discrete capacity and degrades reconstructions.

  • Collapse Detection: Active monitoring uses code-assignment histograms, perplexity metrics P = \exp H(Q(z)), and Lorenz curves (Reyhanian et al., 29 Jan 2026, Takida et al., 2023).
  • Mitigation Strategies:
    • Probabilistic/entropy-regularized training, as in HQ-VAE and HQA, often suffices to maintain high utilization (Williams et al., 2020, Takida et al., 2023).
    • Explicit codebook initialization from early encoder outputs and periodic reset of inactive codes further reduce collapse (Reyhanian et al., 29 Jan 2026).
    • In hierarchical (especially residual) structures, each codebook is specialized to a scale or residual due to the architecture and loss decomposition (Adiban et al., 2022).
  • Hierarchy vs Flat Structure: "Is Hierarchical Quantization Essential for Optimal Reconstruction?" (Reyhanian et al., 29 Jan 2026) demonstrates empirically that—when capacity is matched and collapse is mitigated—hierarchical and single-level VQ-VAE variants achieve near-identical reconstruction fidelity. Hierarchical organization is not inherently superior in rate-distortion if codebook issues are controlled, but still offers advantages for generative modeling and maintaining perceptual sharpness with multi-scale codes.
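The perplexity diagnostic above can be computed directly from a batch of code assignments; a small pure-Python sketch (hypothetical function name):

```python
import math
from collections import Counter

def codebook_perplexity(code_indices, codebook_size):
    """Perplexity P = exp(H) of the empirical code-usage distribution.

    P near codebook_size indicates uniform usage of all codes;
    P near 1 signals codebook collapse onto a few entries."""
    counts = Counter(code_indices)
    n = len(code_indices)
    probs = [counts.get(k, 0) / n for k in range(codebook_size)]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(h)
```

Tracking this value per layer during training is a cheap way to spot collapse before reconstruction quality degrades.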

5. Hierarchical Quantization Variants and Unified Formulations

Multiple architectural variants have been developed to exploit the benefits of hierarchical quantization.

  • Injected Hierarchy: Multi-resolution "injected" codes where latents at different scales are combined in the decoder. VQ-VAE-2 and SQ-VAE-2 are canonical examples; HQ-VAE recovers these as a special case of its Bayesian ELBO (Takida et al., 2023).
  • Residual Hierarchy: Each layer codes the residual left by decoding all previous codes. HR-VQVAE and RSQ-VAE fall in this category. This design reduces codebook search complexity to O(nm) and avoids single codebook collapse, even for large codebooks (Adiban et al., 2022, Takida et al., 2023).
  • Relaxed-Responsibility Vector Quantization: RRVQ-VAE uses Gaussian kernels with learnable means and variances to parameterize code assignments, providing more stable mixing coefficients for Gumbel-Softmax sampling across deep stacks (up to 32 layers) (Willetts et al., 2020).
  • Quantization-aware Hierarchical VAEs for Compression: For image and audio compression, several works embed quantization-aware priors and posteriors directly into the VAE likelihood and use fully parallel arithmetic coding for high-throughput applications (Duan et al., 2023, Duan et al., 2022).
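The residual-hierarchy idea can be sketched in a few lines of pure Python (a simplified illustration in the spirit of HR-VQVAE/RSQ-VAE, not either paper's exact procedure): each layer quantizes what the previous layers left over, so the sum of the selected entries approximates the input.

```python
def residual_quantize(z, codebooks):
    """Residual vector quantization: layer i quantizes the residual left
    by layers 1..i-1; the running sum of chosen entries approximates z."""
    residual = list(z)
    codes, approx = [], [0.0] * len(z)
    for cb in codebooks:
        dists = [sum((ri - ei) ** 2 for ri, ei in zip(residual, e)) for e in cb]
        k = min(range(len(cb)), key=dists.__getitem__)
        codes.append(k)
        approx = [a + e for a, e in zip(approx, cb[k])]
        residual = [ri - ei for ri, ei in zip(residual, cb[k])]
    return codes, approx
```

Because each codebook only has to cover a progressively smaller residual, search stays local to each layer and no single codebook carries the whole representational burden.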

6. Empirical Results and Practical Impact

Hierarchical quantization-aware VAEs have demonstrated strong empirical performance across diverse datasets and tasks.

  • Reconstruction and Compression: Models such as HR-VQVAE and HQ-VAE improve FID and MSE over baseline VQ-VAE and VQ-VAE-2, both on images (e.g., ImageNet, CelebA, FFHQ) and audio (UrbanSound8K) (Adiban et al., 2022, Takida et al., 2023). For example, HR-VQVAE attains FID=1.26 (FFHQ) vs. 2.86 (VQVAE) and 1.92 (VQVAE-2) (Adiban et al., 2022).
  • Codebook Usage: Models with explicit entropy terms or Bayesian training (e.g., HQ-VAE) maintain high codebook perplexity and avoid manual resets, even as capacity scales (Takida et al., 2023).
  • Decoding Speed: Hierarchical and residual designs provide order-of-magnitude speedups in decoding by localizing codebook search and enabling parallel coding and decoding (Adiban et al., 2022, Duan et al., 2022).
  • Ablation Findings: Combining uniform noise training, strong entropy/commitment regularization, and hierarchical residual structures is critical for best rate-distortion performance and collapse prevention (Williams et al., 2020, Adiban et al., 2022).

7. Controversies and Best Practices in Architectural Design

Hierarchical quantization has been widely adopted, but its supposed intrinsic advantages for rate-distortion and reconstruction have been questioned.

  • Role of Hierarchy: When discrete and continuous representational budgets are matched, the improvement in reconstruction observed in hierarchical models is attributable to effective codebook utilization and not to hierarchy per se (Reyhanian et al., 29 Jan 2026).
  • When Hierarchy Matters: Hierarchical structures remain valuable for perceptual quality, structured generative modeling (multi-scale sampling), and for tasks necessitating multi-level abstraction (semantic image synthesis, controllable generation).
  • Best Practices: For optimal reconstructions, empirical guidance includes matching budgets when comparing architectures, actively monitoring codebook usage, preferring increases in codebook size over embedding dimensionality, and applying simple codebook maintenance strategies to prevent collapse (Reyhanian et al., 29 Jan 2026).
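One of the maintenance strategies above, re-seeding rarely used codes from recent encoder outputs, can be sketched as follows (pure Python; the threshold and the uniform re-seeding policy are illustrative choices, not the procedure of any specific cited paper):

```python
import random

def reset_inactive_codes(codebook, usage_counts, recent_encodings,
                         min_count=1, rng=random):
    """Re-seed codebook entries whose usage fell below min_count by
    copying a randomly chosen recent encoder output, a simple
    collapse-mitigation step applied periodically during training."""
    for k, count in enumerate(usage_counts):
        if count < min_count:
            codebook[k] = list(rng.choice(recent_encodings))
    return codebook
```

Run periodically (e.g., once per epoch) alongside the perplexity monitoring described in Section 4, this keeps the effective codebook size close to its nominal size.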
