
Improving the Diffusability of Autoencoders

Published 20 Feb 2025 in cs.CV, cs.AI, and cs.LG | (2502.14831v3)

Abstract: Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$. The source code is available at https://github.com/snap-research/diffusability.

Summary

  • The paper introduces scale equivariance regularization to reduce high-frequency imbalances in autoencoder latent spaces used by latent diffusion models.
  • It achieves significant quality improvements, including a 19% drop in FID for images and up to a 49% decrease in FVD for videos with minimal fine-tuning.
  • The method maintains or slightly improves reconstruction quality over standard KL regularization, leading to smoother denoising trajectories in diffusion processes.

This paper addresses the interaction between autoencoders (AEs) and diffusion backbones in Latent Diffusion Models (LDMs), focusing on an underexplored aspect called "diffusability"—how suitable an AE's latent space is for the diffusion process. The authors perform a spectral analysis using the Discrete Cosine Transform (DCT) on the latent spaces of modern AEs (like FluxAE, CosmosTokenizer, CogVideoX-AE, LTX-AE) and find they contain excessive high-frequency components compared to natural RGB images. This issue is more pronounced in AEs with larger bottleneck channel sizes, which are often used to improve reconstruction quality.
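The spectral analysis above can be sketched in a few lines. The snippet below is an illustrative version, not the paper's code: it measures what fraction of a latent's DCT energy sits above a frequency cutoff, using a simple index-based notion of "high frequency" (the helper name, the cutoff rule, and the toy signals are all assumptions for demonstration).

```python
import numpy as np
from scipy.fft import dctn

def high_freq_energy_fraction(latent, cutoff=0.5):
    """Fraction of DCT spectral energy above a normalized frequency cutoff.

    `latent` is a (C, H, W) array. A coefficient (i, j) counts as "high
    frequency" when either index exceeds `cutoff` of its axis length;
    this is a simplification of the paper's channel-averaged DCT analysis.
    """
    C, H, W = latent.shape
    spec = np.stack([np.abs(dctn(latent[c], norm="ortho")) for c in range(C)])
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    high = (ii / H > cutoff) | (jj / W > cutoff)
    energy = (spec ** 2).sum(axis=(1, 2))
    high_energy = (spec ** 2 * high).sum(axis=(1, 2))
    return float((high_energy / energy).mean())

# A smooth signal (integrated noise) concentrates energy at low
# frequencies, while white noise spreads it roughly uniformly.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 32, 32))
smooth = np.cumsum(np.cumsum(noise, axis=1), axis=2)
print(high_freq_energy_fraction(smooth) < high_freq_energy_fraction(noise))  # → True
```

Run on real latents, a well-behaved ("diffusable") latent space should score closer to natural RGB images than to white noise on this kind of metric.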

The core hypothesis is that these unnatural high-frequency components interfere with the inherent coarse-to-fine synthesis process of diffusion models, thereby degrading the final generation quality. The paper also shows that standard KL divergence regularization used in Variational Autoencoders (VAEs) can worsen this spectral imbalance by introducing more high-frequency noise.

To address this, the paper proposes a simple yet effective regularization strategy called Scale Equivariance (SE). The goal is to align the spectral properties of the latent space with the RGB space. This is achieved by enforcing scale equivariance in the AE's decoder during a short fine-tuning phase:

  1. Both the input image $x$ and its corresponding latent representation $z = \mathrm{Enc}(x)$ are downsampled (e.g., using 2x or 4x bilinear downsampling) to get $\tilde{x}$ and $\tilde{z}$.
  2. An additional reconstruction loss term is added to the AE training objective, penalizing the difference between the downsampled image $\tilde{x}$ and the decoder's output from the downsampled latent, $\mathrm{Dec}(\tilde{z})$.
  3. The full loss function is:

    $L(x) = d(x, \mathrm{Dec}(z)) + \alpha\, d(\tilde{x}, \mathrm{Dec}(\tilde{z})) + \beta L_\text{KL}$

    where $d$ is a standard reconstruction loss (e.g., MSE + LPIPS), $\alpha$ controls the strength of the SE regularization (typically 0.25), and $\beta L_\text{KL}$ is the optional VAE KL term (often set to 0 when using SE).
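The objective above can be sketched as follows. This is a minimal toy version under stated assumptions: average pooling stands in for bilinear downsampling, plain MSE stands in for the full reconstruction loss $d$ (MSE + LPIPS), and the KL term is omitted ($\beta = 0$); the function names are illustrative, not the paper's API.

```python
import numpy as np

def downsample(x, factor=2):
    """Average pooling as a simple stand-in for bilinear downsampling."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def se_loss(x, enc, dec, alpha=0.25, factor=2):
    """Scale-equivariance objective: reconstruction loss on the full image
    plus an alpha-weighted loss tying Dec(downsample(z)) to downsample(x)."""
    z = enc(x)
    recon = np.mean((x - dec(z)) ** 2)                 # d(x, Dec(z))
    x_small = downsample(x, factor)
    z_small = downsample(z, factor)
    se = np.mean((x_small - dec(z_small)) ** 2)        # d(x~, Dec(z~))
    return recon + alpha * se

# Toy identity autoencoder: both terms vanish, so the loss is exactly zero.
x = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
print(se_loss(x, enc=lambda t: t, dec=lambda t: t))  # → 0.0
```

In a real setup `enc`/`dec` are the AE networks and the SE term is simply added to the existing training loss during the short fine-tuning phase.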

This method requires minimal code changes and only a brief fine-tuning period for the AE (e.g., 10k-20k steps). Experiments show that SE fine-tuning effectively reduces the high-frequency components in the latent space, making its spectrum more similar to that of RGB images.

The effectiveness of SE regularization is demonstrated by training Diffusion Transformer (DiT) models on top of various AEs (vanilla, and fine-tuned with or without SE) for image (ImageNet-1K $256^2$) and video (Kinetics-700 $17 \times 256^2$) generation. Key results include:

  • Improved Generation Quality: Significant reductions in standard metrics are observed. For ImageNet $256^2$, FID dropped by 19% for DiT-XL/2 using FluxAE+SE compared to vanilla FluxAE. For Kinetics-700, FVD decreased by at least 44% (e.g., CogVideoX-AE+SE showed a 49% FVD drop with DiT-XL/2).
  • Efficiency: The improvements are achieved with only short AE fine-tuning.
  • Reconstruction Preservation: Unlike strong KL regularization, SE regularization generally maintains or slightly improves AE reconstruction quality across metrics like PSNR, SSIM, and LPIPS, while significantly boosting downstream LDM performance.
  • Robustness: Visualizations confirm that LDMs trained with SE-regularized AEs exhibit smoother denoising trajectories with fewer high-frequency artifacts early on. AEs trained with SE also show better reconstruction when high-frequency components are deliberately removed from their latents.
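The last robustness check, deliberately removing high-frequency components from a latent, can be sketched as a DCT low-pass filter. This is an illustrative version (the helper name and the square cutoff mask are assumptions); the paper's point is that SE-trained AEs reconstruct well even from such filtered latents.

```python
import numpy as np
from scipy.fft import dctn, idctn

def lowpass_latent(z, keep=0.5):
    """Zero out DCT coefficients above `keep` fraction of each spatial axis.

    `z` is a (C, H, W) latent; only the low-frequency corner of each
    channel's DCT spectrum is retained before inverting the transform.
    """
    c, h, w = z.shape
    kh, kw = int(h * keep), int(w * keep)
    out = np.empty_like(z)
    for i in range(c):
        spec = dctn(z[i], norm="ortho")
        mask = np.zeros_like(spec)
        mask[:kh, :kw] = 1.0
        out[i] = idctn(spec * mask, norm="ortho")
    return out

# A constant latent has only a DC component, so filtering leaves it intact.
z_const = np.ones((2, 8, 8))
print(np.allclose(lowpass_latent(z_const), z_const))  # → True
```

Feeding `lowpass_latent(Enc(x))` into the decoder and comparing the reconstruction against `x` probes how much of the AE's information is carried by the high frequencies that SE regularization suppresses.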

In conclusion, the paper highlights the importance of latent space "diffusability" for LDMs and identifies spectral mismatch (excessive high frequencies) as a key issue in modern AEs. The proposed scale equivariance regularization offers a practical, efficient, and effective way to improve this spectral alignment, leading to substantial gains in the quality of LDM-generated images and videos.
