How best to learn latent representations for diffusion models

Determine the optimal approach for learning latent representations that will be modeled by diffusion generative models, specifying training and regularization strategies that yield high-quality generation while maintaining interpretable control over latent information content.

Background

Latent representations enable diffusion models to scale efficiently to high-resolution image, video, and audio generation. The original Latent Diffusion Model employs a VAE-style KL penalty with a manually chosen weight, which complicates reasoning about the latent information content and bitrate. Recent alternatives use semantically focused or heavily regularized autoencoders that simplify training but often lose high-frequency details and reconstruction fidelity.
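To make the KL-penalty formulation concrete, here is a minimal sketch of the VAE-style objective with a manually chosen KL weight. The function names and the default weight are illustrative, not taken from the paper; the point is that `beta` is a free hyperparameter, so the latent's effective information content is only controlled indirectly.

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def autoencoder_loss(recon_err, mu, logvar, beta=1e-6):
    """Reconstruction error plus a beta-weighted KL penalty.

    beta is hand-tuned (LDM-style training uses a very small value),
    which is why the resulting latent bitrate is hard to reason about:
    nothing in the objective states how many bits the latent carries.
    """
    return recon_err + beta * kl_diag_gaussian(mu, logvar)
```

With `beta` near zero the latent is almost unregularized and retains high-frequency detail; large `beta` compresses the latent but degrades reconstructions, which is the trade-off the section describes.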

The paper highlights a fundamental trade-off between latent information density and decoder reconstruction quality, motivating the need for principled latent learning methods. Unified Latents (UL) is proposed as a partial answer: it co-trains a diffusion prior and a diffusion decoder, and links the encoder noise to the prior’s minimum noise level to provide an interpretable bound on the latent bitrate. As the introduction explicitly acknowledges, despite these efforts the broader question of how best to learn latents remains unresolved.
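The value of tying encoder noise to a known noise level is that it turns the latent bitrate into a quantity one can bound in closed form. The sketch below is our own illustration, not the paper's derivation: it uses the standard Gaussian channel capacity, 0.5·log2(1 + S/N) bits per dimension, assuming independent dimensions with known signal variance and encoder noise variance.

```python
import math

def latent_bitrate_bound(signal_var, noise_var, dims):
    """Upper bound (in bits) on the information a noisy latent can carry.

    Models each latent dimension as a Gaussian channel with signal
    variance `signal_var` and additive noise variance `noise_var`;
    capacity per dimension is 0.5 * log2(1 + S/N). If the encoder noise
    is pinned to the prior's minimum noise level, this bound becomes a
    direct function of a known model hyperparameter (illustrative
    simplification, not the exact mechanism in the paper).
    """
    return dims * 0.5 * math.log2(1.0 + signal_var / noise_var)
```

For example, a 16-dimensional latent with unit signal variance and unit noise variance carries at most 8 bits under this bound; lowering the noise raises the bound, mirroring the information-density vs. reconstruction trade-off.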

References

However, it remains unclear how best to learn such latents.

Unified Latents (UL): How to train your latents  (2602.17270 - Heek et al., 19 Feb 2026) in Section 1 (Introduction)