Unified Latents (UL): How to train your latents

Published 19 Feb 2026 in cs.LG and cs.CV | (2602.17270v1)

Abstract: We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves competitive FID of 1.4, with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.

Summary

The paper presents a novel co-training framework for deterministic encoders, diffusion priors, and decoders, enabling explicit latent bitrate control.
It employs a fixed noise schedule and reweighted ELBO loss to ensure stable training, superior sample quality, and high efficiency in ImageNet and Kinetics evaluations.
Empirical results demonstrate state-of-the-art FID and FVD scores, effectively addressing the reconstruction–generation trade-off in diffusion generative models.

Unified Latents: A Principled Approach to Latent Representation Learning for Diffusion Generative Models

Motivation and Framework

The paper "Unified Latents (UL): How to train your latents" (2602.17270) introduces a framework for learning latent representations for generative modeling, with a focus on diffusion-based architectures. The core motivation is to resolve the trade-off inherent in latent diffusion models: balancing information density in the latent space against ease of modeling and reconstruction fidelity. UL co-trains a deterministic encoder, a diffusion prior, and a diffusion decoder, and aligns the encoder's output noise with the prior's minimum noise level. This yields a training objective that gives interpretable control of the latent bitrate, in the form of an explicit upper bound on the information content, and addresses several shortcomings of prior approaches (VAE-style regularization, external pretrained encoders, arbitrary loss weighting).

Figure 2: Schematic overview of the Unified Latents model: encoder ($E_\theta$), prior latent diffusion model ($P_\theta$), and diffusion decoder model ($D_\theta$).

Methodology

Latent Encoding and Diffusion Prior

The encoder produces a deterministic latent vector $z_\mathrm{clean}$. Instead of learning the encoder noise (as in classic VAEs or LDMs), UL fixes the encoding precision by tying the encoder's output noise to the minimum noise level modeled by the diffusion prior. Concretely, the latent is forward-noised to a fixed log-SNR level $\lambda(0)$, and the prior diffusion model is trained to invert this noise. The prior regularization is computed as a weighted mean squared error across noise levels, yielding stable training and a tight estimate of the information flowing through the latent.
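The fixed-noise step can be illustrated with a minimal sketch. It assumes a variance-preserving parameterization in which $\alpha^2 = \operatorname{sigmoid}(\lambda)$ and $\sigma^2 = \operatorname{sigmoid}(-\lambda)$, and it uses the value $\lambda(0) = 5$ that this page reports elsewhere. The function names and the toy latent are hypothetical, not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def vp_coefficients(log_snr):
    """Return (alpha, sigma) with alpha^2 + sigma^2 = 1 and
    log(alpha^2 / sigma^2) = log_snr (variance-preserving assumption)."""
    alpha = math.sqrt(sigmoid(log_snr))
    sigma = math.sqrt(sigmoid(-log_snr))
    return alpha, sigma


def noise_latent(z_clean, log_snr, rng):
    """Forward-noise a clean latent to the given log-SNR level."""
    alpha, sigma = vp_coefficients(log_snr)
    return [alpha * z + sigma * rng.gauss(0.0, 1.0) for z in z_clean]


LAMBDA_0 = 5.0  # fixed encoder log-SNR reported on this page
alpha_0, sigma_0 = vp_coefficients(LAMBDA_0)

rng = random.Random(0)
z_clean = [0.5, -1.0, 2.0]  # toy latent, purely illustrative
z_0 = noise_latent(z_clean, LAMBDA_0, rng)
print(alpha_0, sigma_0, z_0)
```

Under these assumptions the noise standard deviation at $\lambda(0) = 5$ is roughly 0.08, so the noised latent stays close to the clean one while still having a well-defined, fixed precision.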

Figure 4: Unified Latents overview: An image is encoded to its latent, regularized via a diffusion prior, and decoded back via a diffusion decoder—controlling bitrate and ensuring regularization.

Decoder and Loss Weighting

The decoder is a diffusion model operating in image space, conditioned on both the noisy image and the latent. Its reconstruction loss uses a reweighted ELBO (sigmoid weighting), preferentially modeling high-frequency features in the decoder for perceptual quality. A loss factor $c_\mathrm{lf}$ lets the user tune the trade-off: increasing it yields better reconstructions at the cost of a higher latent bitrate.

Figure 6: Decoder weighting on the MSE loss, $w(\lambda_t) = c_\mathrm{lf}\cdot\operatorname{sigmoid}(b - \lambda_t)$, showing the noise levels at which information is preferentially modeled in the latent.
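The weighting $w(\lambda_t) = c_\mathrm{lf}\cdot\operatorname{sigmoid}(b - \lambda_t)$ can be evaluated directly, as in the sketch below. The default `loss_factor=1.6` matches the value quoted in the glossary; the default `bias=0.0`, the helper names, and the per-sample `weighted_mse` are assumptions for illustration only.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decoder_weight(log_snr, loss_factor=1.6, bias=0.0):
    """w(lambda_t) = c_lf * sigmoid(b - lambda_t)."""
    return loss_factor * sigmoid(bias - log_snr)


def weighted_mse(pred, target, log_snr, loss_factor=1.6, bias=0.0):
    """Mean squared error scaled by the noise-level-dependent weight.
    Hypothetical helper: the exact prediction target is not specified here."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return decoder_weight(log_snr, loss_factor, bias) * mse


# The weight decays from c_lf (very noisy) towards 0 (almost clean).
for lam in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(lam, decoder_weight(lam))
```

The weight is bounded above by `loss_factor`, equals half of it at `log_snr == bias`, and decreases monotonically as the log-SNR grows, so shifting `bias` moves the point at which noise levels start to be discounted.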

Stage-2 Base Model Training

After the encoder and decoder are trained, a base diffusion model is trained on the latents themselves, using a weighted loss. This stage enables increased batch/model sizes and improved sampling efficiency. The fixed noise schedule and explicit bitrate upper bound provide direct control over the latent information; the base model performance depends on optimal bitrate tuning for the downstream use-case.

Empirical Evaluation

Image and Video Generation Performance

UL is evaluated on ImageNet-512 and Kinetics-600. On ImageNet-512, UL achieves an FID of $1.4$ with high PSNR while using fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, UL sets a new state-of-the-art FVD of $1.3$. Together, these results point to both improved sample quality and better training efficiency. Pre-training efficiency curves show that, for a fixed compute budget, UL outperforms competing methods, including models trained on Stable Diffusion latents, semantically supervised latents, and discrete token-based compressors.

Latent Bitrate Tuning and Reconstruction Quality

Bitrate tuning experiments confirm the trade-off: low bitrate latents yield optimal generative FID for small models but lose fine reconstruction details. Larger models are less sensitive, achieving good FID even with higher bitrates.

Figure 1: Reconstruction quality vs loss factor: fine details such as small text are lost for low-bitrate latents.

Figure 3: Generation, reconstruction FID, and reconstruction PSNR vs bits/pixel, demonstrating the sensitivity of small models to latent bitrate and loss factor.
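The page reports latent bitrate both in bits per dimension (bpd) and in bits/pixel. Assuming bpd is the total bit budget divided by the number of latent elements, and bits/pixel is the same budget divided by the number of image pixels, the conversion can be sketched as follows. The latent shape and bpd value in the example are hypothetical placeholders, not numbers from the paper.

```python
def latent_bits_per_pixel(bpd, latent_shape, image_hw):
    """Convert bits per latent dimension into bits per image pixel.

    bpd          -- estimated bits per latent dimension
    latent_shape -- (height, width, channels) of the latent
    image_hw     -- (height, width) of the image in pixels
    """
    lat_h, lat_w, lat_c = latent_shape
    img_h, img_w = image_hw
    total_bits = bpd * lat_h * lat_w * lat_c
    return total_bits / (img_h * img_w)


# Hypothetical example: a 32x32x16 latent for a 512x512 image at 1.0 bpd.
example = latent_bits_per_pixel(1.0, (32, 32, 16), (512, 512))
print(example)
```

With this definition, halving the bpd or halving the channel count both halve the bits/pixel, which is why the two quantities need to be reported together with the latent shape to be comparable.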

Text-to-Image Generation and Qualitative Samples

UL-based models trained on large text-to-image datasets outperform pixel diffusion and Stable Diffusion baselines on perceptual quality and text alignment, particularly at low-loss-factor settings.

Figure 5: A selection of samples from a text-to-image model trained with Unified Latents, illustrating visual diversity and high-fidelity generation.

Figure 9: Generations from the text-to-image model with various textual prompts and guidance, demonstrating alignment and sample diversity.

Figure 11: Image generation quality (left) and text alignment (right) vs AutoEncoder Loss Factor for different model sizes.

Ablations and Architectural Sensitivity

Ablation studies show that removing the diffusion prior or relaxing the fixed encoder noise degrades gFID and causes instability, confirming the need for fixed noise and deterministic encoding. Regularization and decoder weighting are critical for stable training and high generative performance. Generation quality is relatively insensitive to the latent channel count and the spatial downsampling factor, which supports the robustness of the UL framework. L2 loss regularization in the decoder yields inferior results, indicating that diffusion-based regularization is essential for optimal bitrate control.

Theoretical and Practical Implications

UL provides interpretable, stable control of the latent information flow, mitigating posterior collapse and instability endemic to classic VAE training with powerful decoders. This approach makes the reconstruction/modeling trade-off explicit and tunable, enabling principled design of generative models for both high-fidelity reconstruction and efficient sampling. The explicit bitrate tuning aligns with rate-distortion theory and offers systematic tools for adapting model scale, encoder architecture, and downstream task requirements.

UL’s framework generalizes beyond images and videos; with a discrete diffusion decoder, latents for text or other modalities may be learned under similar principles. It provides new handles for scaling laws—connecting model capacity, dataset complexity, latent information density, and empirical sample quality.

Limitations and Future Directions

Despite these advantages, diffusion decoders remain more expensive to sample from than GAN decoders, and cross-model comparisons are complicated by differences in datasets and autoencoder training. Further work should formalize scaling laws for latent bitrate versus training cost, explore single-stage training with optimal weighting, and extend the approach to multimodal and discrete latent scenarios. Distillation for efficient sampling and improved generalization are promising avenues.

Conclusion

Unified Latents demonstrate that co-training a deterministic encoder, diffusion prior, and diffusion decoder establishes superior latent representations for generative modeling. This method achieves state-of-the-art efficiency and quality, provides explicit control over latent information, and defines a systematic approach for modeling the reconstruction–generation trade-off in latent diffusion models. As model and dataset scales grow, principled latent construction will become increasingly crucial for generative AI development.

Explain it Like I'm 14

What is this paper about?

This paper introduces Unified Latents (UL), a new way to “compress” images and videos into a smart summary (called a latent) that a computer can easily work with to generate new, high-quality images and videos. UL focuses on learning these summaries so they:

carry just the right amount of information, and
are especially easy for powerful image/video generators, called diffusion models, to use.

The big idea is to train the compressor and the generator together, using the same kind of model (diffusion) on both sides. This makes training simpler, more stable, and more efficient.

What questions are the researchers trying to answer?

In simple terms:

How should we compress images and videos so that a diffusion model can generate the best-looking results?
How can we control how much detail the compressed summary holds, without guessing lots of tricky settings?
Can we get better quality with less training compute than previous methods?

How does their method work?

First, some quick meanings in everyday language:

Latent: Think of a latent as a short, clever summary of an image or video—like a blueprint that keeps the important parts without every tiny detail.
Diffusion model: Imagine starting with a blurry, noisy picture and teaching a model how to gradually remove the noise to recover a clean image. That’s diffusion. It’s great at generating realistic images and videos.
Noise level: How “foggy” or “noisy” the data is. Training is spread across different fog levels, from very foggy to almost clear.
Bitrate: How much information the latent carries. Higher bitrate = more detail, but harder for the generator to learn; lower bitrate = simpler to learn, but might miss fine details.

What UL does:

A shared language for both sides: UL uses diffusion models both to regularize the latent (the “prior”) and to turn it back into an image/video (the “decoder”). Because both sides use diffusion, they work together smoothly.
Fixed tiny noise in the latent: The encoder makes a clean latent, then UL adds a small, fixed amount of noise to it. This aligns perfectly with the prior diffusion model’s “minimum noise level.” This trick simplifies the math and gives a clear, reliable limit on how many bits the latent can carry.
Simple, controllable training: The decoder uses a gentle weighting that emphasizes important, visible structure over imperceptible tiny details. Two simple knobs control the trade-off:
- Loss factor: nudges how much detail goes into the latent versus the decoder.
- Sigmoid bias: shifts which noise levels matter more during training (like focusing more on big shapes vs. tiny textures).
Two stages for best results:
- Stage 1: Train encoder + diffusion prior + diffusion decoder together so the latent is compact and the decoder can reconstruct well.
- Stage 2: Freeze the encoder and train a bigger “base” diffusion model on the latents with a friendly weighting. This improves final sample quality a lot.
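The two stages listed above differ mainly in which components are updated. A minimal configuration sketch, assuming the component names `encoder`, `prior`, `decoder`, and `base_model` (all hypothetical labels, not identifiers from the paper), might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: tuple    # components updated by the optimizer
    frozen: tuple = ()  # components used for inference only


# Stage 1: encoder, diffusion prior, and diffusion decoder are trained together.
STAGE_1 = Stage("stage_1", trainable=("encoder", "prior", "decoder"))
# Stage 2: only a frozen encoder is needed; a larger base model is trained on latents.
STAGE_2 = Stage("stage_2", trainable=("base_model",), frozen=("encoder",))


def components_in_use(stage):
    """All components that must be loaded for a stage."""
    return set(stage.trainable) | set(stage.frozen)


def validate(stage):
    """Reject configurations where a component is both trainable and frozen."""
    overlap = set(stage.trainable) & set(stage.frozen)
    if overlap:
        raise ValueError(f"{stage.name}: both trainable and frozen: {sorted(overlap)}")
    return stage


for stage in (STAGE_1, STAGE_2):
    validate(stage)
    print(stage.name, "trains", stage.trainable, "freezes", stage.frozen)
```

Keeping the decoder out of stage 2 reflects the point made in the text that only the frozen encoder is required there, which is what allows larger base models and batch sizes.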

Analogy:

The encoder writes a blueprint (latent).
A safety inspector (the prior) checks that the blueprint follows simple, predictable rules.
A builder (the decoder) uses the blueprint to rebuild the full image or video.
By adding a fixed amount of “fog” to the blueprint and matching the inspector’s rules to that fog level, everyone “speaks the same language.” This keeps the blueprint compact but useful.

What did they find, and why does it matter?

Main results:

Images: On ImageNet at 512×512 resolution, UL achieves excellent FID around 1.4, while using fewer training FLOPs than models trained on standard Stable Diffusion latents. FID is a score for how realistic generated images look; lower is better.
Videos: On the Kinetics-600 video dataset, UL sets a new state of the art with FVD around 1.3. FVD is like FID, but for videos.
Efficiency: For the same amount of training compute, UL beats other methods on the quality-vs-cost curve. That means better results for less training time.
Control: By tuning just two simple settings (loss factor and bias), you can choose how much detail to store in the latent. That lets you balance “sharp reconstructions of a particular image” versus “easier, better generation overall.”
Robustness: UL works well across different latent sizes and shapes, and across images and videos. It’s not overly sensitive to specific design choices.

Why it matters:

Training large image and video models is expensive. UL helps get better results with less compute.
UL makes the trade-off between compression and generation clear and controllable, which is valuable when scaling to bigger models and datasets.

What does this mean for the future?

Better foundations: UL shows that if you design the compressed representation specifically for diffusion models, you can train faster and generate higher-quality images and videos.
Easy scaling: The simple controls (loss factor and bias) make it practical to choose the right latent bitrate for different model sizes and budgets.
Broader use: Although this paper focuses on images and videos, the same idea—learning compact, diffusion-friendly latents—could extend to other data types.
Practical note: Diffusion decoders are slower to sample than GAN decoders, so deploying UL at scale benefits from extra steps like distillation to speed things up.

In short, Unified Latents give a cleaner, more efficient way to learn the summaries that diffusion models use, leading to better-looking results with less training cost and with simple dials to tune detail versus learnability.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future work.

Lack of a principled procedure to choose the decoder’s loss factor (c_lf) and sigmoid bias (b): hyperparameters control latent bitrate and the reconstruction–modeling split, but only empirical sweeps are provided; no algorithmic tuning rule or adaptive schedule tied to base-model capacity, dataset, or compute budget is given.
Fixed encoder precision choice is ad hoc: the final log-SNR λ(0)=5 is selected without an analysis of how to set or adapt it across tasks, datasets, or model sizes; the impact of λ(0) on the tightness of the bitrate bound and on training stability is not characterized.
Tightness and calibration of the “upper bound on latent bitrate” are not evaluated empirically: no experiments compare the bound to measured mutual information or compression-rate proxies, nor is the effect of retraining the prior as a weighted base model on this bound analyzed.
Single-stage training viability is unresolved: the paper states that a two-stage approach outperforms joint training but defers details to an appendix; a clear recipe and head-to-head comparison (quality, stability, compute) for single-stage vs. two-stage pipelines is missing.
Decoder posterior collapse avoidance lacks guarantees: increasing c_lf mitigates collapse empirically, but conditions under which collapse reappears (e.g., larger decoders, different schedules) and formal criteria to detect/prevent it are not established.
Generalization across domains is underexplored: AE trained on ImageNet vs. internal TTI datasets shows mixed rFID behavior; systematic cross-domain transfer studies (e.g., web-scale → ImageNet/ImageNet-V2/ImageNet-Real; TTI → COCO) are not presented.
Scaling laws are not provided: while the discussion calls for scaling laws to choose optimal latent bitrate for a given base model and compute, no empirical or theoretical scaling relation (e.g., bitrate vs. model FLOPs vs. FID/FVD) is reported.
Compute accounting is incomplete/inconsistent: training-cost plots explicitly exclude autoencoder training cost; it is unclear whether convolution FLOPs are counted (text mentions linear and attention FLOPs); end-to-end training and inference cost (base + decoder) is not compared against baselines on equal sampling steps.
Sampling speed–quality trade-offs are missing: no speed-quality curves (steps vs. FID/FVD) for base and diffusion decoder are shown, and comparisons to GAN decoders or distilled decoders on equivalent runtime budgets are absent.
Diffusion decoder cost remains high and undistilled: the paper acknowledges the decoder is ~10× costlier than GAN decoders; no attempt is reported to distill the diffusion decoder (or the combined base+decoder) and measure retained quality.
Text-to-image evaluation is limited and non-standard: results are on internal datasets with FID and CLIP scores only, without public benchmarks (e.g., COCO, PartiPrompts, DrawBench), human evaluation, guided vs. unguided comparisons, or alignment metrics beyond CLIP.
Decoder conditioning for TTI is speculative: the claim that text conditioning in the decoder might improve alignment is not tested; ablations on conditioning pathways, cross-attention placement, or shared vs. separate text encoders are absent.
Video scaling limits are untested: experiments use 16×128×128 clips with 5-frame conditioning; behavior at higher spatial resolutions, longer temporal horizons, and different conditioning regimes (e.g., unconditioned, text-conditioned, longer contexts) is not evaluated.
Robustness and OOD behavior are not studied: sensitivity to corruption, adversarial perturbations, or domain shifts (e.g., ImageNet-C, ImageNet-A) for both reconstruction and generation is not assessed.
Evaluation metric limitations are acknowledged but not addressed: rFID uses same-sample references, and FID’s high-frequency sensitivity is noted; alternative perceptual metrics (LPIPS, DISTS), human studies, or task-based evaluations are not provided.
Theoretical justification for latent-channel insensitivity is missing: empirical results show gFID is flat across large changes in latent channels and spatial size (within ranges), but there is no analysis explaining why, or guidelines for selecting latent shape under fixed compute.
Architectural sensitivity is only partially ablated: patching harms the base model, but systematic studies of patch size, convolutional stacks, UNet vs. ViT, UVit depth/width, and decoder/base regularization (beyond dropout) are lacking.
Learned-variance encoders are unstable but unexplained: the paper reports instability with learned encoder variance; it remains open whether alternative parameterizations (e.g., variance floors, log-variance clipping, normalizing flows, VAEs with hierarchical posteriors) can stabilize training and improve performance.
Prior/base weighting mismatch impacts theory: the prior is trained with unweighted ELBO to preserve the bitrate bound, but the base model is retrained with a sigmoid weighting; the theoretical implications for the bound and the encoded information allocation are not analyzed.
Information allocation between latent and decoder is heuristic: there is no framework quantifying how much signal should reside in the latent vs. be modeled by the decoder for a given base capacity and dataset; dynamic or learned allocation mechanisms are not explored.
Multi-modal extensions are not demonstrated: while the method is claimed to be broadly applicable (e.g., discrete decoders for text), no experiments on discrete data (text, tokens), audio, or cross-modal latents are provided.
Fairness and comparability of baselines need strengthening: some comparisons rely on internal implementations (e.g., SD-latent baselines with architectural changes), and public baselines’ FLOPs are normalized with assumptions (e.g., token count corrections); standardized protocols or re-runs on public code are not included.
Training dynamics of two simultaneous diffusion models are opaque: optimization difficulties are mentioned, and an L2-regularized decoder alternative is briefly tested; more systematic studies on optimizer choice, learning-rate schedules, update frequency, and alternating vs. joint updates are not provided.
Choice of diffusion schedules is fixed: the method uses variance-preserving schedules; effects of alternative schedules (VP vs. VE, logSNR ranges, noise distributions) on bitrate bounds, stability, and sample quality are not analyzed.
Long-term memorization and privacy are not addressed: given higher-fidelity reconstructions (higher PSNR) can encode more instance-specific details, risks of memorization or leakage (and mitigation via the bitrate control) are not evaluated.
Compression perspective is unexplored: although a bitrate bound is claimed, comparisons to learned compression baselines (rate–distortion curves, bpp vs. PSNR/MS-SSIM) and whether UL latents serve as practical codecs are not investigated.
End-to-end invertibility and determinism are not guaranteed: the diffusion decoder models a distribution; the trade-off between exact reconstructions (high PSNR) and generative diversity is not formalized, and knob(s) to control this at inference (e.g., temperature, guidance) are not studied.
Sensitivity to dataset curation and high-frequency statistics is unresolved: rFID differences when training AEs on different data suggest sensitivity to subtle statistics; there is no analysis of which frequency bands or features are emphasized by the decoder weighting and how that impacts downstream generation.
Public reproducibility is limited: key TTI datasets are internal; hyperparameter tables, training durations, and code for FLOP counting and metric computation (e.g., rFID protocol) are not provided, hindering independent verification.

Practical Applications

Practical Applications of Unified Latents (UL)

Below are actionable, real-world applications derived from the paper’s findings, methods, and innovations. Each item includes sectors, potential tools/products/workflows, and feasibility notes.

Immediate Applications

UL-powered pretraining efficiency upgrades for image/video generation
- Sectors: software, media/entertainment, gaming, academia
- What to deploy: Replace Stable Diffusion (SD) latents with UL encoders + diffusion prior + diffusion decoder; adopt the two-stage training (prior+decoder, then base) with fixed encoder noise and unweighted prior ELBO; use FID/FVD vs FLOPs tracking
- Why: Better quality per training FLOP (e.g., competitive ImageNet-512 FID; SOTA Kinetics-600 FVD) and interpretable control of latent bitrate
- Dependencies/Assumptions: Access to GPU/TPU clusters; integration into existing training pipelines; diffusion decoders are more expensive to sample than GAN decoders; stage-2 base training recommended
Budget-aware latent bitrate tuning for model training
- Sectors: MLOps, academia, AI research labs
- What to deploy: Hyperparameter sweeps over decoder sigmoid bias and loss factor (c_lf) to target optimal bits-per-pixel/bpd for a given base model size; monitor gFID/rFID/PSNR and bpd to select operating points
- Why: Smaller base models benefit from lower bitrate latents; larger models tolerate higher bitrate while preserving generation quality
- Dependencies/Assumptions: Reliable metric logging (FID/FVD/PSNR/bpd); careful dataset alignment; automated sweep tooling
Synthetic data generation at state-of-the-art video/image quality for augmentation
- Sectors: robotics, autonomous driving, AR/VR, security/surveillance, sports analytics
- What to deploy: Use UL-trained base models to produce synthetic sequences/images for data augmentation and benchmarking; leverage conditional setups (e.g., frames-in, frames-out for video)
- Why: Improved FVD/FID at lower training cost helps expand datasets affordably while preserving realistic high-frequency details where desired
- Dependencies/Assumptions: Domain gap must be managed (potential fine-tuning); label integration; guidance for text/video conditioning where applicable
Generative dataset compression and archival using controllable UL latents
- Sectors: cloud storage, ML data management, media libraries
- What to deploy: Encode large datasets into UL latents with a target bitrate; store latents and reconstruct on demand via diffusion decoder; track bpd upper bounds as auditability for information content
- Why: Reduces storage while maintaining reconstruction quality (PSNR/rFID), with principled bitrate control via c_lf and loss weighting
- Dependencies/Assumptions: Decoder sampling cost is high; reconstructions must meet regulatory/quality requirements; performance sensitive to training dataset mismatches
Production text-to-image pipelines with improved quality-per-budget
- Sectors: marketing/creative studios, product design, e-commerce
- What to deploy: Train UL autoencoders on in-domain text-to-image corpora; deploy base models sized to budget; tune loss factor for perceptual quality vs text alignment; use guidance at inference
- Why: UL samples show better gFID than pixel diffusion and SD latents, with slightly better CLIP alignment; smaller models benefit most
- Dependencies/Assumptions: Dataset curation and prompt guidance; possible decoder-side text conditioning for alignment; legal/licensing compliance
Standardized reporting of training cost and latent bitrate in model releases
- Sectors: academia, policy, open-source ML
- What to deploy: Add FLOP-based training cost, latent bpd, rFID/gFID/FVD, and PSNR to model cards and benchmarks; use UL’s interpretable bitrate bound to improve transparency
- Why: Enables fair comparisons and energy-aware decision-making across methods
- Dependencies/Assumptions: Community adoption; consistent FLOP accounting; acceptance of rFID/gFID conventions
Edge/server split for content workflows (encode locally, decode centrally)
- Sectors: mobile apps, creator tools
- What to deploy: Lightweight UL encoder on device; upload small latent tensors; server-side diffusion decoding; optional post-processing
- Why: Reduces upstream bandwidth and storage while keeping high-quality output centralized
- Dependencies/Assumptions: On-device encoder feasibility; server compute for decoding; privacy controls and consent
Targeted control of high-frequency detail for privacy or aesthetics
- Sectors: privacy/compliance, content moderation, design
- What to deploy: Adjust decoder weighting and c_lf to suppress or preserve high-frequency details (e.g., small text, fine textures) during reconstruction or generation
- Why: Practical control over information density can help anonymize sensitive features or achieve desired aesthetics
- Dependencies/Assumptions: Human perception may differ from FID/PSNR; require policy guidance and domain validation
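The budget-aware bitrate tuning item above amounts to sweeping the loss factor and sigmoid bias, logging gFID/PSNR/bpd, and picking an operating point. One possible selection rule, sketched under the assumption that the goal is the lowest gFID subject to a minimum reconstruction PSNR, is shown below. The record type, the selection rule, and every number in the example are hypothetical placeholders and not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepResult:
    loss_factor: float   # decoder loss factor c_lf
    sigmoid_bias: float  # decoder sigmoid bias b
    bpd: float           # estimated latent bits per dimension
    gfid: float          # generation FID (lower is better)
    psnr: float          # reconstruction PSNR in dB (higher is better)


def select_operating_point(results, min_psnr):
    """Return the run with the lowest gFID among runs meeting the PSNR floor.
    Ties on gFID are broken in favor of the lower bitrate. Returns None if
    no run qualifies."""
    eligible = [r for r in results if r.psnr >= min_psnr]
    if not eligible:
        return None
    return min(eligible, key=lambda r: (r.gfid, r.bpd))


# Placeholder sweep values for illustration only (NOT measured results).
results = [
    SweepResult(1.0, 0.0, 0.05, 2.0, 25.0),
    SweepResult(1.5, 0.0, 0.08, 1.8, 27.0),
    SweepResult(2.0, 0.0, 0.12, 2.1, 29.0),
]
best = select_operating_point(results, min_psnr=26.0)
print(best)
```

Other selection rules are equally plausible, for example capping bpd for a small base model and then maximizing PSNR; the sketch only shows how the logged metrics could feed an automated choice.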

Long-Term Applications

Generative codecs for images and video (UL-based streaming)
- Sectors: media/telecom, cloud streaming, social platforms
- What to build: Standardized UL-style encoders with fast, distilled decoders for real-time playback; adaptive bitrate control using UL’s interpretable bpd
- Why: Potential to replace or complement traditional codecs with learned compression at higher perceptual quality
- Dependencies/Assumptions: Efficient decoders via distillation/acceleration; standardization and hardware support; robust QoS in diverse conditions
Multimodal unified latents (text, audio, 3D, sensor data)
- Sectors: foundation models, AR/VR, creative tooling
- What to build: Cross-modal UL autoencoders with diffusion priors/decoders (continuous or discrete) to compress and condition across modalities in a single generative stack
- Why: Harmonized latent spaces can simplify training and enable richer multimodal generation
- Dependencies/Assumptions: New architectures and loss designs; large-scale datasets; careful ELBO weighting for discrete decoders
Carbon-aware training and efficiency policy benchmarks
- Sectors: policy/regulation, sustainability, enterprise AI governance
- What to build: Industry benchmarks that tie reported FID/FVD to training zettaflops and latent bpd; guidelines for energy budgets and disclosure
- Why: UL provides clear levers for compute–quality trade-offs, supporting responsible AI development
- Dependencies/Assumptions: Broad stakeholder buy-in; reliable measurements and audits; legal frameworks
Privacy-preserving synthetic data releases with controllable information content
- Sectors: healthcare, finance, public sector
- What to build: Pipelines that cap latent bitrate to reduce memorization risk; integrate differential privacy; release UL latents or generated datasets with documented bpd bounds
- Why: Enables data sharing while mitigating privacy risks
- Dependencies/Assumptions: Formal privacy guarantees required; domain-specific validation; regulatory acceptance
Autoencoder-as-a-service (AEaaS) for domain-specific generative stacks
- Sectors: SaaS/MLOps, cloud providers
- What to build: Managed UL autoencoder/prior/decoder training and hosting; customers get latents and base models tailored to their data and budgets
- Why: Reduces integration friction and ensures expert tuning of bitrate vs quality
- Dependencies/Assumptions: Secure data handling; SLAs for reconstruction/generation quality; scalable compute
Scaling-law-driven autotuning of latent bitrate and base model size
- Sectors: MLOps, academia
- What to build: Predictive models that select c_lf, sigmoid bias, channels/downsampling based on target budget and task; integrate with training orchestration
- Why: Automates a critical trade-off and avoids costly manual sweeps
- Dependencies/Assumptions: Robust, generalizable scaling laws across datasets and architectures
Fast decoders via distillation or alternative generative families
- Sectors: real-time applications, edge computing
- What to build: Distilled diffusion decoders, or flow/GAN-based decoders that preserve UL’s bitrate control while reducing sampling cost
- Why: Addresses a key limitation (decoder cost) enabling broader deployment
- Dependencies/Assumptions: Maintain sample quality and avoid mode collapse; hardware-aware acceleration
On-device generative compression for robots and IoT
- Sectors: robotics, autonomous systems, industrial IoT
- What to build: UL encoders on sensors/robots to compress streams; central servers decode for monitoring/simulation; use bitrate control for bandwidth budgeting
- Why: Reduces transmission load while preserving reconstructable details for situational awareness and training
- Dependencies/Assumptions: Resource-constrained deployments require efficient encoders; low-latency decoding; safety and reliability testing
Federated generative training with UL latents
- Sectors: mobile, privacy-first analytics
- What to build: Clients encode data locally to UL latents; central aggregation trains base models without raw data; secure aggregation and audit trails via bpd reporting
- Why: Reduces data movement and improves privacy while enabling high-quality generative training
- Dependencies/Assumptions: Robust federated protocols; trust and compliance; consistent latent distributions across clients
Standardized benchmarking suites and open-source references for UL
- Sectors: academia, industry consortia
- What to build: Public implementations with reproducible configs, dataset splits, and metric reporting (FID/FVD/rFID/gFID/PSNR/bpd/FLOPs)
- Why: Facilitates fair comparison and accelerates community adoption
- Dependencies/Assumptions: Funding and maintenance; legal access to datasets; contribution governance
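
The bandwidth budgeting mentioned for on-device compression above reduces to simple arithmetic once a latent bitrate is known. The sketch below is illustrative only: the function names, latent shape, frame size, and frame rate are hypothetical values chosen for the example, not figures from the paper. It assumes the bitrate is reported as an upper bound in bits per latent dimension (bpd) and that every frame is encoded independently.

```python
import math


def bits_per_pixel(bpd, latent_shape, image_hw):
    """Convert bits per latent dimension into bits per image pixel.

    bpd          -- upper bound on bits per latent dimension (assumed given)
    latent_shape -- hypothetical latent tensor shape, e.g. (H', W', C)
    image_hw     -- (height, width) of the source image in pixels
    """
    latent_dims = math.prod(latent_shape)
    height, width = image_hw
    return bpd * latent_dims / (height * width)


def stream_bandwidth_bps(bpd, latent_shape, frames_per_second):
    """Upper bound on bits per second when each frame is encoded separately."""
    return bpd * math.prod(latent_shape) * frames_per_second


# Hypothetical setup: 512x512 frames, 32x32x16 latents, 0.5 bpd, 10 fps.
bpp = bits_per_pixel(0.5, (32, 32, 16), (512, 512))
bandwidth = stream_bandwidth_bps(0.5, (32, 32, 16), 10)
print(bpp, bandwidth)
```

Because the bound is per latent dimension, halving the number of latent channels at a fixed bpd halves both numbers, which is the lever a bandwidth budget would act on.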


Glossary

Auto-encoder: A neural network that compresses inputs into latents and reconstructs them; here, it provides the latent space for diffusion modeling. "Previous work like Stable Diffusion uses a auto-encoder that is trained on another dataset than ImageNet."
base model: The generative diffusion model trained in stage 2 on latents to produce samples, typically with reweighted objectives. "Because only a frozen encoder is required during this stage, the base model size and batch size can be much larger than in stage 1."
bits per dimension (bpd): A measure of information content per latent dimension, indicating compression efficiency and capacity. "Additionally, because our models provide an upper bound on latent information, we report the estimated bits per dimension (bpd) in the latent space."
bits/pixel: Latent bitrate measured per image pixel, used to relate information content to reconstruction and generation quality. "Image quality for various latent bitrates (FID vs bits/pixel) for a small model variant."
conditioning: Providing auxiliary inputs (e.g., latents) to guide a model’s predictions. "The key distinction is that the decoder network $D_\theta = \hat{x}(x_t, z_0, \theta)$ conditions on both the noisy data $x_t$ and the latent $z_0$."
decoder weighting: The noise-level-dependent weighting applied to the decoder’s diffusion ELBO to emphasize perceptually important frequencies. "Decoder weighting [...] $w(\lambda_t) = c_\mathrm{lf} \cdot \operatorname{sigmoid}(b - \lambda_t)$ [...] ($c_\mathrm{lf} = 1.6$ in this case) and which noise levels are discounted."
destruction process: The forward noising process in diffusion models that corrupts data according to a schedule. "Consider a data distribution $q(x)$ and a destruction process $x_t = \alpha(t) x + \sigma(t) \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$."
diagonal noise: Gaussian noise with a diagonal covariance used to sample latents at a fixed precision. "Thus, the latent $z_0$ is sampled using a learned mean and fixed diagonal noise."
diffusion decoder: A diffusion model used as the decoder to reconstruct data from latents. "This $z_0$ is then used by a diffusion decoder to reconstruct the image."
diffusion models: Generative models that learn to reverse a gradual noising process to sample from complex continuous distributions. "Diffusion models have become remarkably successful for image, video, and audio generation."
diffusion prior: A diffusion model that serves as the latent prior, regularizing the encoder by modeling latents with a controlled noise level. "A diffusion prior models the path from pure noise $z_1$ to a slightly noisy latent $z_0$."
distillation: A procedure to reduce sampling cost or model size by transferring knowledge to a more efficient model. "Without an additional distillation step for the decoder, the computational cost of using Unified Latents is significantly higher than a standard LDM."
ELBO (Evidence Lower Bound): A variational objective that lower-bounds the log-likelihood, used to train latent-variable models. "Evidence Lower Bound (ELBO) on the log-likelihood when using a latent variable $z_0$:"
encoder entropy term: An additional term accounting for the entropy of the encoder distribution in diffusion-prior VAEs, often causing instability. "requires a separate encoder entropy term $\mathbb{E}_{q(z_0 | x)} \log q(z_0 | x)$ that introduces training instability."
FID (Fréchet Inception Distance): A metric for image generation quality based on distributional distance in a feature space. "On ImageNet-512, our approach achieves competitive FID of $1.4$, with high reconstruction quality (PSNR) while requiring fewer training FLOPs..."
FLOPs (floating-point operations): A compute metric counting floating-point operations, used to compare training and inference cost. "For computational cost, we count FLOPs for all linear projections and attention operations."
forward-noised: Applying the diffusion forward process to a latent to reach a target noise level/time. "the encoder predicts a single deterministic latent $z_{\text{clean}}$, which is then forward-noised to time $t = 0$."
FVD (Fréchet Video Distance): A metric for video generation quality based on distributional distance of video features. "On Kinetics-600, we set a new state-of-the-art FVD of $1.3$."
gFID: FID computed on samples from the base model (generation FID), used to evaluate generative performance. "When sampling from a base model we denote the FID as gFID."
GAN: Generative Adversarial Network; used in some latent diffusion approaches to train autoencoders with adversarial losses. "The original Latent Diffusion Model~\citep{rombach2022highresolution} uses a GAN-trained autoencoder with channel-bottlenecked latents..."
gigaflops (GFlops): A measure of model complexity or cost equal to billions of FLOPs. "Model complexity (gigaflops per evaluation), ImageNet-512"
guidance: A sampling technique (e.g., classifier-free guidance) that improves alignment or fidelity by steering generation. "text-alignment can be easily improved by applying guidance."
Kinetics-600: A large-scale video dataset for action recognition, used here to evaluate video generation. "On Kinetics-600, we set a new state-of-the-art FVD of $1.3$."
KL divergence (KL term): A divergence used to regularize the latent distribution toward a prior; central in VAE objectives. "Since the decoder lacks a likelihood-based loss, the weight of the KL term must be set manually, making it difficult to reason about the information content of the latents."
latent bitrate: The information rate of the latent representation, controlling the reconstruction–generation trade-off. "we obtain a simple training objective that provides a tight upper bound on the latent bitrate."
latent channels: The number of channels in the latent tensor, determining information capacity and reconstruction potential. "The number of latent channels therefore determines the information capacity: fewer channels yield easier-to-model latents at the cost of reconstruction quality, while more channels enable near-perfect reconstruction but require greater modeling capacity."
Latent Diffusion Model (LDM): A diffusion framework operating in a learned latent space rather than pixels. "The original Latent Diffusion Model~\citep{rombach2022highresolution} uses a GAN-trained autoencoder..."
latent representations: Compact encodings of data used to enable efficient modeling and sampling at higher resolutions. "We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model."
logsnr schedule: A parameterization of the noise schedule using the log signal-to-noise ratio, controlling diffusion noise levels. "The level of destruction is defined by the logsnr schedule $\lambda(t) = \log( \alpha_t^2 / \sigma_t^2 )$."
loss factor: A scalar that up-weights the decoder’s loss (equivalently down-weights the KL term) to prevent posterior collapse and tune bitrate. "For that reason, we up-weigh the decoder loss with a loss factor (which is equivalent to down-weighting the KL-term)."
minimum noise level: The smallest noise level in the schedule; aligning the prior to it simplifies the KL and stabilizes training. "Align the prior diffusion model with the minimum noise level."
mode-collapsing (mode collapse): A GAN training pathology where generated outputs lose diversity by collapsing to few modes. "However, the mode-collapsing nature of GAN training might help this class of models producing better looking images with better rFID scores."
patching: Processing inputs by dividing them into patches to reduce compute and memory costs. "Our encoder and decoder models use 2x2 patching to save compute."
posterior collapse: A VAE failure mode where a powerful decoder ignores the latents, reducing them to carrying little or no information. "Even with equal weighting, literature has shown that it is difficult to use the latent space in VAEs when the decoder is powerful, a phenomenon referred to as posterior collapse"
PSNR (Peak Signal-to-Noise Ratio): A reconstruction quality metric comparing originals and reconstructions; higher is better. "We also use PSNR (Peak Signal-to-Noise Ratio) to measure how closely reconstructions match their originals."
re-weighted ELBO: An ELBO modified with a noise-level-dependent weighting to favor perceptually important regions. "This re-weighted ELBO formulation has the added benefit that the weighting is invariant to the choice of schedule $\lambda(t)$ "
rFID: FID computed on autoencoder reconstructions (reconstruction FID), indicating fidelity of reconstructions. "For reconstruction we use the term rFID and use the same samples from the dataset to compute reconstructions and the FID references."
sigmoid loss: A specific weighting function over noise levels for diffusion training, typically sigmoid(λ − b). "we use the sigmoid loss \citep{kingma2023understandingdiffusion_vdmplus, hoogeboom2024sid2}, $w(\lambda(t)) = \operatorname{sigmoid}(\lambda(t) - b)$."
Unified Latents (UL): The paper’s method that co-trains a deterministic encoder with a diffusion prior and a diffusion decoder, linking encoder noise to prior precision. "We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model."
UNet (SD): A U-Net architecture variant used in Stable Diffusion baselines for modeling latents. "The UNet (SD) baseline is a small model that uses an additional convolution stack instead of patching the SD latents."
UVit: A U-shaped Vision Transformer architecture used as the diffusion decoder in this work. "The decoder is a UVit model \citep{hoogeboom2024sid2} with channel counts [128, 256, 512]..."
variance-preserving noise schedule: A diffusion schedule with α² + σ² = 1, keeping variance constant across noise levels. "Additionally, we use $\alpha_t^2 + \sigma_t^2 = 1$ for convenience."
ViT (Vision Transformer): A transformer architecture operating on visual patches, used here for the prior and base models. "The prior model is a single level ViT with 8 blocks and 1024 channels."
zettaflops: A unit of compute equal to 10²¹ FLOPs, used to measure large-scale training cost. "Training cost (zettaflops per model)"
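
Several of the glossary entries (logsnr schedule, variance-preserving noise schedule, destruction process, decoder weighting) fit together in a few lines. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, scalars stand in for tensors, and the bias value `0.0` is a placeholder. Only $c_\mathrm{lf} = 1.6$ and the form $c_\mathrm{lf} \cdot \operatorname{sigmoid}(b - \lambda)$ are taken from the quoted decoder-weighting passage. Under the constraint $\alpha^2 + \sigma^2 = 1$ with $\lambda = \log(\alpha^2 / \sigma^2)$, it follows that $\alpha^2 = \operatorname{sigmoid}(\lambda)$ and $\sigma^2 = \operatorname{sigmoid}(-\lambda)$.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def alpha_sigma(logsnr):
    """Variance-preserving coefficients for a given log-SNR value.

    alpha^2 = sigmoid(logsnr) and sigma^2 = sigmoid(-logsnr),
    so alpha^2 + sigma^2 = 1 and log(alpha^2 / sigma^2) = logsnr.
    """
    alpha = math.sqrt(sigmoid(logsnr))
    sigma = math.sqrt(sigmoid(-logsnr))
    return alpha, sigma


def destroy(x, logsnr, rng):
    """Forward-noise a scalar x: alpha * x + sigma * eps, eps ~ N(0, 1)."""
    alpha, sigma = alpha_sigma(logsnr)
    eps = rng.gauss(0.0, 1.0)
    return alpha * x + sigma * eps


def decoder_weight(logsnr, bias=0.0, c_lf=1.6):
    """Decoder weighting c_lf * sigmoid(bias - logsnr).

    bias=0.0 is a placeholder assumption; c_lf=1.6 is the quoted value.
    """
    return c_lf * sigmoid(bias - logsnr)


rng = random.Random(0)
alpha, sigma = alpha_sigma(2.0)
print(alpha ** 2 + sigma ** 2)        # variance-preserving check
print(destroy(1.0, 2.0, rng))         # one noisy sample at logsnr = 2
print(decoder_weight(-4.0), decoder_weight(4.0))
```

With this form, noisier levels (low log-SNR) receive a weight close to `c_lf`, while nearly clean levels (high log-SNR) are discounted toward zero. The sign inside the sigmoid differs between the two quoted passages in the glossary; the sketch follows the decoder-weighting quote.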
