Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
Abstract: Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling: they discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution rather than the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules, allowing the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order in which information is revealed is critical, and we use this analysis to explain the differences between REPA distillation applied to the tokenizer versus the diffusion model, between conditional and unconditional generation, and the relationship between tokenizer reconstruction quality and diffusability. Applied to ImageNet, Latent Forcing achieves a new state of the art for diffusion-transformer-based pixel generation at our compute scale.
Explain it Like I'm 14
Latent Forcing: A simple explanation
What is this paper about?
This paper is about making AI image generators faster and better without throwing away important image details. The authors introduce a new method called Latent Forcing that keeps the benefits of “latent” methods (which are efficient) while directly generating full images (which is cleaner and more accurate).
Think of drawing a picture: you usually sketch the rough shapes first, then add fine details. Latent Forcing teaches an image model to do something similar—first figure out the big-picture structure, then fill in the tiny pixel details.
What are the main questions the paper asks?
The paper focuses on three simple questions:
- Can we generate images efficiently without “compressing away” important information first?
- Does the order in which the AI uncovers information matter (big structure first, tiny details later)?
- Can we get the best of both worlds: the speed of latent methods and the accuracy of direct pixel methods?
How does the method work? (In everyday terms)
Many modern image AIs work in two different spaces:
- Pixel space: the actual image pixels (like the colored dots on your screen).
- Latent space: a compact, smarter summary of the image (like a rough sketch or notes about the image’s structure).
Traditional “latent diffusion” compresses images into latents, learns to generate in that simpler space, and then uses a separate decoder to turn latents back into pixels. This can be efficient, but it can also lose important details (like faces or text) and requires extra moving parts.
Latent Forcing takes a different route:
- It trains a single model to work with both latents and pixels at the same time.
- It uses two “time dials” (think of them as blur-to-sharp sliders), one for latents and one for pixels.
- During generation, it turns the latent dial from blurry to clear first (to get the big structure right), and then turns the pixel dial (to add crisp details).
- The latent is just a “scratchpad.” Once the final image is ready, the latent is thrown away.
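The two-phase flow described above can be sketched in a few lines of toy code. The "model" here is a stand-in that simply nudges state toward a fixed target; the point is the order of operations (latent dial first, pixel dial second) and that the latent scratchpad is discarded at the end. Targets, step sizes, and shapes are illustrative, not the paper's.

```python
# Toy sketch of latents-first sampling. t moves from noisy to clean in two
# phases; a stand-in "denoiser" replaces the learned model.

def denoise_step(state, target, step=0.25):
    # stand-in for one model-predicted Euler update toward the clean signal
    return [s + step * (g - s) for s, g in zip(state, target)]

target_latent = [1.0, -1.0]       # pretend "big structure"
target_pixels = [0.5, 0.2, 0.8]   # pretend final image content

latent = [2.0, 2.0]               # fixed pseudo-noise for reproducibility
pixels = [-2.0, 2.0, -2.0]

for _ in range(12):               # phase 1: clarify the latent scratchpad
    latent = denoise_step(latent, target_latent)
for _ in range(12):               # phase 2: clarify pixels (conceptually
    pixels = denoise_step(pixels, target_pixels)  # conditioned on the latent)

image = pixels                    # the latent is thrown away; only pixels remain
```

In the real method, phase 2 is conditioned on the generated latent through the shared transformer rather than run independently as it is in this sketch.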
Analogy:
- Imagine building a Lego city. First, you place the big blocks to outline the streets and buildings (latents). Then you add windows, doors, and small decorations (pixels). Two dials control the process: one for when the big blocks become clear, one for when the tiny pieces come into play.
Under the hood (kept simple):
- The model is a standard diffusion transformer (a popular architecture for image generation).
- The authors only add a second time embedding (the second “dial”) and optionally split the last few layers so the model can specialize outputs for latents and pixels. These are tiny changes.
- They test different “schedules” (ways to turn the two dials over time), including:
- Cascaded: fully clarify latents first, then pixels.
- Joint: clarify both, with latents moving ahead of pixels.
- They measure quality with FID, a common image-quality score (lower is better).
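The schedule variants listed above can be pictured as simple mappings from a single global time to per-modality times. The functions below are sketches of the two families' shapes, not the paper's exact parameterizations; names and the `lead` offset are illustrative assumptions.

```python
# "Cascaded" fully denoises latents before touching pixels; "joint" keeps the
# latent dial a fixed lead ahead of the pixel dial. t = 1 is pure noise,
# t = 0 is clean.

def clamp01(x):
    return min(1.0, max(0.0, x))

def cascaded(t):
    """Latents clean by t = 0.5; pixels untouched until t = 0.5."""
    return clamp01(2 * t - 1), clamp01(2 * t)

def joint_offset(t, lead=0.25):
    """Latents run `lead` ahead of pixels on the same trajectory."""
    return clamp01(t - lead), clamp01(t)

# In both schedules the latent dial is never noisier than the pixel dial:
for i in range(11):
    t = i / 10
    for schedule in (cascaded, joint_offset):
        t_latent, t_pixel = schedule(t)
        assert t_latent <= t_pixel
```

Either mapping preserves the paper's key property: at every point of sampling, the latents carry at least as much clean information as the pixels.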
What did they find, and why is it important?
Here are the key takeaways:
- The order matters a lot. Generating the latent structure first and the pixels second clearly improves image quality over doing pixels first or doing pixels alone.
- It works across different latent types. Whether the latent features come from DINOv2 or Data2Vec2 (two popular self-supervised vision models), the “latents-first” order helps.
- It beats strong baselines. On ImageNet (a large standard image dataset), Latent Forcing sets a new state-of-the-art for pixel-space transformer generators at their compute scale—both when using labels (conditional) and without labels (unconditional).
- It simplifies the pipeline. You don’t need a separate decoder that can lose information. The model directly produces the final pixels and keeps all the details intact.
- It explains past results. The paper shows that some previous tricks that used pretrained features (like REPA) likely worked so well partly because of the ordering of information, not just because of the features themselves.
In short: You don’t have to throw away detail (via heavy compression) to make image generation easier. If you reveal the right kind of information first (the structure), generating the final pixels becomes both easier and better.
What could this mean for the future?
- Simpler, more accurate generators: Models may no longer need complicated tokenizers and decoders. A single end-to-end model could be enough.
- Better details without trade-offs: Because you don’t discard information up front, things people care about—like faces and text—can be preserved while still getting high-quality generations.
- A general lesson: Ordering the flow of information—what the model learns first, second, and so on—might be as important as the model architecture. This idea could help not just image generators, but also text, audio, and video models.
- Scaling up: Since the method makes minimal changes and uses proven components, it’s practical to scale on bigger datasets and models.
Overall, Latent Forcing suggests a simple but powerful idea: teach the model to “think in rough shapes first, then paint the tiny details,” and do it all in one place, directly in pixel space. This leads to better images and a cleaner design.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps and unresolved questions that future work could address:
- Dataset and task scope: Results are limited to ImageNet-256 with class-conditional and unconditional setups; transfer to higher resolutions (e.g., 512, 1024), diverse domains (e.g., faces, medical, satellite), and other tasks (text-to-image, editing, inpainting, compositional control) is untested.
- Scalability with resolution: It is unclear how dual-modality denoising scales in memory, training time, and sample quality as image resolution and model size increase.
- Compute and fairness: The paper states “at our compute scale” but does not report comparable FLOPs/throughput or wall-clock for training and sampling versus baselines, nor the cost of precomputing and storing latent targets during training.
- Latent choice dependence: Only a few deterministic latents (DINOv2, Data2Vec2, 64×64 pixels) are evaluated; sensitivity to other SSL encoders, layers, multi-scale features, CLIP-like semantics, or frequency-domain features remains unknown.
- End-to-end latent learning: Latents are fixed, externally pretrained features; whether jointly learning the latent extractor with the diffusion model improves performance, stability, or robustness is unexplored.
- Multi-tokenizer generality: Although the formulation supports k > 2 modalities, experiments use k = 2; benefits, conflicts, and scheduling with more modalities (e.g., text, depth, segmentation, audio) are unexamined.
- Fusion design: Token-level fusion is limited to additive embedding summation; alternatives (concatenation with projection, cross-attention, gated fusion, MoE across depth) and their impact on capacity and disentanglement are not studied.
- Ordering optimality theory: SNR-based ordering is motivated but not theoretically characterized for optimality or sample complexity; no guarantees relate ordering to learnability beyond empirical trends.
- Schedule learning: Time schedules are hand-designed (cascaded, variance shift, linear offset); learning schedules (per-sample or globally), dynamic switching criteria, or controller policies is an open direction.
- Stability of “noise on latents” trick: Adding small latent noise during pixel steps helps training but harms inference; the mechanism, optimal magnitude, and principled regularization alternatives remain unclear.
- Cascaded error mitigation at inference: Only training-time mitigations are explored; methods to reduce cascaded error at inference (e.g., iterative latent refinement, joint re-updating latents late in sampling) are not evaluated.
- Loss weighting and normalization: The paper equalizes loss magnitudes and matches global variance; principled per-channel/feature whitening, adaptive loss reweighting, or uncertainty-aware weighting is not investigated.
- Metric coverage: Evaluation focuses on FID; effects on precision/recall, diversity (mode coverage), CLIP score, human perceptual studies, semantic fidelity (e.g., fine text, faces), and robustness are missing.
- Likelihood and calibration: Despite the flow-based framing, no likelihood, bits-per-dimension, or calibration analyses are reported; it is unknown whether the joint objective biases modeling of P(X) versus P(X,Y) in practice.
- Generalization and robustness: OOD behavior, long-tail classes, adversarial/noisy inputs during sampling, and robustness to distribution shifts are unassessed.
- Guidance interactions: AutoGuidance and CFG-Interval are used, but sensitivity to guidance strengths, schedules, and combinations across modalities is not systematically analyzed.
- Training dynamics and late-stage degradation: Some configurations degrade with longer training; the causes (overfitting to latent details, optimization pathologies) and regularizers to prevent this are unresolved.
- Architectural capacity split: The optional “output experts” split only the last 4 layers without added parameters; optimal depth/width allocation, gating, or full MoE for modality-specific heads is unexplored.
- Patch alignment artifacts: DINOv2 uses 224×224/14×14 patches, while pixels use 256×256/16×16; the impact of resizing/interpolation on alignment and quality, and alternatives for exact co-alignment, are not quantified.
- Hyperparameter sensitivity: Performance depends on t_clip, p_latent, schedule parameters (e.g., a, offsets), and timestep distributions; robustness ranges and automated tuning strategies are not provided.
- Sample efficiency and speed: Sampling uses 50 Heun steps; compatibility with fast samplers (e.g., progressive distillation, consistency models) and step-count–quality trade-offs for dual schedules are not explored.
- Bias in pretrained latents: Imported biases from DINOv2/Data2Vec2 may steer generation; effects on fairness, content bias, and controllability are not measured.
- Frequency vs semantic ordering: The method enforces semantic-first via latents, but the interaction with frequency-domain generation (e.g., spectral autoregression) and hybrid frequency/semantic orderings remains open.
- Joint vs cascaded denoising: Cascaded scheduling wins here, yet joint denoising shows promise; when joint scheduling is preferable, and how to tailor it for different latents/tasks, is unclear.
- Extension to conditioning beyond classes: Integrating rich conditioning (text, layout, style) as additional time-scheduled modalities, and their interaction with latent-forced ordering, is untested.
- Diversity–fidelity trade-off: Whether semantic-first ordering reduces diversity or causes mode bias is unquantified; precision–recall curves could clarify trade-offs.
- Memory/storage footprint: Training requires storing or recomputing latent targets; the memory/I/O implications at scale and mitigations (on-the-fly computation, caching policies) are not discussed.
- Reproducibility details: Some ablation settings (e.g., schedules, exact guidance parameters) are summarized but not exhaustively specified; a full recipe for exact reproduction is not evident.
- Safety and misuse: No discussion of safety filters, watermarking, or misuse risks; how latent forcing impacts controllability and safety tools is unknown.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed or prototyped with current tools and compute, derived from the paper’s method (Latent Forcing), findings on ordering, and training/inference insights.
- Upgrade existing pixel-space diffusion systems with order-aware scheduling to improve image quality
- Sector: software/AI, creative tools
- What: Incorporate a “latent-first, pixel-later” denoising trajectory in DiT/JiT-based image generators; add a second time embedding and modality-specific schedules; optionally split final layers into pixel/latent experts; mix AutoGuidance and CFG-Interval by modality.
- Tools/workflows:
- Hugging Face/Diffusers or internal DiT pipelines with an extra time-MLP and merged latent+pixel tokens
- Schedule tuner to search cascaded vs variance-shift schedules; use Multi-Schedule training for exploration and Single-Schedule for deployment
- Assumptions/dependencies: Quality gains shown on ImageNet-256; requires pretrained latent encoder choice (e.g., DINOv2 or domain-specific alternative); some hyperparameter sensitivity (p_latent, t_clip, early-step sampling).
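The "extra time embedding" mentioned above is a small change in code terms. The sketch below shows the idea with a plain sinusoidal embedding; dimensions, function names, and the elementwise sum are illustrative assumptions, not the paper's or any library's implementation (real DiTs also pass the embedding through an MLP).

```python
# A standard DiT conditions on one sinusoidal timestep embedding; the change
# is forming a second embedding for the latent timestep and summing it in.
import math

def sinusoidal_embedding(t, dim=8):
    """Transformer-style sinusoidal embedding of a scalar timestep."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def dual_time_conditioning(t_latent, t_pixel, dim=8):
    # the added component is just the second embedding, summed elementwise
    e_latent = sinusoidal_embedding(t_latent, dim)
    e_pixel = sinusoidal_embedding(t_pixel, dim)
    return [a + b for a, b in zip(e_latent, e_pixel)]

cond = dual_time_conditioning(t_latent=0.0, t_pixel=0.7)
```

Because the two dials enter through the same conditioning pathway, an existing DiT pipeline needs only this extra embedding (and optionally split output layers) to support per-modality schedules.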
- Simplify inference pipelines by removing separate decoders (lossless pixel-space generation with latent “scratchpad”)
- Sector: software/AI, enterprise infra
- What: Replace VAE/VQ decoders with a single DiT that generates pixels directly while using internally generated latents as scratchpad—reduces engineering and runtime complexity.
- Tools/workflows: Consolidated model artifact (no external decoder), unified quantization and serving; simplified model update/rollout.
- Assumptions/dependencies: Training still needs latent embeddings during training (via a self-supervised encoder); inference-time performance/latency subject to transformer size and step count.
- Higher-fidelity synthetic data generation for vision pretraining and benchmarking
- Sector: robotics, autonomy, retail/ads, web-scale vision
- What: Use latent-first generation to produce images with improved large-scale structure, aiding dataset quality (e.g., class-consistent silhouettes and layouts) for pretraining or augmentation.
- Tools/workflows: Class-conditional/unconditional generation with cascaded schedule; integrate AutoGuidance for later pixel steps to improve sample quality.
- Assumptions/dependencies: Benefits demonstrated at 256×256; for task-specific realism, swap in domain encoders (e.g., DINOv2 → a domain-specific SSL encoder) and retune schedules.
- Mid-trajectory safety and quality control via latent “checkpoints”
- Sector: policy/safety, consumer apps, platform integrity
- What: Because semantics emerge in the latent phase, run content moderation, NSFW/brand-compliance filters, or steerable controls before rendering high-frequency pixel detail; abort or adjust generation early.
- Tools/workflows: Hook moderation classifiers to latent tokens/time steps; apply guidance only during pixel steps; log latent snapshots for audit/debugging.
- Assumptions/dependencies: Requires appropriate classifiers on latent representations; governance for logging intermediate states; careful latency budgeting.
- Rapid prototyping and pedagogy for diffusion research: ordering as a first-class variable
- Sector: academia/education
- What: Use Multi-Schedule models to study how ordering affects diffusability, conditional vs unconditional generation, and representation choices (DINOv2 vs D2V2).
- Tools/workflows: Open-source codebase; schedule search and FID/PSNR grids across modality timesteps; curriculum for advanced ML courses.
- Assumptions/dependencies: Compute availability for ablations; metrics (FID-10K/50K) and datasets (ImageNet) accessible.
- Domain-adaptable training by swapping latent encoders without changing inference architecture
- Sector: healthcare, industrial inspection, mapping/remote sensing
- What: Train with a domain-specific self-supervised encoder providing aligned latent patches; apply same inference engine (no external decoder) for better structural faithfulness in niche domains.
- Tools/workflows: Precompute latents from a domain SSL backbone; align patch grids; cascaded schedule favoring latents early.
- Assumptions/dependencies: Availability of high-quality domain SSL models; regulatory review for synthetic medical imagery; need to validate clinical fidelity.
- Operational efficiency improvements in MLOps
- Sector: enterprise ML platforms
- What: Fewer moving parts than latent diffusion (no encoder/decoder in production), easier versioning, simpler failure modes; reduced dependency on third-party decoder licenses.
- Tools/workflows: Single-asset model management; inference monitoring on per-modality guidance; A/B testing different schedules.
- Assumptions/dependencies: May not reduce GPU time vs latent diffusion at scale today; gains are primarily simplification and quality.
Long-Term Applications
These require further research, scaling, domain adaptation, or engineering to generalize beyond the paper’s ImageNet setup.
- Text-to-image and multi-conditional generation with latent “scratchpads”
- Sector: creative tools, media, advertising
- What: Generate semantic latents (e.g., CLIP/DINO-style or text-derived representations) first, then pixels—potentially improving prompt fidelity and reducing artifacts (e.g., faces/text).
- Tools/products: “Order-aware” text-to-image engines; prompt-to-latent composer modules; mid-trajectory editing GUIs.
- Assumptions/dependencies: Need to validate with text conditioning (not covered in paper); robust alignment losses and schedule design across text and vision tokenizers.
- High-resolution and video generation with multi-time schedules
- Sector: film/VFX, gaming, AR/VR
- What: Extend cascaded scheduling to pyramids (e.g., 64→256→1024) and to video by separating motion/structure latents from per-frame pixel detail; denoise structure first to stabilize temporal coherence.
- Tools/products: Order-aware video generators; hierarchical schedulers; temporal latent encoders (e.g., trajectory features).
- Assumptions/dependencies: Scaling laws and training stability at 1024+ and for long sequences; large compute; motion-specific latent encoders.
- Joint generation of pixels and structured scene latents (depth, segmentation, normals)
- Sector: robotics, autonomous driving, digital twins, 3D vision
- What: Co-generate scene-consistent auxiliary maps as early latents (structure first), then render photorealistic pixels; improves controllability and utility for simulation/synthetic data.
- Tools/products: Sim2Real content engines; multi-head DiTs with depth/seg/pixel outputs; controllable knobs over structure latents.
- Assumptions/dependencies: Need reliable self-supervised structured encoders; calibration to ensure alignment between maps and final images; evaluation protocols for consistency.
- Generative compression and photo storage with near-lossless semantics-first decoding
- Sector: consumer cloud, mobile, imaging
- What: Store compact structural latents and regenerate pixels with order-aware diffusion; prioritize semantics at low bitrates and refine details on demand.
- Tools/products: “Scratchpad-first” codecs; progressive decoding apps (quick semantic preview → detailed render).
- Assumptions/dependencies: Rate–distortion and latency performance vs modern codecs; on-device efficiency of DiT inference; standardization.
- Interactive, controllable editing via latent-phase manipulation
- Sector: creative suites, design/CAD, e-commerce
- What: Edit scenes by modifying early-stage latents (layout, object presence) before committing to high-frequency detail; supports deterministic resets and localized changes.
- Tools/products: Latent timeline editors; constraint solvers that nudge latents during early steps to meet user-specified structure.
- Assumptions/dependencies: UI/UX for latent editing; real-time inference; guardrails to prevent semantic drift.
- Safety, attribution, and watermarking embedded in the order of generation
- Sector: policy, platform governance
- What: Insert provenance signals or safety filters in the latent-first phase to make them harder to remove; audit trails based on latent trajectory summaries rather than pixels alone.
- Tools/products: Order-aware watermarking; latent-phase red-teaming; auditing APIs exposing latent trajectory stats.
- Assumptions/dependencies: Robustness of latent-phase marks to post-processing; standardization/acceptance by platforms; privacy considerations for logging intermediate states.
- Domain-critical imaging (e.g., medical) with structure-preserving generation
- Sector: healthcare, scientific imaging
- What: Employ anatomically meaningful self-supervised latents (e.g., for MR/CT) so semantics emerge first; aim for higher faithfulness than lossy latent diffusion; could support data augmentation or anonymized synthesis.
- Tools/products: Domain SSL encoders; regulatory-grade evaluation suites; order-aware conditioning with clinician-defined constraints.
- Assumptions/dependencies: Rigorous validation for clinical safety/effectiveness; bias and hallucination audits; compute and data for domain SSL.
- Evaluation standards and metrics for “diffusability” and ordering
- Sector: academia, standards bodies
- What: Formalize SNR-trajectory-based metrics and benchmarks; protocols for comparing orderings across modalities and domains; guidance best practices (AutoGuidance vs CFG-Interval by modality).
- Tools/products: Open benchmarks; schedule-optimization libraries; teaching toolkits.
- Assumptions/dependencies: Community adoption; consistent datasets and compute baselines.
Notes on feasibility across applications:
- The demonstrated gains are at 256×256 with class-conditional and unconditional ImageNet; generalization to text, higher resolutions, and video requires further evidence.
- Performance depends on the quality and alignment of the chosen latent encoder; for specialized domains, strong self-supervised encoders are a prerequisite.
- While inference pipelines simplify (no external decoder), training remains compute-intensive; end-user latency improvements may require model distillation or fewer sampling steps.
- Guidance strategies are modality-sensitive; the best hybrid (AutoGuidance for pixel steps, CFG-Interval for latent steps) may vary by domain and conditioning.
Glossary
- adaLN-Zero: An adaptive LayerNorm variant used in DiTs that conditions normalization on time and class embeddings. "adaLN-Zero (Peebles & Xie, 2023) adds a learned class embedding to a time embedding for conditioning"
- adversarial loss: A GAN-style loss used to improve realism in autoencoder-based tokenizers/decoders. "adversarial loss (Goodfellow et al., 2014)"
- AutoGuidance: A guidance technique that steers diffusion sampling using a weaker version of the model. "We implement both AutoGuidance (Karras et al., 2024) and Classifier Free Guidance (Ho & Salimans, 2022)"
- cascaded generation: A staged denoising procedure where one modality (e.g., latents) is fully denoised before another (e.g., pixels). "a cascaded schedule that entirely denoises latents before pixels"
- CFG-Interval: Applying classifier-free guidance only over a selected time interval to improve quality. "CFG restricted to an interval (CFG-Interval) (Kynkäänniemi et al., 2024)"
- Classifier-Free Guidance: A guidance method that mixes conditional and unconditional predictions to control fidelity vs diversity. "Classifier Free Guidance (Ho & Salimans, 2022)"
- DINOv2: A self-supervised vision transformer providing robust image representations used as conditioning latents. "we follow prior work (Yao et al., 2025; Yu et al., 2025) and use DINOv2 (Oquab et al., 2023) for the latent space."
- diffusion transformer (DiT): A transformer-based architecture for diffusion models, replacing U-Nets for scalability. "diffusion transformer (DiT) (Peebles & Xie, 2023)"
- diffusability: How amenable a representation space is to diffusion modeling and denoising. "proxies for the 'diffusability' (Skorokhodov et al., 2025) of a tokenizer space"
- ELBO: Evidence Lower Bound; an objective related to likelihood maximization used in variational/diffusion formulations. "while still optimizing for the ELBO of the input data."
- Euler step: A first-order numerical integration step used to update the denoising trajectory. "Then, for an Euler step from global time t to s, we perform"
- FID: Fréchet Inception Distance; a standard metric for evaluating generative image quality/diversity. "FID (Heusel et al., 2017)"
- Flow-Based Diffusion Models: A formulation connecting diffusion to continuous flows and straightened trajectories. "Flow-Based Diffusion Models (Liu et al., 2022)"
- Gaussian channel capacity: The information-theoretic upper bound on mutual information in a Gaussian noise channel. "upper-bounded by the Gaussian channel capacity"
- Heun steps: A second-order numerical integration (improved Euler) used for higher-quality sampling. "we use 50 Heun steps"
- JiT: A pixel-space diffusion training setup that predicts denoised targets directly to ease high-dimensional learning. "JiT (Li & He, 2025) demonstrated that by modifying the training output to predict denoised pixels directly"
- KL-regularized autoencoders: Autoencoders trained with a Kullback–Leibler penalty to shape latent distributions. "continuous embeddings from KL-regularized autoencoders (Kingma & Welling, 2013)."
- latent diffusion models: Models that perform diffusion in a learned latent space rather than pixels for efficiency. "Latent diffusion models excel at generating high- quality images"
- logit-normal schedule: A timestep sampling distribution (via a logit-normal) used to balance diffusion training over noise levels. "shifted logit-normal schedule (Karras et al., 2022)."
- mutual information: The amount of information shared between variables, used here to reason about denoising and SNR. "The mutual information of a latent with respect to the noised latent I(X_i; Z_{i,t_i}) is strictly monotonic"
- patchifying: Converting an image into non-overlapping patches to form tokens for transformer-based models. "we first obtain pixel representations by patchifying into 256 tokens"
- perceptual loss: A loss computed in feature space (e.g., VGG) to encourage perceptual similarity. "perceptual loss (Johnson et al., 2016)"
- PSNR: Peak Signal-to-Noise Ratio; a reconstruction quality metric used for tokenizers and ablations. "state-of-the-art latent generation models generally operate with low (<32) PSNR tokenizers."
- REPA: A distillation method that injects pretrained representations into diffusion training to improve quality. "REPA (Yu et al., 2025) demonstrated remarkable gains in diffusion model performance"
- Representation-Conditioned Generation (RCG): Generating internal representation tokens (e.g., CLS) as conditioning signals to guide images. "Representation-Conditioned Generation (RCG) (Li et al., 2023) showed that generating the CLS token of pretrained image models for guidance conditioning can improve both conditional and unconditional generation."
- RoPE: Rotary position embeddings; a positional encoding technique used in transformers. "such as RoPE (Su et al., 2024)."
- scratchpad: An intermediate latent workspace produced during denoising to guide later pixel generation. "which effectively serves as a 'scratchpad' to condition the generation of the natural image"
- self-supervised encoder: An encoder trained without labels whose representations are used as latents for conditioning. "reveal self-supervised encoder latents before pixels"
- Signal-to-Noise Ratio (SNR): The ratio of signal variance to noise variance, central to scheduling and information flow. "SNR is the Signal-to-Noise Ratio."
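The SNR, mutual information, and Gaussian channel capacity entries in this glossary fit together through a standard information-theory bound, stated here for context (notation is ours, not a quote from the paper):

```latex
Z_t = \alpha_t X + \sigma_t \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I)
\qquad\Longrightarrow\qquad
\mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2},
\quad
I(X; Z_t) \le \tfrac{1}{2}\log\bigl(1 + \mathrm{SNR}(t)\bigr)
```

The inequality (per dimension, for unit-variance X) is the Gaussian channel capacity: as the schedule lowers noise, the bound on how much information the noised latent can reveal rises monotonically, which is what makes SNR trajectories a proxy for ordering.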
- time schedule: The distribution/trajectory over noise levels (timesteps) that controls denoising order and difficulty. "The time schedule is a criti- cal choice in efficiently training diffusion models"
- time shift function (fa-shift): A function that reparameterizes timesteps to match an effective scaling of latent magnitude. "f_a-shift(t) = at / (1 + (a − 1)t)"
- tokenizer: A learned module that converts images to discrete or continuous latent tokens for generative modeling. "latent 'tokenizers' (Esser et al., 2020; Rombach et al., 2021b; Peebles & Xie, 2023)"
- U-Net architectures: Convolutional encoder–decoder networks with skip connections, historically used in diffusion. "such as U-Net architectures (Ronneberger et al., 2015)."
- v-loss: A loss reweighting/prediction parameterization used with x-prediction to stabilize training. "implement v_θ as a v-loss with x prediction"
- VQGAN: Vector-quantized GAN; an autoencoder with discrete codes widely used as a tokenizer/decoder in latent diffusion. "VQGAN (Esser et al., 2020) and (Rombach et al., 2021a) largely defined the currently existing paradigm of latent diffusion decoders"
- x-prediction: Predicting the clean data x directly (instead of noise or velocity) to improve high-dimensional denoising. "we follow JiT and use x-prediction with v-loss weighting"