
LightningDiT-XL/1+IG: Efficient Latent Diffusion

Updated 5 January 2026
  • The paper demonstrates that LightningDiT-XL/1+IG uses a Transformer-based latent diffusion approach with internal guidance, achieving superior FID scores without auxiliary networks.
  • It integrates a self-supervised tokenizer and an ODE-based Heun solver to improve denoising accuracy and training efficiency in compressed VAE latent space.
  • Empirical results show significant enhancements in sample quality and speed, with minimal extra computational overhead compared to previous latent diffusion models.

LightningDiT-XL/1+IG is a class-conditional latent diffusion model for image generation that combines the LightningDiT-XL/1 Transformer-based architecture with the Internal Guidance (IG) strategy. Developed for the ImageNet 256×256 benchmark, it achieves state-of-the-art Fréchet Inception Distance (FID) scores without auxiliary networks, extra sampling steps, or classifier-based guidance. The method improves both sample quality and training efficiency relative to previous latent diffusion models (Zhou et al., 30 Dec 2025).

1. Model Architecture

LightningDiT-XL/1+IG utilizes the LightningDiT-XL/1 backbone, a Transformer-based architecture optimized for latent diffusion in the VAE-compressed image space. The backbone comprises 28 Transformer blocks, each incorporating multi-head self-attention (16 heads, embedding dimension 1152) and MLP layers using GELU or SiLU activations. Input images are encoded by a Stable-Diffusion VAE into a 32×32×4 latent tensor, which is flattened to a sequence of 1024 tokens of dimension 4 and then linearly projected to 1152.
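The token arithmetic above can be checked with a short shape sketch. It assumes a patch size of 1 (each latent position becomes one token) and a randomly initialized projection; `patchify` and `project` are illustrative names, not the authors' code.

```python
import random

def patchify(latent):
    """Flatten an H x W x C latent (nested lists) into H*W tokens of length C."""
    return [pixel for row in latent for pixel in row]

def project(tokens, weight):
    """Apply a linear map (C -> D) to each token; weight holds C rows of D values."""
    d = len(weight[0])
    return [[sum(tok[c] * weight[c][j] for c in range(len(tok))) for j in range(d)]
            for tok in tokens]

H, W, C, D = 32, 32, 4, 1152  # latent height, width, channels, embedding width
rng = random.Random(0)
latent = [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0, 0.02) for _ in range(D)] for _ in range(C)]  # stand-in weights

tokens = patchify(latent)
embedded = project(tokens[:8], weight)  # project only a few tokens to keep the demo fast
print(len(tokens), len(tokens[0]), len(embedded[0]))
```

The three printed values are the sequence length, the per-token latent dimension, and the embedding width after projection.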

Patch representation diverges from SiT variants by incorporating a self-supervised-representation-based tokenizer (DINOv2-B) in the decoder. The model adopts a v-prediction training objective, and sampling is performed by an ODE-based Heun solver with 125 steps, replacing the commonly used SDE Euler–Maruyama sampler.

Model optimization employs the Muon optimizer (β₁=0.9, β₂=0.95), with an exponential moving average (EMA) decay constant of 0.9995, set lower than related models to stabilize early-stage training.
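For reference, the EMA decay enters through the standard exponential moving average update. The sketch below applies that generic update to a toy parameter vector; it is not taken from the authors' training code, and the parameters are held fixed only to make the averaging behavior visible.

```python
def ema_update(ema, params, decay=0.9995):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

ema = [0.0, 0.0]
params = [1.0, -1.0]  # held fixed here; in real training these change every step
for _ in range(2000):
    ema = ema_update(ema, params)

# With fixed params the EMA closes the gap as 1 - decay**n after n steps.
print(ema[0], 1 - 0.9995 ** 2000)
```

A decay of 0.9995 corresponds to an averaging horizon of roughly 1 / (1 - 0.9995) = 2000 steps, so the average tracks the live weights faster than a decay closer to 1 would.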

2. Latent Diffusion Formulation

LightningDiT-XL/1+IG operates within the established latent diffusion framework in VAE-compressed latent space, consistent with architectures such as ADM and DiT. The forward noising process for timestep $t \in \{1,\dots,T\}$ is defined as:

  • $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$
  • In continuous time: $x_t = \alpha_t x_0 + \sigma_t \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$, $\alpha_t$ decreasing and $\sigma_t$ increasing in $t$.
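A minimal sketch of the continuous-time noising rule follows. The source does not state the schedule, so the linear choice $\alpha_t = 1 - t$, $\sigma_t = t$ on $t \in [0, 1]$ is an assumption made only for illustration, as are the function names.

```python
import random

def alpha_sigma(t):
    """Assumed linear schedule on t in [0, 1]: alpha falls, sigma rises."""
    return 1.0 - t, t

def noise_latent(x0, t, rng):
    """Sample x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, I)."""
    alpha, sigma = alpha_sigma(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [alpha * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0, 0.0]
x_clean, _ = noise_latent(x0, 0.0, rng)    # t = 0 leaves the latent unchanged
x_noise, eps = noise_latent(x0, 1.0, rng)  # t = 1 yields pure noise
x_mid, _ = noise_latent(x0, 0.5, rng)      # in between: a blend of signal and noise
```

Under this schedule the endpoints recover the clean latent and pure Gaussian noise, which is the behavior the monotonicity conditions on $\alpha_t$ and $\sigma_t$ are meant to guarantee.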

The reverse (denoising) process is modeled by:

  • $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$
  • $\mu_\theta$ is parameterized by a neural network $D_\theta$, which may predict the clean latent $x_0$ ($x_0$-prediction), the noise $\epsilon$ ($\epsilon$-prediction), or a velocity combining the two ($v$-prediction, the objective adopted here).

The training objective minimizes the denoising loss:

  • $L_{\text{simple}} = \mathbb{E}_{x_0 \sim p_{\text{data}},\, t,\, \epsilon \sim \mathcal{N}(0, I)} \lVert D_\theta(x_t, t) - x_0 \rVert^2$
  • with $x_t = \alpha_t x_0 + \sigma_t \epsilon$.

3. Internal Guidance (IG) Training Strategy

Internal Guidance (IG) introduces auxiliary supervision to intermediate Transformer layers, creating an implicit "weaker" version of the model that can serve as a self-guidance signal during sampling. Supervision is applied at the initial eight blocks of the 28-block LightningDiT-XL/1 model. The model computes intermediate $D_i(\cdot, t)$ and final $D_f(\cdot, t)$ outputs; the corresponding losses are:

  • $L_{\text{inter}} = \mathbb{E}\left[\lVert D_i(x_t, t) - x_0 \rVert^2\right]$
  • $L_{\text{final}} = \mathbb{E}\left[\lVert D_f(x_t, t) - x_0 \rVert^2\right]$

The total loss combines intermediate and final outputs:

  • $L = L_{\text{final}} + \lambda L_{\text{inter}}$
  • with $\lambda = 0.5$ in practice.
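The combined objective can be sketched for a single sample as follows. The two prediction vectors are hypothetical stand-ins for the intermediate and final outputs; a real implementation would average the squared norms over a batch of latents, timesteps, and noise draws.

```python
def sq_norm_diff(pred, target):
    """Squared L2 distance ||pred - target||^2 between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))

def ig_loss(d_inter, d_final, x0, lam=0.5):
    """Total loss L = L_final + lam * L_inter for one sample."""
    l_inter = sq_norm_diff(d_inter, x0)
    l_final = sq_norm_diff(d_final, x0)
    return l_final + lam * l_inter, l_inter, l_final

x0 = [1.0, 0.0, -1.0, 2.0]
d_inter = [0.0, 0.0, 0.0, 0.0]   # hypothetical intermediate-head output (weaker)
d_final = [1.0, 0.0, -1.0, 1.0]  # hypothetical final-layer output (stronger)
total, l_inter, l_final = ig_loss(d_inter, d_final, x0)
print(total, l_inter, l_final)  # 4.0 6.0 1.0
```

Because $\lambda < 1$, the final output dominates the objective while the intermediate head still receives a direct gradient signal.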

This auxiliary objective alleviates gradient vanishing in deep Transformers and ensures intermediate representations act as degraded versions suitable for internal guidance.

4. Generative Sampling with IG

During generative sampling, LightningDiT-XL/1+IG leverages both intermediate and final predictions at each denoising step $t$:

  • Compute $d_i = D_i(x_t, t)$ and $d_f = D_f(x_t, t)$.
  • Form a linearly extrapolated output: $d_w = d_i + w(d_f - d_i)$, with $w > 1$.

This extrapolated prediction $d_w$ is then used in one step of the Heun ODE solver to advance the denoising process. The IG scale $w$ is typically 1.4 and is applied on a piecewise-constant interval $(\sigma_{\text{low}}, \sigma_{\text{high}}) \approx (0.3, 1)$: $w(\sigma) = w$ within this interval and $w(\sigma) = 1$ elsewhere. IG can be combined with classifier-free guidance (CFG), applied simultaneously with $w_{\text{cfg}} = 1.45$. Sampling requires no additional networks or degradation strategies.
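The piecewise-constant scale and the extrapolation can be written as two small functions. This is a sketch: treating the interval endpoints as inclusive is an assumption, since the source does not say how the boundaries are handled, and the function names are illustrative.

```python
def ig_scale(sigma, w=1.4, sigma_low=0.3, sigma_high=1.0):
    """Return w inside [sigma_low, sigma_high] (endpoints assumed inclusive), else 1."""
    return w if sigma_low <= sigma <= sigma_high else 1.0

def extrapolate(d_i, d_f, w):
    """Linear extrapolation d_w = d_i + w * (d_f - d_i)."""
    return [a + w * (b - a) for a, b in zip(d_i, d_f)]

d_i, d_f = [0.0, 2.0], [1.0, 2.0]
guided = extrapolate(d_i, d_f, ig_scale(0.5))    # inside the interval: pushed past d_f
unguided = extrapolate(d_i, d_f, ig_scale(0.1))  # outside the interval: equals d_f
print(guided, unguided)  # [1.4, 2.0] [1.0, 2.0]
```

With $w = 1$ the extrapolation reduces to the final prediction, so guidance is effectively switched off outside the interval; where the two predictions already agree, the scale has no effect.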

Pseudocode for the IG-guided sampling step is:

x_T ~ N(0, I)
for t = T,...,1:
    d_i = D_i(x_t, t)
    d_f = D_f(x_t, t)
    d_w = d_i + w(σ_t) * (d_f - d_i)
    x_{t-1} = HeunStep(x_t, d_w, t)
return x_0
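A runnable toy version of this loop is sketched below. Several pieces are assumptions made for illustration and are not taken from the source: the linear path $x_t = (1 - t)x_0 + t\epsilon$ (so $\sigma_t = t$ and the ODE slope is $(x_t - \hat{x}_0)/t$), the single-point toy dataset `MU`, and the hypothetical denoisers `d_final` and `d_inter`. Each Heun step evaluates the guided prediction twice, once at the current point and once at the Euler proposal.

```python
import random

MU = [1.0, -2.0, 0.5]  # toy "dataset": one point, so the ideal x0 prediction is MU

def d_final(x, t):
    """Hypothetical final-layer x0 prediction (exact for the toy data)."""
    return list(MU)

def d_inter(x, t):
    """Hypothetical weaker intermediate prediction, blurred toward the noisy input."""
    return [0.8 * m + 0.2 * xi for m, xi in zip(MU, x)]

def guided_x0(x, t, w=1.4, lo=0.3, hi=1.0):
    """d_w = d_i + w(t) * (d_f - d_i), with w(t) = w on [lo, hi] and 1 elsewhere."""
    scale = w if lo <= t <= hi else 1.0
    return [a + scale * (b - a) for a, b in zip(d_inter(x, t), d_final(x, t))]

def slope(x, t):
    """ODE right-hand side dx/dt = (x - x0_hat) / t under the assumed linear path."""
    return [(xi - di) / t for xi, di in zip(x, guided_x0(x, t))]

def sample(steps=125, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in MU]             # x_T ~ N(0, I)
    ts = [1.0 - k / steps for k in range(steps + 1)]  # time grid from 1 down to 0
    for t, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t
        k1 = slope(x, t)
        x_euler = [xi + h * ki for xi, ki in zip(x, k1)]
        if t_next > 0.0:
            k2 = slope(x_euler, t_next)               # Heun correction at the proposal
            x = [xi + 0.5 * h * (a + b) for xi, a, b in zip(x, k1, k2)]
        else:
            x = x_euler  # final step: plain Euler avoids dividing by t = 0
    return x

x0 = sample()
print(x0)
```

In this toy setting the sampler lands on `MU` up to rounding, because the final step runs outside the guidance interval and therefore uses the exact final prediction.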

5. Empirical Evaluation

LightningDiT-XL/1+IG demonstrates superior performance on ImageNet 256×256 in class-conditional generation. Comparative results across methods are summarized below:

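| Method | Epochs | FID ↓ | sFID ↓ | IS ↑ | Prec. ↑ | Rec. ↑ |
|---|---|---|---|---|---|---|
| LightningDiT-XL/1 | 800 | 2.17 | 4.36 | 205.6 | 0.77 | 0.65 |
| LightningDiT-XL/1+IG (1.4) | 60 | 2.42 | 3.81 | 173.7 | 0.79 | 0.62 |
| LightningDiT-XL/1+IG (1.4) | 680 | 1.34 | 3.94 | 229.3 | 0.78 | 0.65 |
| LightningDiT-XL/1+IG (1.4) + CFG (1.45) | 680 | 1.19 | 4.11 | 269.0 | 0.79 | 0.66 |
| SiT-XL/2+IG | 80 | 5.31 | - | - | - | - |
| SiT-XL/2+IG | 800 | 1.75 | - | - | - | - |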

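LightningDiT-XL/1+IG achieves FID 1.34 without CFG and a state-of-the-art FID of 1.19 with CFG at 680 epochs. After only 60 epochs its FID of 2.42 is already close to the 2.17 that the LightningDiT-XL/1 baseline reaches at 800 epochs. IG adds only 0.44% parameters, 0.01% FLOPs, and 0.16% sampling latency. The source reports a 70–80% FID reduction compared to the baseline; measured against the LightningDiT-XL/1 row above, the reduction is about 38% without CFG and 45% with CFG.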

6. Implementation and Computational Profile

LightningDiT-XL/1+IG is realized in a compressed latent space using the "sd-vae-ft-ema" Stable Diffusion VAE, where $z \in \mathbb{R}^{32 \times 32 \times 4}$. The Muon optimizer is configured with a learning rate of $2 \times 10^{-4}$, no weight decay, and an EMA decay of 0.9995. Training uses a batch size of 1024 for XL-scale models and can extend to 680 epochs (about 0.85M optimizer steps at this batch size, given the roughly 1.28M ImageNet-1k training images). The sampler is a Heun ODE solver with 125 timesteps, with no reliance on classifier-based or external degradation guidance.
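Epochs convert to optimizer steps through the dataset size and batch size. The quick check below assumes the standard ImageNet-1k training split of 1,281,167 images and one optimizer step per batch (no gradient accumulation); neither figure is stated in the source.

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # assumed: standard ImageNet-1k training split
BATCH_SIZE = 1024
EPOCHS = 680

steps_per_epoch = math.ceil(IMAGENET_TRAIN_IMAGES / BATCH_SIZE)  # last batch is partial
total_steps = steps_per_epoch * EPOCHS
print(steps_per_epoch, total_steps)  # 1252 851360
```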

Training and sampling use mixed-precision (fp16) arithmetic and FlashAttention for acceleration. Computations are performed on clusters of NVIDIA A6000 and 4090Ti GPUs.

The simplicity of IG introduces minimal computational overhead, establishing LightningDiT-XL/1+IG as an efficient, high-fidelity generative model without the increased complexity of classifier or auxiliary network-based guidance strategies (Zhou et al., 30 Dec 2025).
