LightningDiT-XL/1+IG: Efficient Latent Diffusion
- The paper demonstrates that LightningDiT-XL/1+IG uses a Transformer-based latent diffusion approach with internal guidance, achieving superior FID scores without auxiliary networks.
- It integrates a self-supervised tokenizer and an ODE-based Heun solver to improve denoising accuracy and training efficiency in compressed VAE latent space.
- Empirical results show significant enhancements in sample quality and speed, with minimal extra computational overhead compared to previous latent diffusion models.
LightningDiT-XL/1+IG is a class-conditional latent diffusion model for image generation that combines the Transformer-based LightningDiT-XL/1 architecture with the Internal Guidance (IG) strategy. Developed for the ImageNet 256×256 benchmark, it achieves state-of-the-art Fréchet Inception Distance (FID) scores without auxiliary networks, extra sampling steps, or classifier-based guidance. The method improves both sample quality and training efficiency relative to previous latent diffusion models (Zhou et al., 30 Dec 2025).
1. Model Architecture
LightningDiT-XL/1+IG utilizes the LightningDiT-XL/1 backbone, a Transformer-based architecture optimized for latent diffusion in the VAE-compressed image space. The backbone comprises 28 Transformer blocks, each incorporating multi-head self-attention (16 heads, embedding dimension 1152) and MLP layers using GELU or SiLU activations. Input images are encoded by a Stable-Diffusion VAE into a 32×32×4 latent tensor, which is flattened to a sequence of 1024 tokens of dimension 4 and then linearly projected to 1152.
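The token shapes quoted above can be checked with a small NumPy sketch. It only illustrates the reshaping and the linear projection; the function name `patchify_and_embed` and the random projection weights are hypothetical stand-ins, not the model's actual parameters.

```python
import numpy as np

# Shapes taken from the text: 32x32x4 latent, patch size 1, model width 1152.
H, W, C, D_MODEL = 32, 32, 4, 1152

def patchify_and_embed(latent, w_embed, b_embed):
    """Flatten an (H, W, C) latent into H*W tokens and project each to D_MODEL."""
    tokens = latent.reshape(-1, latent.shape[-1])   # patch size 1: one token per position
    return tokens, tokens @ w_embed + b_embed       # (H*W, C) and (H*W, D_MODEL)

rng = np.random.default_rng(0)
latent = rng.standard_normal((H, W, C))             # stand-in for a VAE latent
w_embed = rng.standard_normal((C, D_MODEL)) * 0.02  # hypothetical projection weights
b_embed = np.zeros(D_MODEL)

tokens, embedded = patchify_and_embed(latent, w_embed, b_embed)
print(tokens.shape, embedded.shape)
```

With patch size 1 the sequence length equals the number of latent positions, which is why the "/1" variant processes 1024 tokens per image under the stated latent shape.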
Patch representation diverges from SiT variants by incorporating a self-supervised-representation-based tokenizer (DINOv2-B) in the decoder. The model adopts a v-prediction training objective, and sampling is performed by an ODE-based Heun solver with 125 steps, replacing the commonly used SDE Euler–Maruyama sampler.
Model optimization employs the Muon optimizer (β₁=0.9, β₂=0.95), with an exponential moving average (EMA) decay constant of 0.9995, set lower than related models to stabilize early-stage training.
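As a rough illustration of what the EMA decay controls, the sketch below applies the standard exponential-moving-average update with the stated decay of 0.9995. The dictionary-of-arrays parameter format and the function name `ema_update` are assumptions made for the example.

```python
import numpy as np

def ema_update(ema_params, params, decay=0.9995):
    """One EMA step: ema <- decay * ema + (1 - decay) * current parameters."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

params = {"w": np.ones(3)}   # pretend the live weights stay constant at 1
ema = {"w": np.zeros(3)}     # EMA copy initialised at 0
for _ in range(1000):
    ema = ema_update(ema, params)

# After n steps the EMA has closed a fraction 1 - decay**n of the initial gap.
print(ema["w"][0], 1.0 - 0.9995 ** 1000)
```

A decay of 0.9995 corresponds to an averaging horizon of roughly 1 / (1 - 0.9995) = 2000 steps, so the averaged weights track the live weights more quickly than they would with a larger decay.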
2. Latent Diffusion Formulation
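LightningDiT-XL/1+IG operates within the established latent diffusion framework in VAE-compressed latent space, following the diffusion formulation used by models such as ADM and DiT. The forward noising process for timestep $t$ is defined as:
- In continuous time: $x_t = \alpha_t x_0 + \sigma_t \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$, and $\alpha_t$ decreasing, $\sigma_t$ increasing in $t$.
The reverse (denoising) process is modeled by:
- $p_\theta(x_{t-1} \mid x_t)$ is parameterized by a neural network $D_\theta(x_t, t)$ which predicts either the velocity $v$ (v-prediction) or the noise $\epsilon$ (ε-prediction).
The training objective minimizes the denoising loss:
- $\mathcal{L}(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\lVert D_\theta(x_t, t) - y_t \rVert_2^2\right]$, with $y_t$ the regression target ($v$ or $\epsilon$) and $x_t = \alpha_t x_0 + \sigma_t \epsilon$.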
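The forward process and regression objective described above can be sketched as follows. The schedule is an assumption: the text only requires a decreasing signal coefficient and an increasing noise coefficient, and the linear interpolant (signal weight 1 - t, noise weight t) is used here purely for illustration. Under that assumption the v-prediction target is the time derivative of the noised latent, which equals the noise minus the clean latent.

```python
import numpy as np

def alpha(t):
    return 1.0 - t          # assumed signal coefficient (decreasing in t)

def sigma(t):
    return t                # assumed noise coefficient (increasing in t)

def noise_latent(x0, eps, t):
    """Forward process: x_t = alpha_t * x0 + sigma_t * eps."""
    return alpha(t) * x0 + sigma(t) * eps

def v_target(x0, eps):
    """Velocity target d x_t / d t under the assumed linear schedule."""
    return eps - x0

def denoising_loss(pred, target):
    """Mean squared error between a network prediction and its target."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1024, 4))    # clean latent tokens
eps = rng.standard_normal(x0.shape)    # Gaussian noise
x_t = noise_latent(x0, eps, t=0.3)
print(denoising_loss(np.zeros_like(x0), v_target(x0, eps)))  # loss of a trivial all-zero predictor
```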
3. Internal Guidance (IG) Training Strategy
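Internal Guidance (IG) introduces auxiliary supervision to intermediate Transformer layers, creating an implicit "weaker" version of the model that can serve as a self-guidance signal during sampling. Supervision is applied at the initial eight blocks of the 28-block LightningDiT-XL/1 model. The model computes an intermediate output $D_i(x_t, t)$ and a final output $D_f(x_t, t)$; with $y_t$ denoting the regression target, the corresponding losses are:
- $\mathcal{L}_i = \mathbb{E}\left[\lVert D_i(x_t, t) - y_t \rVert_2^2\right]$ and $\mathcal{L}_f = \mathbb{E}\left[\lVert D_f(x_t, t) - y_t \rVert_2^2\right]$.
The total loss combines intermediate and final outputs:
- $\mathcal{L} = \mathcal{L}_f + \lambda \mathcal{L}_i$, with $\lambda$ a fixed scalar weight in practice.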
This auxiliary objective alleviates gradient vanishing in deep Transformers and ensures intermediate representations act as degraded versions suitable for internal guidance.
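A minimal sketch of the combined objective, assuming it is the final-output loss plus a weighted intermediate-output loss. The weight `lam` and the toy arrays are hypothetical, and the functions stand in for losses that would normally be computed on network outputs inside a training loop.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all elements."""
    return float(np.mean((pred - target) ** 2))

def ig_training_loss(d_i, d_f, target, lam):
    """Assumed form of the IG objective: final loss + lam * intermediate loss."""
    return mse(d_f, target) + lam * mse(d_i, target)

target = np.zeros((8, 4))          # toy regression target
d_f = np.ones((8, 4))              # toy final-layer output (error 1 per element)
d_i = 2.0 * np.ones((8, 4))        # toy intermediate output (error 2 per element)
print(ig_training_loss(d_i, d_f, target, lam=0.5))   # lam=0.5 is an illustrative value only
```

Because both terms regress to the same target, the intermediate head learns a valid but less accurate predictor, which is what makes it usable as the weak branch at sampling time.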
4. Generative Sampling with IG
During generative sampling, LightningDiT-XL/1+IG leverages both intermediate and final predictions at each denoising step $t$:
- Compute $D_i(x_t, t)$ and $D_f(x_t, t)$.
- Form a linearly extrapolated output: $D_w(x_t, t) = D_i(x_t, t) + w\,(D_f(x_t, t) - D_i(x_t, t))$, with $w = w(\sigma_t)$ the IG scale.
This extrapolated prediction is then used in one step of the Heun ODE solver to advance the denoising process. The IG scale is typically 1.4 and is applied over a piecewise-constant guidance interval: $w = 1.4$ within this interval and $w = 1$ elsewhere, where $D_w$ reduces to the final output $D_f$. IG can be combined with classifier-free guidance (CFG), with both applied simultaneously. The sampling requires no additional networks or degradation strategies.
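The extrapolation and the interval-limited scale can be sketched as two small functions. The interval bounds `sigma_lo` and `sigma_hi` are hypothetical placeholders, since the interval itself is not specified here, and the fallback to a scale of 1 outside the interval is an assumption that makes the guided output equal to the final output there.

```python
import numpy as np

def ig_scale(sigma_t, w, sigma_lo, sigma_hi):
    """Piecewise-constant IG scale: w inside [sigma_lo, sigma_hi], 1.0 outside."""
    return w if sigma_lo <= sigma_t <= sigma_hi else 1.0

def ig_combine(d_i, d_f, scale):
    """Linear extrapolation from the intermediate output through the final output."""
    return d_i + scale * (d_f - d_i)

d_i = np.zeros(4)                  # toy intermediate prediction
d_f = np.ones(4)                   # toy final prediction
inside = ig_combine(d_i, d_f, ig_scale(0.5, w=1.4, sigma_lo=0.2, sigma_hi=0.8))
outside = ig_combine(d_i, d_f, ig_scale(0.9, w=1.4, sigma_lo=0.2, sigma_hi=0.8))
print(inside, outside)
```

A scale above 1 pushes the prediction past the final output, away from the weaker intermediate one, which is the same extrapolation pattern used by classifier-free guidance with the unconditional branch replaced by the intermediate head.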
Pseudocode for the IG-guided sampling step is:
```
x_T ~ N(0, I)
for t = T,...,1:
    d_i = D_i(x_t, t)
    d_f = D_f(x_t, t)
    d_w = d_i + w(σ_t) * (d_f - d_i)
    x_{t-1} = HeunStep(x_t, d_w, t)
return x_0
```
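The loop above can be made concrete with a toy NumPy sampler. Everything model-specific is an assumption: `d_final` and `d_inter` are hand-written stand-ins for the network's two outputs (the exact velocity for data concentrated at a single point `MU`, and a deliberately weaker half-strength version), time runs from 1 (noise) to 0 (data) under a linear schedule, and a constant scale is used instead of an interval. Note that a Heun step evaluates the guided prediction twice, once at the current point and once at the Euler-predicted point.

```python
import numpy as np

MU = np.array([2.0, -1.0])  # hypothetical toy data point

def d_final(x, t):
    """Toy final output: exact velocity when all data sits at MU."""
    return (x - MU) / t

def d_inter(x, t):
    """Toy intermediate output: a weaker (half-strength) velocity estimate."""
    return 0.5 * (x - MU) / t

def guided(x, t, w):
    """IG extrapolation d_i + w * (d_f - d_i)."""
    d_i, d_f = d_inter(x, t), d_final(x, t)
    return d_i + w * (d_f - d_i)

def sample(x_T, num_steps=125, w=1.4):
    """Integrate dx/dt = guided(x, t) from t=1 to t=0 with Heun's method."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x = x_T
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t                       # negative: moving toward t=0
        d1 = guided(x, t, w)
        x_pred = x + dt * d1                  # Euler predictor
        if t_next > 0.0:
            d2 = guided(x_pred, t_next, w)    # second evaluation at the predicted point
            x = x + dt * 0.5 * (d1 + d2)      # Heun corrector (trapezoidal average)
        else:
            x = x_pred                        # last step: plain Euler avoids evaluating at t=0
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal(2)
print(sample(x_T, w=1.0), sample(x_T, w=1.4))
```

In this toy setting a scale of 1 recovers `MU` up to floating-point error, while a scale of 1.4 lands slightly off it; the example illustrates the control flow only and says nothing about sample quality on real data.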
5. Empirical Evaluation
LightningDiT-XL/1+IG demonstrates superior performance on ImageNet 256×256 in class-conditional generation. Comparative results across methods are summarized below:
| Method | Epochs | FID ↓ | sFID ↓ | IS ↑ | Prec. ↑ | Rec. ↑ |
|---|---|---|---|---|---|---|
| LightningDiT-XL/1 | 800 | 2.17 | 4.36 | 205.6 | 0.77 | 0.65 |
| LightningDiT-XL/1+IG (1.4) | 60 | 2.42 | 3.81 | 173.7 | 0.79 | 0.62 |
| LightningDiT-XL/1+IG (1.4) | 680 | 1.34 | 3.94 | 229.3 | 0.78 | 0.65 |
| LightningDiT-XL/1+IG (1.4) + CFG (1.45) | 680 | 1.19 | 4.11 | 269.0 | 0.79 | 0.66 |
| SiT-XL/2+IG | 80 | 5.31 | — | — | — | — |
| SiT-XL/2+IG | 800 | 1.75 | — | — | — | — |
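LightningDiT-XL/1+IG achieves FID 1.34 without CFG and a state-of-the-art FID of 1.19 with CFG at 680 epochs, improving on the 800-epoch LightningDiT-XL/1 baseline (FID 2.17) with fewer training epochs. IG adds only 0.44% parameters, 0.01% FLOPs, and 0.16% sampling latency. The reported 70–80% FID reduction relative to the baseline is not reproduced by the rows tabulated here, where the drop from 2.17 to 1.34 is about 38%, so that figure likely refers to a different comparison setting.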
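The relative FID gaps between rows of the table can be checked directly. The sketch below uses only the tabulated values; the helper name `relative_reduction` is introduced for the example.

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of a lower-is-better metric relative to a baseline."""
    return 100.0 * (baseline - improved) / baseline

baseline_fid = 2.17   # LightningDiT-XL/1, 800 epochs
ig_fid = 1.34         # LightningDiT-XL/1+IG (1.4), 680 epochs
ig_cfg_fid = 1.19     # with CFG added, 680 epochs

print(round(relative_reduction(baseline_fid, ig_fid), 1))
print(round(relative_reduction(baseline_fid, ig_cfg_fid), 1))
```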
6. Implementation and Computational Profile
LightningDiT-XL/1+IG is realized in a compressed latent space using the "sd-vae-ft-ema" Stable Diffusion VAE, where each 256×256 image maps to a latent $z \in \mathbb{R}^{32 \times 32 \times 4}$. The Muon optimizer is configured with no weight decay and an EMA decay of 0.9995 (the learning rate value is not preserved in this text). Training uses a batch size of 1024 for XL-scale models and can extend to 680 epochs (about 0.85M optimizer steps at this batch size, given ImageNet's roughly 1.28M training images). The sampler is a Heun ODE solver with 125 timesteps, with no reliance on classifier-based or external degradation guidance.
Training and sampling use mixed-precision (fp16) arithmetic and FlashAttention for acceleration. Computations are performed on clusters of NVIDIA A6000 and 4090Ti GPUs.
The simplicity of IG introduces minimal computational overhead, establishing LightningDiT-XL/1+IG as an efficient, high-fidelity generative model without the increased complexity of classifier or auxiliary network-based guidance strategies (Zhou et al., 30 Dec 2025).