OneDP: Single-Step Diffusion Distillation
- OneDP is a method that converts multi-step diffusion models into one-step generators by unlocking pre-trained generative priors through GAN-based fine-tuning.
- It uses a GAN loss combined with strategic freezing of approximately 85.8% of the generator's parameters (the convolutional layers) to align the generator's output with the real data distribution.
- Empirical results demonstrate that OneDP achieves near state-of-the-art image quality with only one network evaluation, significantly reducing inference cost.
Single-step distillation, abbreviated here as "OneDP" (an editor's term, not the paper's), encompasses methods for reducing the number of sampling steps in diffusion models to one, thereby transforming a pre-trained multi-step diffusion model into a generator whose output is produced in a single forward pass. In "Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation" (Zheng et al., 11 Jun 2025), this approach is re-examined with new evidence and a theoretical and practical framework emphasizing GAN-based fine-tuning and architectural freezing. The core insight is that the generative prior learned through multi-step diffusion training can be efficiently "unlocked" for one-step generation by leveraging the structure of the trained model, particularly via discriminator-guided fine-tuning with minimal parameter updates.
1. Motivation and Theoretical Background
Traditional diffusion models achieve high-fidelity sample generation by gradually denoising Gaussian noise through a lengthy iterative process, often requiring dozens to hundreds of neural network evaluations per sample. This iterative nature imposes prohibitive inference cost for real-time applications or large-scale deployment. Conventional distillation-based acceleration approaches—such as Progressive Distillation, Consistency Distillation (CD), and Score Identity Distillation (SiD)—impose an ℓ₂ instance-matching loss between multi-step teacher and one-step student outputs:

$$\mathcal{L}_{\text{IM}}(\theta) = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\,\lVert G_\theta(z) - x_{\text{teacher}}(z) \rVert_2^2\,\right],$$

where $G_\theta(z)$ is the one-step student sample and $x_{\text{teacher}}(z)$ is the teacher's multi-step sample generated from the same noise $z$.
However, the stochastic, step-size-mismatched, and parametrically divergent nature of the student (one-step) and teacher (multi-step) models causes these losses to drive the student into a distinct, frequently suboptimal local minimum. The key structural observation is that, even at equivalent sample quality, the student's solution basin is inherently distinct from the teacher's, making direct imitation inefficient or suboptimal [(Zheng et al., 11 Jun 2025), Sec. 2.3].
2. Methodological Formulation
The OneDP approach replaces the imitation (distillation) loss entirely with a GAN-based distributional alignment objective. The process consists of initializing a generator with the weights of a diffusion-trained U-Net and performing adversarial training against a discriminator, such that the generator's one-shot samples are indistinguishable from real data in a learned feature space.
2.1 GAN-Only Loss
The adopted loss is the non-saturating GAN loss with R1 regularization:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] - \mathbb{E}_{z}[\log(1 - D(G_\theta(z)))] + \frac{\gamma}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\lVert \nabla_x D(x) \rVert_2^2\right],$$
$$\mathcal{L}_G = -\mathbb{E}_{z}[\log D(G_\theta(z))],$$

where $G_\theta$ is the one-step generator initialized from the pre-trained diffusion U-Net, $D$ is the discriminator, $z \sim \mathcal{N}(0, I)$, and $\gamma$ controls the strength of discriminator regularization.
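The losses above can be sketched numerically. The following is a minimal, framework-free illustration of the non-saturating objectives applied to discriminator logits (using the identity −log σ(x) = softplus(−x)); the R1 gradient-penalty term is passed in precomputed, since computing ∇ₓD(x) requires an autograd framework:

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + e^x)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def generator_loss(d_fake_logit: float) -> float:
    # Non-saturating generator loss: -log sigmoid(D(G(z))) = softplus(-D(G(z)))
    return softplus(-d_fake_logit)

def discriminator_loss(d_real_logit: float, d_fake_logit: float,
                       r1_penalty: float, gamma: float) -> float:
    # -log sigmoid(D(x)) - log(1 - sigmoid(D(G(z)))) + (gamma/2) * ||grad_x D(x)||^2
    # r1_penalty is assumed precomputed as the squared gradient norm on real data.
    return (softplus(-d_real_logit) + softplus(d_fake_logit)
            + 0.5 * gamma * r1_penalty)
```

With uninformative logits (both zero) and no penalty, both sides sit at log 2 per term, the classic GAN equilibrium value; confident real/fake separation drives the generator loss up and the discriminator loss down, which is what the adversarial fine-tuning exploits.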
2.2 Architectural Freezing
A central innovation is the "D2O-F" variant, in which 85.8% of the generator's parameters—the convolutional layers of the encoder/decoder—are frozen, leaving only normalization layers, QKV projections in the self-attention blocks, and 1×1 residual projections trainable (7.9%, 2.1%, and 4.0% of parameters, respectively). This freezing exploits the observation that the core generative capacity (the hierarchical frequency structure) is already encoded in the frozen weights. Fine-tuning only these small subsets is sufficient for near state-of-the-art results [(Zheng et al., 11 Jun 2025), Tab. 3].
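A freezing scheme of this kind is typically implemented by matching parameter names. The sketch below selects trainable parameters by substring patterns; the patterns (`norm`, `qkv`, `skip_proj`) are illustrative assumptions, not the paper's exact module names:

```python
def select_trainable(param_names):
    """Return the subset of parameter names left trainable under a
    D2O-F-style freeze: normalization layers, attention QKV projections,
    and 1x1 residual ("skip") projections. All other parameters (the
    convolutional encoder/decoder stack) remain frozen.
    Name patterns here are illustrative, not the paper's exact keys."""
    TRAINABLE_PATTERNS = ("norm", "qkv", "skip_proj")
    return {n for n in param_names if any(p in n for p in TRAINABLE_PATTERNS)}
```

In a PyTorch codebase one would then iterate `model.named_parameters()` and set `p.requires_grad = name in trainable`, so the optimizer only ever touches the small trainable subset.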
2.3 Training Protocol
- Generator and discriminator are trained using Adam (β₁=0, β₂=0.99) with no weight decay and bfloat16 mixed precision.
- Batch sizes: 256 for CIFAR-10 (32×32), 128 for 64×64 benchmarks.
- Exponential moving average (EMA) is applied to generator weights (half-life 0.5 million images).
- No learning rate scheduler is used; adaptive discriminator augmentation is disabled.
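The EMA schedule above (half-life of 0.5 million images) converts to a per-update decay factor. This is a minimal sketch assuming the images-based half-life convention common in EDM-style codebases; the helper names are mine, not the paper's:

```python
def ema_decay(batch_size: int, half_life_images: float = 0.5e6) -> float:
    # Per-update decay such that a weight's contribution halves after
    # half_life_images training images have been seen:
    # decay = 0.5 ** (images_per_update / half_life_images)
    return 0.5 ** (batch_size / half_life_images)

def ema_update(ema_weights, weights, decay):
    # Standard exponential moving average:
    # theta_ema <- decay * theta_ema + (1 - decay) * theta
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```

At batch size 256 this gives a decay of roughly 0.99965 per step, i.e. the EMA tracks the generator on a timescale of a few thousand updates.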
Typical hyperparameters:
| Dataset | G lr | D lr | M images | % frozen | Batch |
|---|---|---|---|---|---|
| CIFAR-10 (32×32) | 1e-4 | 1e-4 | 5 | 85.8 | 256 |
| AFHQ/FFHQ (64×64) | 2e-5 | 4e-5 | 5–10 | 85.8 | 128 |
| ImageNet (64×64) | 8e-6 | 4e-5 | 5 | 85.8 | 512 |
3. Architectural Analysis and Frequency-Domain Perspective
Frequency-domain analysis of the diffusion U-Net reveals that deep, low-resolution blocks are specialized to restoring low frequencies, while high-resolution blocks target higher frequencies. During multi-step denoising, lower frequencies are restored first at high noise levels, and higher frequencies are reconstructed as the noise level decreases (Sec. 3.1, Fig. 4). This structure imparts an implicit band-pass decomposition across blocks and time. The GAN-based one-step fine-tuning reconfigures or aligns these frequency-specialized modules such that the generator learns a direct noise-to-image mapping covering the full frequency hierarchy in a single step.
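The kind of band decomposition described above can be illustrated with a toy 1-D spectral split: partitioning a signal's DFT energy into low- and high-frequency bands, analogous to the paper's radial frequency analysis of U-Net block outputs. This is a didactic sketch, not the paper's analysis code:

```python
import cmath
import math

def band_energies(signal, split=0.25):
    """Split a 1-D real signal's spectral energy into low- and
    high-frequency bands. `split` is the fraction of the Nyquist
    range counted as "low". DC is skipped; only positive frequencies
    are summed."""
    n = len(signal)
    low = high = 0.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient via the direct sum (fine for toy sizes)
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energy = abs(coeff) ** 2
        if k <= split * (n // 2):
            low += energy
        else:
            high += energy
    return low, high
```

A slowly varying input concentrates its energy in the low band and a rapidly oscillating one in the high band; in the paper's setting, the analogous measurement is taken per U-Net block to show which bands each block restores.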
Block-wise frequency specialization analysis (Appendix E, Fig. 6) supports that diffusion pre-training decomposes image generation by distributing frequency tasks across the U-Net architecture, and a lightweight GAN fine-tuning can rapidly align those to produce high-fidelity samples directly.
4. Empirical Results
4.1 Image Generation Benchmarks
| Dataset / Method | NFE | FID | IS | Images (M) | Architecture |
|---|---|---|---|---|---|
| CIFAR-10 D2O | 1 | 1.66 | 10.11 | 5 | Full U-Net |
| CIFAR-10 D2O-F | 1 | 1.54 | 10.10 | 5 | Frozen conv (85.8%) |
| AFHQv2 D2O | 1 | 1.23 | — | 5 | Full U-Net |
| AFHQv2 D2O-F | 1 | 1.31 | — | 10 | Frozen conv |
| FFHQ D2O | 1 | 1.08 | — | 5 | Full U-Net |
| FFHQ D2O-F | 1 | 0.85 | — | 10 | Frozen conv |
| ImageNet D2O | 1 | 1.42 | — | 5 | Full U-Net |
| ImageNet D2O-F | 1 | 1.16 | — | 5 | Frozen conv |
- D2O-F (frozen-conv) matches or outperforms the full-tuned D2O on all tested datasets (Sec. 4.1–4.3, Tables 4–6).
- Sample quality, as measured by Fréchet Inception Distance (FID), matches or exceeds many-step baselines and prior distilled one-step models.
- GAN-fine-tuned one-step U-Nets achieve near-SOTA with one order-of-magnitude fewer training images (5–10M vs. >100M for standard distillation) and require only a single forward pass at inference.
4.2 Ablations and Qualitative Analysis
- Freezing the convolutional stack is not only compatible with the GAN loss but also essential for preserving the frequency decomposition induced by diffusion pre-training. Freezing only a subset of layers within a distillation-based (CD+GAN) framework leads to collapse.
- Adding an extra distillation (CD) loss to the GAN objective slows convergence and has only a marginal effect on FID, confirming that GAN fine-tuning alone is sufficient as the primary mechanism for single-step distillation.
- Qualitative outputs of D2O/D2O-F are visually similar to multi-step EDM outputs but are not identical—GAN alignment locates a valid optimum close to the data distribution rather than replicating the teacher's mapping (Sec. 4.5).
5. Comparative Analysis and Implications
- Direct GAN adaptation of diffusion U-Nets converges to one-step capability far faster than prior instance-matching or hybrid objectives.
- The approach is agnostic to backbone: other architectures (e.g., DiT, ADM, latent models) are compatible with identical or analogous partial-freezing strategies.
- Higher-resolution generation is feasible via upsampled fine-tuning and external super-resolution modules.
- Downstream adaptation (editing, inpainting) can be realized by substituting the GAN loss with a domain-appropriate objective, leveraging the same core frozen architecture.
6. Broader Perspective: Diffusion as Generative Pre-training
The results and analysis presented reinterpret diffusion model training as a form of generative pre-training. The pre-trained U-Net encodes a band-limited hierarchy of priors that, upon GAN fine-tuning, are "unlocked" for instantaneous, one-step image generation without the need for explicit teacher-student imitation. This re-framing suggests that the core limitation previously attributed to one-step distillation—loss of expressiveness or sample quality—stemmed primarily from loss mismatch and not intrinsic model inadequacy (Sec. 5.1). Consequently, OneDP via GAN alignment leverages both the generative basis learned in diffusion pre-training and the distributional matching power of the GAN framework with minimal data and compute.
References:
- "Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation" (Zheng et al., 11 Jun 2025)