OneDP: Single-Step Diffusion Distillation
- OneDP is a method that converts multi-step diffusion models into one-step generators by unlocking pre-trained generative priors through GAN-based fine-tuning.
- It uses a GAN loss combined with strategic freezing of approximately 85.8% of the generator's parameters (the convolutional layers) to align the generator's output with the real data distribution.
- Empirical results demonstrate that OneDP achieves near state-of-the-art image quality with only one network evaluation, significantly reducing inference cost.
Single-step distillation, abbreviated here as "OneDP" (an editor's term, not the paper's), encompasses methods for reducing the number of sampling steps in diffusion models to one, thereby transforming a pre-trained multi-step diffusion model into a generator whose output is produced in a single forward pass. In "Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation" (Zheng et al., 11 Jun 2025), this approach is re-examined with new evidence and a theoretical and practical framework emphasizing GAN-based fine-tuning and architectural freezing. The core insight is that the generative prior learned through multi-step diffusion training can be efficiently "unlocked" for one-step generation by leveraging the structure of the trained model, particularly via discriminator-guided fine-tuning with minimal parameter updates.
1. Motivation and Theoretical Background
Traditional diffusion models achieve high-fidelity sample generation by gradually denoising Gaussian noise through a lengthy iterative process, often requiring dozens to hundreds of neural network evaluations per sample. This iterative nature imposes prohibitive inference cost for real-time applications or large-scale deployment. Conventional distillation-based acceleration approaches—such as Progressive Distillation, Consistency Distillation (CD), and Score Identity Distillation (SiD)—impose an ℓ₂ instance-matching loss between multi-step teacher and one-step student outputs:

$$\mathcal{L}_{\text{IM}}(\theta) = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\,\lVert G_\theta(z) - x_{\text{teacher}}(z) \rVert_2^2\,\right],$$

where $G_\theta(z)$ is the one-step student sample and $x_{\text{teacher}}(z)$ is the teacher's multi-step sample generated from the same noise $z$.
However, the stochastic, step-size-mismatched, and parametrically divergent nature of the student (one-step) and teacher (multi-step) models causes these losses to drive the student into a distinct, frequently suboptimal local minimum. The key structural observation is that, even at equivalent sample quality, the student's solution basin is inherently distinct from the teacher's, making direct imitation inefficient or suboptimal [(Zheng et al., 11 Jun 2025), Sec. 2.3].
2. Methodological Formulation
The OneDP approach replaces the imitation (distillation) loss entirely with a GAN-based distributional alignment objective. The process consists of initializing a generator with the weights of a diffusion-trained U-Net and performing adversarial training against a discriminator, such that the generator's one-shot samples are indistinguishable from real data in a learned feature space.
2.1 GAN-Only Loss
The adopted loss is the non-saturating GAN loss with R1 regularization:

$$\mathcal{L}_D = -\mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] - \mathbb{E}_{z}[\log(1 - D(G_\theta(z)))] + \frac{\gamma}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\lVert \nabla_x D(x) \rVert_2^2\right],$$
$$\mathcal{L}_G = -\mathbb{E}_{z}[\log D(G_\theta(z))],$$

where $G_\theta$ is the one-step generator initialized from the pre-trained diffusion U-Net, $D$ is the discriminator, $z \sim \mathcal{N}(0, I)$, and $\gamma$ controls the strength of discriminator regularization.
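The losses above can be sketched numerically. The following is a minimal, framework-free illustration of the non-saturating objectives applied to discriminator logits (using the identity −log σ(x) = softplus(−x)); the R1 gradient-penalty term is passed in precomputed, since computing ∇ₓD(x) requires an autograd framework:

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + e^x)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def generator_loss(d_fake_logit: float) -> float:
    # Non-saturating generator loss: -log sigmoid(D(G(z))) = softplus(-D(G(z)))
    return softplus(-d_fake_logit)

def discriminator_loss(d_real_logit: float, d_fake_logit: float,
                       r1_penalty: float, gamma: float) -> float:
    # -log sigmoid(D(x)) - log(1 - sigmoid(D(G(z)))) + (gamma/2) * ||grad_x D(x)||^2
    # r1_penalty is assumed precomputed as the squared gradient norm on real data.
    return (softplus(-d_real_logit) + softplus(d_fake_logit)
            + 0.5 * gamma * r1_penalty)
```

With uninformative logits (both zero) and no penalty, both sides sit at log 2 per term, the classic GAN equilibrium value; confident real/fake separation drives the generator loss up and the discriminator loss down, which is what the adversarial fine-tuning exploits.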
2.2 Architectural Freezing
A central innovation is the "D2O-F" variant, in which 85.8% of the generator's parameters—the convolutional layers of the encoder/decoder—are frozen, leaving only normalization layers, QKV projections in the self-attention blocks, and 1×1 residual projections trainable (7.9%, 2.1%, and 4.0% of parameters, respectively). This freezing exploits the observation that the core generative capacity (the hierarchical frequency structure) is already encoded in the frozen weights. Fine-tuning only these small subsets is sufficient for near state-of-the-art results [(Zheng et al., 11 Jun 2025), Tab. 3].
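A freezing scheme of this kind is typically implemented by matching parameter names. The sketch below selects trainable parameters by substring patterns; the patterns (`norm`, `qkv`, `skip_proj`) are illustrative assumptions, not the paper's exact module names:

```python
def select_trainable(param_names):
    """Return the subset of parameter names left trainable under a
    D2O-F-style freeze: normalization layers, attention QKV projections,
    and 1x1 residual ("skip") projections. All other parameters (the
    convolutional encoder/decoder stack) remain frozen.
    Name patterns here are illustrative, not the paper's exact keys."""
    TRAINABLE_PATTERNS = ("norm", "qkv", "skip_proj")
    return {n for n in param_names if any(p in n for p in TRAINABLE_PATTERNS)}
```

In a PyTorch codebase one would then iterate `model.named_parameters()` and set `p.requires_grad = name in trainable`, so the optimizer only ever touches the small trainable subset.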
2.3 Training Protocol
- Generator and discriminator are trained using Adam (β₁=0, β₂=0.99) with no weight decay and bfloat16 mixed precision.
- Batch sizes: 256 for CIFAR-10 (32×32), 128 for 64×64 benchmarks.
- Exponential moving average (EMA) is applied to generator weights (half-life 0.5 million images).
- No learning rate scheduler is used; adaptive discriminator augmentation is disabled.
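The EMA schedule above (half-life of 0.5 million images) converts to a per-update decay factor. This is a minimal sketch assuming the images-based half-life convention common in EDM-style codebases; the helper names are mine, not the paper's:

```python
def ema_decay(batch_size: int, half_life_images: float = 0.5e6) -> float:
    # Per-update decay such that a weight's contribution halves after
    # half_life_images training images have been seen:
    # decay = 0.5 ** (images_per_update / half_life_images)
    return 0.5 ** (batch_size / half_life_images)

def ema_update(ema_weights, weights, decay):
    # Standard exponential moving average:
    # theta_ema <- decay * theta_ema + (1 - decay) * theta
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]
```

At batch size 256 this gives a decay of roughly 0.99965 per step, i.e. the EMA tracks the generator on a timescale of a few thousand updates.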
Typical hyperparameters:
| Dataset | G lr | D lr | M images | % frozen | Batch |
|---|---|---|---|---|---|
| CIFAR-10 (32×32) | 1e-4 | 1e-4 | 5 | 85.8 | 256 |
| AFHQ/FFHQ (64×64) | 2e-5 | 4e-5 | 5–10 | 85.8 | 128 |
| ImageNet (64×64) | 8e-6 | 4e-5 | 5 | 85.8 | 512 |
3. Architectural Analysis and Frequency-Domain Perspective
Frequency-domain analysis of the diffusion U-Net reveals that deep, low-resolution blocks are specialized to restoring low frequencies, while high-resolution blocks target higher frequencies. During multi-step denoising, lower frequencies are restored first at high noise levels, and higher frequencies are reconstructed as the noise level decreases (Sec. 3.1, Fig. 4). This structure imparts an implicit band-pass decomposition across blocks and time. The GAN-based one-step fine-tuning reconfigures or aligns these frequency-specialized modules such that the generator learns a direct noise-to-image mapping covering the full frequency hierarchy in a single step.
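The kind of band decomposition described above can be illustrated with a toy 1-D spectral split: partitioning a signal's DFT energy into low- and high-frequency bands, analogous to the paper's radial frequency analysis of U-Net block outputs. This is a didactic sketch, not the paper's analysis code:

```python
import cmath
import math

def band_energies(signal, split=0.25):
    """Split a 1-D real signal's spectral energy into low- and
    high-frequency bands. `split` is the fraction of the Nyquist
    range counted as "low". DC is skipped; only positive frequencies
    are summed."""
    n = len(signal)
    low = high = 0.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient via the direct sum (fine for toy sizes)
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        energy = abs(coeff) ** 2
        if k <= split * (n // 2):
            low += energy
        else:
            high += energy
    return low, high
```

A slowly varying input concentrates its energy in the low band and a rapidly oscillating one in the high band; in the paper's setting, the analogous measurement is taken per U-Net block to show which bands each block restores.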
Block-wise frequency specialization analysis (Appendix E, Fig. 6) supports that diffusion pre-training decomposes image generation by distributing frequency tasks across the U-Net architecture, and a lightweight GAN fine-tuning can rapidly align those to produce high-fidelity samples directly.
4. Empirical Results
4.1 Image Generation Benchmarks
| Dataset / Method | NFE | FID | IS | Images (M) | Architecture |
|---|---|---|---|---|---|
| CIFAR-10 D2O | 1 | 1.66 | 10.11 | 5 | Full U-Net |
| CIFAR-10 D2O-F | 1 | 1.54 | 10.10 | 5 | Frozen conv (85.8%) |
| AFHQv2 D2O | 1 | 1.23 | — | 5 | Full U-Net |
| AFHQv2 D2O-F | 1 | 1.31 | — | 10 | Frozen conv |
| FFHQ D2O | 1 | 1.08 | — | 5 | Full U-Net |
| FFHQ D2O-F | 1 | 0.85 | — | 10 | Frozen conv |
| ImageNet D2O | 1 | 1.42 | — | 5 | Full U-Net |
| ImageNet D2O-F | 1 | 1.16 | — | 5 | Frozen conv |
- D2O-F (frozen-conv) matches or outperforms the full-tuned D2O on all tested datasets (Sec. 4.1–4.3, Tables 4–6).
- Sample quality, as measured by Fréchet Inception Distance (FID), matches or exceeds many-step baselines and prior distilled one-step models.
- GAN-fine-tuned one-step U-Nets achieve near-SOTA with one order-of-magnitude fewer training images (5–10M vs. >100M for standard distillation) and require only a single forward pass at inference.
4.2 Ablations and Qualitative Analysis
- Freezing the convolutional stack is not only compatible with the GAN loss but also essential for preserving the frequency decomposition induced by diffusion pre-training. Freezing only a subset of layers within a distillation-based (CD+GAN) framework leads to collapse.
- Adding an extra distillation (CD) loss to the GAN objective slows convergence and has only a marginal effect on FID, confirming that GAN fine-tuning alone is sufficient as the primary mechanism for single-step distillation.
- Qualitative outputs of D2O/D2O-F are visually similar to multi-step EDM outputs but are not identical—GAN alignment locates a valid optimum close to the data distribution rather than replicating the teacher's mapping (Sec. 4.5).
5. Comparative Analysis and Implications
- Direct GAN adaptation of diffusion U-Nets converges to one-step capability far faster than prior instance-matching or hybrid objectives.
- The approach is agnostic to backbone: other architectures (e.g., DiT, ADM, latent models) are compatible with identical or analogous partial-freezing strategies.
- Higher-resolution generation is feasible via upsampled fine-tuning and external super-resolution modules.
- Downstream adaptation (editing, inpainting) can be realized by substituting the GAN loss with a domain-appropriate objective, leveraging the same core frozen architecture.
6. Broader Perspective: Diffusion as Generative Pre-training
The results and analysis presented reinterpret diffusion model training as a form of generative pre-training. The pre-trained U-Net encodes a band-limited hierarchy of priors that, upon GAN fine-tuning, are "unlocked" for instantaneous, one-step image generation without the need for explicit teacher-student imitation. This re-framing suggests that the core limitation previously attributed to one-step distillation—loss of expressiveness or sample quality—stemmed primarily from loss mismatch and not intrinsic model inadequacy (Sec. 5.1). Consequently, OneDP via GAN alignment leverages both the generative basis learned in diffusion pre-training and the distributional matching power of the GAN framework with minimal data and compute.
References:
- "Revisiting Diffusion Models: From Generative Pre-training to One-Step Generation" (Zheng et al., 11 Jun 2025)