EM Distillation for One-step Diffusion Models
Abstract: While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model into a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection between our method and existing methods that minimize the mode-seeking KL divergence. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
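To make the abstract's E-step/M-step structure concrete, below is a minimal, heavily simplified sketch in PyTorch. Everything in it is an assumption for illustration, not the paper's exact procedure: a hypothetical `generator` module mapping latents to images, a hypothetical `teacher_score(x, sigma)` callable exposing the pretrained diffusion model's score, a short unadjusted Langevin chain standing in for sampling from the joint of the teacher prior and generator latents, and a plain squared-error M-step. The paper's reparametrized sampling scheme and noise-cancellation technique are not reproduced here.

```python
import torch

def em_distill_step(generator, teacher_score, opt, batch_size, z_dim,
                    sigma=0.5, num_langevin_steps=4, step_size=1e-2):
    """One schematic E-step/M-step update for EM-style distillation (sketch)."""
    z = torch.randn(batch_size, z_dim)  # generator latents

    # E-step (sketch): start from the one-step generator sample and run a few
    # unadjusted Langevin steps guided by the teacher's score, standing in for
    # sampling from the joint distribution of teacher prior and generator latents.
    with torch.no_grad():
        x = generator(z)
        for _ in range(num_langevin_steps):
            x = (x + 0.5 * step_size * teacher_score(x, sigma)
                 + step_size ** 0.5 * torch.randn_like(x))

    # M-step (sketch): pull the generator output toward the corrected samples.
    # A squared-error objective is used here for simplicity (assumption).
    loss = torch.mean((generator(z) - x) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In this reading, the E-step refines generator samples under the teacher, and the M-step is a standard maximum-likelihood-style regression of the generator onto those refined samples; the paper's contributions concern making this loop stable, which the sketch deliberately omits.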