Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
Abstract: Latent diffusion models have become the popular choice for scaling up diffusion models for high-resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can be very competitive with latent models in both quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256, and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss-weighting (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at a high resolution with fewer parameters, rather than using more parameters at a lower resolution. Combining these with guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
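The recipe's first step refers to the sigmoid loss-weighting of Kingma & Gao (2023), a weight w(λ) = σ(b − λ) applied over the log-SNR λ of each noise level. Below is a minimal NumPy sketch of how such a weighting enters an ε-prediction loss. The default bias and the function names are illustrative placeholders, not the paper's prescribed hyper-parameters:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_loss_weight(log_snr, bias=0.0):
    """Sigmoid weighting w(lambda) = sigmoid(bias - lambda).

    Down-weights high log-SNR (nearly clean) timesteps, where epsilon
    prediction is easy, and keeps full weight on noisier timesteps.
    The bias shifts where the transition happens.
    """
    return sigmoid(bias - log_snr)


def weighted_eps_loss(eps_pred, eps_true, log_snr, bias=0.0):
    """Weighted epsilon MSE: per-example MSE scaled by w(lambda),
    averaged over the batch. log_snr has shape (batch,)."""
    w = sigmoid_loss_weight(log_snr, bias)
    per_example_mse = np.mean(
        (eps_pred - eps_true) ** 2,
        axis=tuple(range(1, eps_pred.ndim)),
    )
    return np.mean(w * per_example_mse)
```

For very negative log-SNR (heavy noise) the weight approaches 1, and for very positive log-SNR it approaches 0, so training effort concentrates on the noise levels that matter for perceptual quality.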
- eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. CoRR, abs/2211.01324, 2022.
- All are worth words: A ViT backbone for diffusion models. In CVPR, 2023.
- Ting Chen. On the importance of noise scheduling for diffusion models. arXiv, 2023.
- Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- f-dm: A multi-stage diffusion model via progressive signal transformation. CoRR, abs/2210.04955, 2022.
- Matryoshka diffusion models. CoRR, abs/2310.15111, 2023.
- DiffiT: Diffusion vision transformers for image generation. CoRR, abs/2312.02139, 2023.
- Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.
- Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47:1–47:33, 2022.
- simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pp. 13213–13232. PMLR, 2023.
- Scalelong: Towards more stable training of diffusion model via scaling network long skip connection. In NeurIPS, 2023.
- Scalable adaptive computation for iterative generation. CoRR, abs/2212.11972, 2022.
- SCEdit: Efficient and controllable image diffusion generation via skip connection editing. CoRR, abs/2312.11392, 2023.
- Distribution augmentation for generative modeling. In Proceedings of the 37th International Conference on Machine Learning, ICML, 2020.
- Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS, 2022.
- Analyzing and improving the training dynamics of diffusion models. CoRR, abs/2312.02696, 2023.
- Guiding a diffusion model with a bad version of itself. CoRR, abs/2406.02507, 2024.
- Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=ymjI8feDTD.
- PaGoDA: Progressive growing of a one-step generator from a low-resolution diffusion teacher. CoRR, abs/2405.14822, 2024b.
- Understanding the diffusion objective as a weighted integral of ELBOs. CoRR, abs/2303.00848, 2023.
- Variational diffusion models. CoRR, abs/2107.00630, 2021.
- Applying guidance in a limited interval improves sample and distribution quality in diffusion models. CoRR, abs/2404.07724, 2024.
- Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda. OpenReview.net, 2023.
- Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- The surprising effectiveness of skip-tuning in diffusion sampling. In Proceedings of the 41st International Conference on Machine Learning, pp. 34053–34074, 2024. URL https://proceedings.mlr.press/v235/ma24r.html.
- Scalable diffusion models with transformers. CoRR, abs/2212.09748, 2022.
- High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10674–10685. IEEE, 2022.
- U-Net: Convolutional networks for biomedical image segmentation. Technical report, arXiv, 2015.
- Photorealistic text-to-image diffusion models with deep language understanding. CoRR, abs/2205.11487, 2022.
- Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR. OpenReview.net, 2022.
- Multistep distillation of diffusion models via moment matching. arXiv preprint arXiv:2406.04103, 2024.
- StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In Munkhtsetseg Nandigjav, Niloy J. Mitra, and Aaron Hertzmann (eds.), SIGGRAPH '22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, pp. 49:1–49:10. ACM, 2022.
- Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2016.
- Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- Tackling the generative learning trilemma with denoising diffusion gans. In International Conference on Learning Representations, 2021.
- UFOGen: You forward once large scale text-to-image generation via diffusion GANs. arXiv preprint arXiv:2311.09257, 2023.
- DisCo-Diff: Enhancing continuous diffusion models with discrete latents. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
- One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023.
- Improved distribution matching distillation for fast image synthesis. CoRR, abs/2405.14867, 2024.
- Scaling autoregressive models for content-rich text-to-image generation. CoRR, abs/2206.10789, 2022.
- Language model beats diffusion - tokenizer is key to visual generation. In The Twelfth International Conference on Learning Representations, ICLR. OpenReview.net, 2024.