Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
Abstract: Latent diffusion models have become the popular choice for scaling up diffusion models for high-resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can be very competitive with latent models in both quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128, ImageNet256, and Kinetics600. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss-weighting (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at a high resolution with fewer parameters, rather than using more parameters at a lower resolution. Combining these with guidance intervals, we obtain a family of pixel-space diffusion models we call Simpler Diffusion (SiD2).
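The recipe's first step refers to the sigmoid loss-weighting of Kingma & Gao (2023), a weight w(λ) = σ(b − λ) applied over the log-SNR λ of each noise level. Below is a minimal NumPy sketch of how such a weighting enters an ε-prediction loss. The default bias and the function names are illustrative placeholders, not the paper's prescribed hyper-parameters:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_loss_weight(log_snr, bias=0.0):
    """Sigmoid weighting w(lambda) = sigmoid(bias - lambda).

    Down-weights high log-SNR (nearly clean) timesteps, where epsilon
    prediction is easy, and keeps full weight on noisier timesteps.
    The bias shifts where the transition happens.
    """
    return sigmoid(bias - log_snr)


def weighted_eps_loss(eps_pred, eps_true, log_snr, bias=0.0):
    """Weighted epsilon MSE: per-example MSE scaled by w(lambda),
    averaged over the batch. log_snr has shape (batch,)."""
    w = sigmoid_loss_weight(log_snr, bias)
    per_example_mse = np.mean(
        (eps_pred - eps_true) ** 2,
        axis=tuple(range(1, eps_pred.ndim)),
    )
    return np.mean(w * per_example_mse)
```

For very negative log-SNR (heavy noise) the weight approaches 1, and for very positive log-SNR it approaches 0, so training effort concentrates on the noise levels that matter for perceptual quality.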
- eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. CoRR, abs/2211.01324, 2022.
- All are worth words: A ViT backbone for diffusion models. In CVPR, 2023.
- Ting Chen. On the importance of noise scheduling for diffusion models. arXiv, 2023.
- Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- f-dm: A multi-stage diffusion model via progressive signal transformation. CoRR, abs/2210.04955, 2022.
- Matryoshka diffusion models. CoRR, abs/2310.15111, 2023.
- DiffiT: Diffusion vision transformers for image generation. CoRR, abs/2312.02139, 2023.
- Denoising diffusion probabilistic models. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS, 2020.
- Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23:47:1–47:33, 2022.
- simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pp. 13213–13232. PMLR, 2023.
- Scalelong: Towards more stable training of diffusion model via scaling network long skip connection. In NeurIPS, 2023.
- Scalable adaptive computation for iterative generation. CoRR, abs/2212.11972, 2022.
- SCEdit: Efficient and controllable image diffusion generation via skip connection editing. CoRR, abs/2312.11392, 2023.
- Distribution augmentation for generative modeling. In Proceedings of the 37th International Conference on Machine Learning, ICML, 2020.
- Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS, 2022.
- Analyzing and improving the training dynamics of diffusion models. CoRR, abs/2312.02696, 2023.
- Guiding a diffusion model with a bad version of itself. CoRR, abs/2406.02507, 2024.
- Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=ymjI8feDTD.
- PaGoDA: Progressive growing of a one-step generator from a low-resolution diffusion teacher. CoRR, abs/2405.14822, 2024b.
- Understanding the diffusion objective as a weighted integral of ELBOs. CoRR, abs/2303.00848, 2023.
- Variational diffusion models. CoRR, abs/2107.00630, 2021.
- Applying guidance in a limited interval improves sample and distribution quality in diffusion models. CoRR, abs/2404.07724, 2024.
- Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda. OpenReview.net, 2023.
- Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
- The surprising effectiveness of skip-tuning in diffusion sampling. In Proceedings of the 41st International Conference on Machine Learning, pp. 34053–34074, 2024. URL https://proceedings.mlr.press/v235/ma24r.html.
- Scalable diffusion models with transformers. CoRR, abs/2212.09748, 2022.
- High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 10674–10685. IEEE, 2022.
- U-Net: Convolutional networks for biomedical image segmentation. Technical report, arXiv, 2015.
- Photorealistic text-to-image diffusion models with deep language understanding. CoRR, abs/2205.11487, 2022.
- Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR. OpenReview.net, 2022.
- Multistep distillation of diffusion models via moment matching. arXiv preprint arXiv:2406.04103, 2024.
- StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In Munkhtsetseg Nandigjav, Niloy J. Mitra, and Aaron Hertzmann (eds.), SIGGRAPH '22: Special Interest Group on Computer Graphics and Interactive Techniques Conference, pp. 49:1–49:10. ACM, 2022.
- Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2016.
- Score-based generative modeling through stochastic differential equations. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- Tackling the generative learning trilemma with denoising diffusion gans. In International Conference on Learning Representations, 2021.
- UFOGen: You forward once large scale text-to-image generation via diffusion GANs. arXiv preprint arXiv:2311.09257, 2023.
- DisCo-Diff: Enhancing continuous diffusion models with discrete latents. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
- One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023.
- Improved distribution matching distillation for fast image synthesis. CoRR, abs/2405.14867, 2024.
- Scaling autoregressive models for content-rich text-to-image generation. CoRR, abs/2206.10789, 2022.
- Language model beats diffusion - tokenizer is key to visual generation. In The Twelfth International Conference on Learning Representations, ICLR. OpenReview.net, 2024.