Simple diffusion: End-to-end diffusion for high resolution images
Abstract: Applying diffusion models directly in the pixel space of high-resolution images is currently difficult. Existing approaches instead run diffusion in a lower-dimensional space (latent diffusion) or chain multiple super-resolution stages of generation (cascades). The downside is that these approaches add complexity to the diffusion framework. This paper aims to improve denoising diffusion for high-resolution images while keeping the model as simple as possible. The paper is centered on the research question: how can one train a standard denoising diffusion model on high-resolution images and still obtain performance comparable to these alternative approaches? The four main findings are: 1) the noise schedule should be adjusted for high-resolution images, 2) it is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high-resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art image generation among diffusion models without sampling modifiers on ImageNet.
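The first finding, adjusting the noise schedule for resolution, can be illustrated with a minimal sketch. The idea is that the common cosine log-SNR schedule, tuned on small images, leaves high-resolution images with too little noise at low frequencies; shifting the log-SNR downward as a function of resolution compensates for this. The base resolution of 64 and the exact form of the shift below are assumptions made for illustration, not a verbatim reproduction of the paper's implementation:

```python
import math


def shifted_cosine_logsnr(t: float, resolution: int, base_resolution: int = 64) -> float:
    """Cosine log-SNR schedule with a resolution-dependent shift.

    The standard cosine schedule gives logSNR(t) = -2 * log(tan(pi * t / 2))
    for t in (0, 1). Adding 2 * log(base_resolution / resolution) lowers the
    log-SNR for resolutions above the base, i.e. larger images are noised
    more heavily at the same diffusion time t.
    """
    logsnr = -2.0 * math.log(math.tan(math.pi * t / 2.0))
    return logsnr + 2.0 * math.log(base_resolution / resolution)


def alpha_sigma(logsnr: float) -> tuple[float, float]:
    """Variance-preserving coefficients with alpha^2 + sigma^2 = 1."""
    alpha_sq = 1.0 / (1.0 + math.exp(-logsnr))  # sigmoid(logSNR)
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)
```

At the schedule midpoint `t = 0.5` the unshifted cosine schedule gives log-SNR 0 (equal parts signal and noise); with the shift, a 256x256 image at the same `t` is already noticeably noisier, which is the intended effect.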