FreeU: Free Lunch in Diffusion U-Net
Abstract: In this paper, we uncover the untapped potential of the diffusion U-Net, which serves as a "free lunch" that substantially improves generation quality on the fly. We first investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method, termed "FreeU", that enhances generation quality without additional training or fine-tuning. Our key insight is to strategically re-weight the contributions of the U-Net's skip connections and backbone feature maps so as to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that FreeU can be readily integrated into existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender, and ReVersion, to improve generation quality with only a few lines of code. All that is needed is to adjust two scaling factors during inference. Project page: https://chenyangsi.top/FreeU/.
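The re-weighting described in the abstract can be sketched as follows. This is a minimal, simplified illustration of the idea, not the paper's exact implementation: `freeu_reweight`, the example values of the backbone factor `b` and skip factor `s`, and the plain channel-wise scaling (the paper additionally applies frequency-domain filtering to skip features) are assumptions for illustration.

```python
import numpy as np

def freeu_reweight(backbone_feat, skip_feat, b=1.2, s=0.9):
    """Simplified sketch of FreeU's inference-time re-weighting.

    backbone_feat, skip_feat: decoder-stage feature maps of shape (C, H, W).
    b: backbone scaling factor; b > 1 amplifies the backbone's
       denoising contribution.
    s: skip scaling factor; s < 1 damps the high-frequency features
       carried by the skip connection.
    The values of b and s here are illustrative, not tuned settings.
    """
    # Amplify the backbone's contribution to the decoder block.
    backbone_feat = backbone_feat * b
    # Attenuate the skip connection's contribution.
    skip_feat = skip_feat * s
    # The decoder block then concatenates the two along the channel
    # axis, exactly as a standard U-Net does.
    return np.concatenate([backbone_feat, skip_feat], axis=0)

# Illustrative use with dummy feature maps.
backbone = np.ones((4, 8, 8))
skip = np.ones((4, 8, 8))
fused = freeu_reweight(backbone, skip, b=1.2, s=0.9)
```

Because the change is just two scalar multiplications inserted before the existing concatenation, it adds no parameters and requires no retraining, which is why the abstract describes it as "only a few lines of code".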