FreeU: Free Lunch in Diffusion U-Net

Published 20 Sep 2023 in cs.CV (arXiv:2309.11497v2)

Abstract: In this paper, we uncover the untapped potential of diffusion U-Net, which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method, termed "FreeU", that enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps, to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated into existing diffusion models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference. Project page: https://chenyangsi.top/FreeU/.


Summary

  • The paper introduces FreeU, a method that boosts diffusion model quality by strategically balancing the U-Net backbone and skip connections.
  • It employs two scaling factors, including Fourier-based spectral modulation, to optimize feature contributions during the denoising process.
  • Experimental results on models like Stable Diffusion confirm improved visual fidelity and semantic alignment without incurring extra computational costs.

Analysis of "FreeU: Free Lunch in Diffusion U-Net"

The paper "FreeU: Free Lunch in Diffusion U-Net" introduces a computationally efficient method, termed FreeU, for enhancing the performance of diffusion models. Diffusion models are a prominent class of generative models, especially relevant in computer vision for their ability to produce high-quality samples across diverse applications, from image synthesis to text-to-video generation. The novelty of this work lies in improving sample quality without additional training, fine-tuning, or increased computational resources at inference time, offering a genuinely "free lunch."

Key Contributions

The authors identify the pivotal roles of the U-Net architecture's components in the denoising process intrinsic to diffusion models. They observe that the backbone primarily aids in denoising, while skip connections largely introduce high-frequency features, sometimes leading to suboptimal attention to backbone semantics. Leveraging this understanding, the authors propose the "FreeU" method, which involves the strategic re-weighting of contributions from both skip connections and backbone feature maps. This balance helps to capitalize on the strengths of each component, achieving enhanced generation quality by simply adjusting two scaling factors during inference.
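The re-weighting described above happens at the point where backbone and skip features meet in each U-Net decoder block. The sketch below (NumPy for clarity; real implementations operate on torch tensors) reduces the skip-side operation to a plain scalar for illustration, whereas the paper applies it via Fourier-domain modulation; the factor values `b` and `s` are illustrative placeholders, not the paper's tuned defaults:

```python
import numpy as np

# In a U-Net decoder block, upsampled backbone features h and encoder
# skip features h_skip are concatenated along the channel axis. FreeU
# inserts its two scaling factors at exactly this junction:
#   b > 1 amplifies the backbone (denoising) contribution,
#   s < 1 damps the skip (high-frequency) contribution.

def decoder_input(h, h_skip, b=1.2, s=0.9):
    """Return the re-weighted channel concatenation fed to the decoder block."""
    return np.concatenate([h * b, h_skip * s], axis=1)  # (B, C1 + C2, H, W)
```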

Methodology

FreeU operates on a simple premise: modulating the features from the skip connections and the backbone of the U-Net. Two scaling factors are introduced, one for the backbone and one for the skip connections. For the backbone, the authors use a structure-aware scaling derived from the channel-wise average of the feature map, which preserves structural integrity and avoids over-smoothing. For the skip connections, they apply spectral modulation in the Fourier domain to selectively attenuate low-frequency components, balancing the retention of high-frequency detail against effective denoising.
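The spectral modulation of skip features can be sketched as follows, written in NumPy for readability (real implementations use torch on GPU tensors). The factor values `b`, `s`, and the frequency `threshold` are illustrative placeholders; the paper tunes them per model:

```python
import numpy as np

def fourier_filter(x, threshold, scale):
    """Scale the low-frequency band of feature maps x of shape (B, C, H, W).

    The 2D spectrum is shifted so low frequencies sit at the center, a
    central (2*threshold)^2 window is multiplied by `scale`, and the
    result is transformed back to the spatial domain.
    """
    x_freq = np.fft.fftshift(np.fft.fft2(x, axes=(-2, -1)), axes=(-2, -1))
    B, C, H, W = x_freq.shape
    mask = np.ones((B, C, H, W))
    crow, ccol = H // 2, W // 2
    mask[..., crow - threshold:crow + threshold,
              ccol - threshold:ccol + threshold] = scale
    x_out = np.fft.ifft2(np.fft.ifftshift(x_freq * mask, axes=(-2, -1)),
                         axes=(-2, -1))
    return x_out.real

def free_u(backbone_feat, skip_feat, b=1.2, s=0.9, threshold=1):
    """Re-weight decoder inputs: amplify backbone, damp skip low frequencies."""
    return backbone_feat * b, fourier_filter(skip_feat, threshold, s)
```

With `scale = 1` the Fourier filter is an identity, so the unmodified model is recovered at `b = 1, s = 1`, which is why FreeU can be toggled at inference without touching the weights.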

Experimental Results

Extensive experimentation demonstrates the effectiveness of FreeU across multiple models and tasks. It integrates successfully into models such as Stable Diffusion and ModelScope, yielding significant improvement in the quality of generated samples: both visual fidelity and semantic alignment of images and videos improved with FreeU enabled. Preference evaluations conducted via user studies corroborated these findings, with FreeU-augmented models consistently preferred over the unmodified versions.

Implications and Future Directions

The implications of FreeU are notable for both practical and theoretical advancements in diffusion-based generative models. Practically, FreeU provides a mechanism to enhance existing systems without incurring additional computational costs, which is a significant advantage for real-world deployment where resource efficiency is crucial. Theoretically, the insights gained about the interaction between skip connections and the backbone in U-Net architectures could inform future designs of more efficient generative models.

Looking forward, FreeU's methodology may inspire exploration into other model architectures and their component interactions, potentially leading to broader applications. While the paper focuses primarily on visual domains, similar principles could be adapted for different types of data, expanding the reach of diffusion models across various generative tasks. Moreover, the approach could inspire further research into adaptive feature modulation in neural networks, tapping into the latent potentials within existing network architectures.
