DeepCache: Accelerating Diffusion Models for Free

Published 1 Dec 2023 in cs.CV and cs.AI | (arXiv:2312.00858v2)

Abstract: Diffusion models have recently gained unprecedented attention in the field of image synthesis due to their remarkable generative capabilities. Notwithstanding their prowess, these models often incur substantial computational costs, primarily attributed to the sequential denoising process and cumbersome model size. Traditional methods for compressing diffusion models typically involve extensive retraining, presenting cost and feasibility challenges. In this paper, we introduce DeepCache, a novel training-free paradigm that accelerates diffusion models from the perspective of model architecture. DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models, which caches and retrieves features across adjacent denoising stages, thereby curtailing redundant computations. Utilizing the property of the U-Net, we reuse the high-level features while updating the low-level features in a very cheap way. This innovative strategy, in turn, enables a speedup factor of 2.3$\times$ for Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score, and 4.1$\times$ for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments also demonstrate DeepCache's superiority over existing pruning and distillation methods that necessitate retraining and its compatibility with current sampling techniques. Furthermore, we find that under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS. The code is available at https://github.com/horseee/DeepCache


Summary

  • The paper introduces a training-free approach that caches high-level U-Net features to reduce redundant computations in diffusion models.
  • It achieves up to 2.3× speedup on Stable Diffusion and 4.1× on LDM models while maintaining near-original quality.
  • The method is compatible with fast samplers and suggests a modular acceleration pathway for efficient generative modeling.

DeepCache: Training-Free Acceleration of Diffusion Models via Feature Caching

Introduction

Diffusion models (DMs) have established themselves as state-of-the-art generative models for tasks ranging from image synthesis to text-to-image generation and beyond. However, the sequential reverse denoising process inherent to models such as DDPM, LDM, and Stable Diffusion imposes substantial computational cost and latency during inference, hindering adoption in resource-constrained and real-time scenarios. Traditional strategies for alleviating this burden include sampling-step reduction, model pruning, distillation, and quantization, most of which require supplementary training or access to large datasets and compute resources. The paper "DeepCache: Accelerating Diffusion Models for Free" (2312.00858) introduces a distinct, training-free paradigm for accelerating DMs by exploiting temporal redundancy in U-Net-based diffusion architectures.

Methodology

The fundamental insight motivating DeepCache is the high temporal similarity observed in high-level features of the U-Net across adjacent denoising steps. Empirical analysis conducted across multiple diffusion models demonstrates that certain upsampling block features remain nearly invariant between consecutive timesteps. DeepCache leverages this by caching the outputs of selected upsampling blocks during one full forward pass (the "cache step") and reusing these cached features during subsequent denoising iterations (the "retrieve steps"), performing only shallow partial inference for the intervening steps.
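This temporal similarity is typically quantified with cosine similarity between a block's outputs at adjacent timesteps. The snippet below is an illustrative sketch only: the feature vectors are synthetic stand-ins for flattened up-block activations, not outputs of a real U-Net.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Synthetic stand-ins for an up-block's output at steps t and t-1:
# adjacent-step features are assumed to differ only by a small perturbation.
feat_t      = [0.90, -1.20, 0.30, 2.10]
feat_t_prev = [0.91, -1.19, 0.28, 2.12]
print(cosine_similarity(feat_t, feat_t_prev))  # close to 1.0
```

In the paper's analysis, this kind of near-unity similarity between consecutive timesteps is what justifies reusing the cached high-level features rather than recomputing them.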

The acceleration process is formalized as alternating full and partial U-Net inferences: for a caching interval $N$, the high-level main-branch features are recomputed only once and then reused for $N-1$ steps, during which only the lightweight downsampling blocks and relevant skip connections are computed afresh. This enables skipping the majority of the U-Net's computational graph across a substantial fraction of the denoising trajectory. The approach also introduces a non-uniform interval selection mechanism to address the empirical observation that temporal similarity decays non-uniformly across the denoising process, adjusting the cache refresh schedule accordingly to balance computational efficiency with sample fidelity.
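The uniform-interval schedule can be sketched as follows. The cached string is a hypothetical placeholder for the stored up-block tensors, and the full/partial branches stand in for a complete U-Net pass versus a shallow pass; the paper's non-uniform schedule is omitted for brevity.

```python
def run_deepcache(total_steps: int, interval: int):
    """Sketch of DeepCache's alternating schedule with a uniform interval.

    Every `interval`-th step is a cache step (full U-Net pass that stores
    high-level features); the remaining steps are retrieve steps that
    recompute only the shallow branch and reuse the cache.
    """
    cache = None
    trace = []
    for t in range(total_steps):
        if t % interval == 0:
            # Cache step: full forward pass; store high-level up-block features.
            cache = f"features@{t}"  # stand-in for cached tensors
            trace.append("full")
        else:
            # Retrieve step: shallow pass only, reusing the cached features.
            assert cache is not None
            trace.append("partial")
    return trace

trace = run_deepcache(total_steps=10, interval=5)
# trace == ['full'] + ['partial'] * 4 + ['full'] + ['partial'] * 4
```

With interval $N$, only a $1/N$ fraction of steps pays the full U-Net cost, which is the source of the reported speedups.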

Empirical Evaluation

Acceleration and Quality Retention

DeepCache achieves a speedup of up to 2.3$\times$ on Stable Diffusion v1.5 (50 PLMS steps) with a negligible 0.05 decline in CLIP Score, and 4.1$\times$ on LDM-4-G (250 DDIM steps) with only a marginal change (0.22) in FID on ImageNet. On CIFAR-10, LSUN-Bedroom, and LSUN-Churches, DeepCache surpasses post-training pruning and distillation approaches, maintaining superior FID and efficiency without any retraining overhead.

Compatibility and Comparisons

The method is inherently compatible with existing fast samplers (e.g., DDIM, PLMS). When controlling for throughput rather than step count, DeepCache produces comparable or marginally improved generation metrics relative to DDIM and PLMS. For example, at the same throughput, DeepCache with $N=3$ matches DDIM with 91 steps on LDM-4-G in terms of FID, while offering higher acceleration and lower inference cost.

Structural Analysis and Ablations

DeepCache's efficacy is sensitive to the choice of skip branch for caching: selecting shallower branches yields greater speedups while incurring only minimal fidelity degradation at moderate caching intervals ($N<5$). Ablation studies verify that the cached high-level features contribute nontrivially to denoising, and that shallow partial inferences with cached features significantly outperform zero-initialized baselines. The trade-off between speed and sample quality is controllable via $N$ and the branch selection.

Limitations

DeepCache's acceleration upper bound is constrained by the structure of the underlying U-Net. If the shallowest skip branch accounts for a large proportion of network FLOPs, achievable speedups are limited. Performance degrades for large cache intervals ($N \gg 10$), as high-level feature staleness becomes significant, particularly in later denoising stages with larger feature variations.
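This ceiling can be made concrete with a back-of-the-envelope cost model. Assume a retrieve step costs a fraction `shallow_frac` of a full step (the share of FLOPs in the shallowest skip branch) and ignore caching overheads; this is an idealized sketch, not measured numbers from the paper.

```python
def deepcache_speedup(interval: int, shallow_frac: float) -> float:
    """Idealized speedup for a caching interval, where shallow_frac is the
    fraction of per-step FLOPs still computed on retrieve steps."""
    # Average cost per step: one full pass plus (interval - 1) partial passes.
    avg_cost = (1.0 + (interval - 1) * shallow_frac) / interval
    return 1.0 / avg_cost

# As the interval grows, speedup approaches the ceiling 1 / shallow_frac:
for n in (2, 5, 10, 100):
    print(n, round(deepcache_speedup(n, shallow_frac=0.2), 2))
```

With `shallow_frac = 0.2`, no interval can exceed a 5$\times$ speedup, which is why a U-Net whose shallow branch is relatively expensive bounds what DeepCache can achieve.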

Theoretical and Practical Implications

DeepCache delineates a new axis for efficient inference in diffusion models: exploiting intra-trajectory feature redundancy rather than modifying model weights or architecture via retraining. This differs from distillation and pruning pipelines, making it particularly suitable for deployment with large, fixed pre-trained models (e.g., widely distributed Stable Diffusion checkpoints) where retraining is undesirable or infeasible. The approach interacts synergistically with fast solvers and step reduction methods, suggesting a modularity in DM acceleration pipelines that is orthogonal to prior art.

Future Directions

Given the robustness of DeepCache across models and datasets, prospective extensions include automated search for optimal cache schedules and branch selection, dynamic feature similarity analysis for adaptive caching, or hardware-aware implementations to maximize batch throughput. Integration with quantization or other lightweight architectures may push acceleration further in real-time and edge settings. The caching principle itself may inspire analogous mechanisms in other sequential generative frameworks or even autoregressive transformers.

Conclusion

DeepCache introduces a mechanism-level acceleration technique for diffusion model inference, achieving significant computational gains without retraining or architectural modification. By strategically caching and reusing temporally local high-level U-Net features, DeepCache realizes fast, high-fidelity generation and demonstrates compatibility with other acceleration methodologies. It represents an advance in practitioner-friendly, training-free DM acceleration and opens up new possibilities for efficient generative modeling and deployment (2312.00858).
