
Scaling Diffusion Transformers Efficiently via $\mu$P

Published 21 May 2025 in cs.LG, cs.AI, and cs.CV | (2505.15270v1)

Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

Summary

Scaling Diffusion Transformers Efficiently via μP

In the paper Scaling Diffusion Transformers Efficiently via μP, the authors investigate the application of Maximal Update Parametrization (μP) to diffusion transformers, aiming to scale these models efficiently. Diffusion transformers have become integral to generative modeling in the vision domain, providing scalable architectures for tasks like image and video generation. However, their performance at scale is hindered by the high cost associated with hyperparameter tuning. The paper proposes a solution by adapting μP, originally developed for vanilla transformers, to diffusion transformers.
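As background, the μP recipe prescribes width-dependent scalings for weight initialization and per-layer learning rates so that training dynamics stay stable as a model grows. The following is a minimal illustrative sketch (not the paper's code) of the standard μP rules for hidden weight matrices trained with Adam, expressed relative to a tuned base width; the function name and values are hypothetical:

```python
import math

def mup_scaling(base_width, width, base_lr, base_std):
    """Illustrative muP scaling for hidden (matrix-like) weights under Adam.

    Relative to hyperparameters tuned at a base width:
      - init std scales like 1/sqrt(fan_in): std = base_std / sqrt(m)
      - Adam learning rate scales like 1/fan_in: lr = base_lr / m
    where m = width / base_width is the width multiplier.
    (Output-layer logits get an extra 1/width multiplier, not shown here.)
    """
    m = width / base_width
    return {
        "init_std": base_std / math.sqrt(m),
        "adam_lr": base_lr / m,
    }

# Hypothetical example: scale hyperparameters from width 128 to width 512.
cfg = mup_scaling(base_width=128, width=512, base_lr=1e-3, base_std=0.02)
```

Because these rules depend only on layer widths, hyperparameters found on a small model remain near-optimal after rescaling, which is the mechanism behind the transfer results discussed below.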

Core Contributions and Results

  1. Generalizing μP to Diffusion Transformers: The paper rigorously proves that the μP formulation for mainstream diffusion transformers, such as DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of vanilla transformers. This compatibility enables the direct application of existing μP methodologies, facilitating robust hyperparameter transferability across model scales.
  2. Performance Improvements: The authors demonstrate that implementing μP in diffusion transformers yields substantial gains. For instance, DiT-XL-2-μP with a transferred learning rate converged 2.9 times faster than the original DiT-XL-2, highlighting the potential for enhanced training efficiency.
  3. Scaling Text-to-Image Applications: The paper validates the effectiveness of μP in practical applications by scaling PixArt-α from 0.04B to 0.61B parameters, and MMDiT from 0.18B to 18B parameters. In both scenarios, μP models outperformed their baselines at a fraction of the usual tuning cost: 5.5% of one training run for PixArt-α, and 3% of the tuning consumption by human experts for MMDiT-18B.
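
The transfer workflow behind these results can be pictured as: sweep a hyperparameter on a narrow proxy model, then rescale the best value for the full-width model. A toy illustration under the μP rule that the Adam learning rate for hidden weights scales like 1/width (the widths and learning rate below are hypothetical, not the paper's values):

```python
def transfer_lr(best_proxy_lr, proxy_width, target_width):
    # muP rule for hidden weights under Adam: lr scales like 1/width,
    # so a learning rate tuned at proxy_width carries over after rescaling.
    return best_proxy_lr * (proxy_width / target_width)

# Hypothetical example: a learning rate found by a sweep on a narrow proxy,
# transferred to a 4x wider target model.
lr_target = transfer_lr(best_proxy_lr=3e-4, proxy_width=288, target_width=1152)
```

The payoff is that the expensive sweep happens only at proxy scale; the large model is trained once with the transferred value.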

Implications and Future Directions

The findings of this paper have significant implications for the field of generative modeling, especially as models continue to grow in size. Reducing the computational and time costs associated with hyperparameter tuning opens up pathways for more extensive and cost-efficient deployment of diffusion models, potentially revolutionizing applications in various domains such as art creation, virtual reality, and even synthetic data generation.

Theoretically, the paper lays the groundwork for future research into the scaling behaviors of diffusion transformers and the development of more sophisticated scaling laws that could further optimize large model training. Additionally, the robust results from applying μP suggest that similar approaches could be explored for other types of models and tasks beyond vision, such as audio or multimodal synthesis.

Conclusion

Scaling Diffusion Transformers Efficiently via μP introduces a practical and theoretically sound method for scaling diffusion transformers by leveraging the μP framework. The research provides a pathway for efficient hyperparameter tuning across different model scales, enhancing training speed and reducing computational costs. This work will likely catalyze further innovation in efficient model scaling, with broader implications for the future of artificial intelligence and generative models.
