
Scaling Diffusion Transformers Efficiently via $\mu$P

Published 21 May 2025 in cs.LG, cs.AI, and cs.CV | (2505.15270v1)

Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

Summary

Scaling Diffusion Transformers Efficiently via μP

In the paper Scaling Diffusion Transformers Efficiently via μP, the authors investigate the application of Maximal Update Parametrization (μP) to diffusion transformers, aiming to scale these models efficiently. Diffusion transformers have become integral to generative modeling in the vision domain, providing scalable architectures for tasks like image and video generation. However, their performance at scale is hindered by the high cost associated with hyperparameter tuning. The paper proposes a solution by adapting μP, originally developed for vanilla transformers, to diffusion transformers.
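As background, the μP recipe prescribes width-dependent scalings for weight initialization and per-layer learning rates so that training dynamics stay stable as a model grows. The following is a minimal illustrative sketch (not the paper's code) of the standard μP rules for hidden weight matrices trained with Adam, expressed relative to a tuned base width; the function name and values are hypothetical:

```python
import math

def mup_scaling(base_width, width, base_lr, base_std):
    """Illustrative muP scaling for hidden (matrix-like) weights under Adam.

    Relative to hyperparameters tuned at a base width:
      - init std scales like 1/sqrt(fan_in): std = base_std / sqrt(m)
      - Adam learning rate scales like 1/fan_in: lr = base_lr / m
    where m = width / base_width is the width multiplier.
    (Output-layer logits get an extra 1/width multiplier, not shown here.)
    """
    m = width / base_width
    return {
        "init_std": base_std / math.sqrt(m),
        "adam_lr": base_lr / m,
    }

# Hypothetical example: scale hyperparameters from width 128 to width 512.
cfg = mup_scaling(base_width=128, width=512, base_lr=1e-3, base_std=0.02)
```

Because these rules depend only on layer widths, hyperparameters found on a small model remain near-optimal after rescaling, which is the mechanism behind the transfer results discussed below.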

Core Contributions and Results

  1. Generalizing μP to Diffusion Transformers: The paper rigorously proves that the μP formulation for mainstream diffusion transformers, such as DiT, U-ViT, PixArt-α, and MMDiT, aligns with that of vanilla transformers. This compatibility enables the direct application of existing μP methodologies, facilitating robust hyperparameter transferability across model scales.
  2. Performance Improvements: The authors demonstrate that implementing μP in diffusion transformers yields substantial gains. For instance, DiT-XL-2-μP with a transferred learning rate converged 2.9 times faster than the original DiT-XL-2, highlighting the potential for enhanced training efficiency.
  3. Scaling Text-to-Image Applications: The paper validates the effectiveness of μP in practical applications by scaling PixArt-α from 0.04B to 0.61B parameters, and MMDiT from 0.18B to 18B parameters. In both scenarios, μP models outperformed their baselines at a fraction of the usual tuning cost: 5.5% of one training run for PixArt-α, and 3% of the tuning consumption by human experts for MMDiT-18B.
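
The transfer workflow behind these results can be pictured as: sweep a hyperparameter on a narrow proxy model, then rescale the best value for the full-width model. A toy illustration under the μP rule that the Adam learning rate for hidden weights scales like 1/width (the widths and learning rate below are hypothetical, not the paper's values):

```python
def transfer_lr(best_proxy_lr, proxy_width, target_width):
    # muP rule for hidden weights under Adam: lr scales like 1/width,
    # so a learning rate tuned at proxy_width carries over after rescaling.
    return best_proxy_lr * (proxy_width / target_width)

# Hypothetical example: a learning rate found by a sweep on a narrow proxy,
# transferred to a 4x wider target model.
lr_target = transfer_lr(best_proxy_lr=3e-4, proxy_width=288, target_width=1152)
```

The payoff is that the expensive sweep happens only at proxy scale; the large model is trained once with the transferred value.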

Implications and Future Directions

The findings of this paper have significant implications for the field of generative modeling, especially as models continue to grow in size. Reducing the computational and time costs associated with hyperparameter tuning opens up pathways for more extensive and cost-efficient deployment of diffusion models, potentially revolutionizing applications in various domains such as art creation, virtual reality, and even synthetic data generation.

Theoretically, the paper lays the groundwork for future research into the scaling behaviors of diffusion transformers and the development of more sophisticated scaling laws that could further optimize large model training. Additionally, the robust results from applying μP suggest that similar approaches could be explored for other types of models and tasks beyond vision, such as audio or multimodal synthesis.

Conclusion

Scaling Diffusion Transformers Efficiently via μP introduces a practical and theoretically sound method for scaling diffusion transformers by leveraging the μP framework. The research provides a pathway for efficient hyperparameter tuning across different model scales, enhancing training speed and reducing computational costs. This work will likely catalyze further innovation in efficient model scaling, with broader implications for the future of artificial intelligence and generative models.
