PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Published 30 Sep 2023 in cs.CV | (2310.00426v3)

Abstract: The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-LLM to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

Summary

The paper proposes a decomposed training strategy for T2I synthesis by splitting pixel dependency learning, text-image alignment, and aesthetic refinement.
The paper achieves competitive photorealistic image synthesis with only 10.8% of the training time of Stable Diffusion v1.5 (675 vs. 6,250 A100 GPU days), recording a zero-shot FID of 7.32 on COCO.
The paper integrates cross-attention and advanced auto-labeling techniques to lower computational costs and environmental impact without sacrificing output quality.

Overview of PixArt-$\alpha$: Efficient Diffusion Transformers for Photorealistic Text-to-Image Synthesis

The paper introduces PixArt-$\alpha$, a Transformer-based diffusion model designed for photorealistic text-to-image (T2I) synthesis. Its main contribution is image quality competitive with current state-of-the-art methods, such as Stable Diffusion and Imagen, at a small fraction of the compute and associated emissions usually required to train large-scale T2I models.

The authors focus on the training cost and environmental footprint of existing generative models and propose a change in the training paradigm to reduce both. PixArt-$\alpha$ achieves competitive results with only 10.8% of the training time of Stable Diffusion v1.5 and about 1% of the training cost of the larger RAPHAEL model, which makes it an economically feasible option for academic groups and startups.
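
The ratios quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures stated there (675 vs. 6,250 A100 GPU days; 26,000 vs. 320,000 USD); the helper name `relative_cost` is introduced here for illustration and is not from the paper.

```python
# Figures quoted in the abstract (A100 GPU days and approximate USD cost).
PIXART_GPU_DAYS = 675
SD15_GPU_DAYS = 6_250
PIXART_COST_USD = 26_000
SD15_COST_USD = 320_000


def relative_cost(ours: float, baseline: float) -> float:
    """Return `ours` as a fraction of `baseline` (illustrative helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return ours / baseline


time_fraction = relative_cost(PIXART_GPU_DAYS, SD15_GPU_DAYS)
cost_fraction = relative_cost(PIXART_COST_USD, SD15_COST_USD)
savings_usd = SD15_COST_USD - PIXART_COST_USD

if __name__ == "__main__":
    print(f"training time: {time_fraction:.1%} of SD v1.5")
    print(f"training cost: {cost_fraction:.1%} of SD v1.5")
    print(f"absolute saving: ${savings_usd:,}")
```

The time fraction comes out at 10.8%, and the absolute saving of 294,000 USD matches the abstract's "nearly 300,000".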

Core Contributions

Training Strategy Decomposition:

The T2I task is decomposed into three subproblems:
- Pixel Dependency Learning: Focuses on learning the intrinsic structure of natural images, initialized with a class-condition model.
- Text-Image Alignment Learning: Aligns text descriptions with image content using data with high concept density.
- High Aesthetic Quality Synthesis: Fine-tunes the model with aesthetically superior data to enhance visual quality.
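
The three subproblems above can be read as a sequential schedule in which each stage starts from the state left by the previous one. The following is a minimal sketch of that idea only: the `Stage` class, `run_schedule`, and `toy_train_stage` are hypothetical names, and the toy "training" step just records which kind of data was seen. It is not the authors' training code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str  # hypothetical identifier for the stage
    goal: str  # what the stage is meant to optimize
    data: str  # kind of data the summary says the stage uses


STAGES: List[Stage] = [
    Stage("pixel_dependency", "learn natural image structure",
          "class-conditional images"),
    Stage("text_image_alignment", "align text with image content",
          "text-image pairs with high concept density"),
    Stage("aesthetic_quality", "raise visual quality",
          "aesthetically superior data"),
]

State = Dict[str, List[str]]


def run_schedule(
    stages: List[Stage],
    train_stage: Callable[[Stage, State], State],
    init_state: State,
) -> Tuple[State, List[str]]:
    """Run stages in order; each stage resumes from the previous state."""
    state = init_state
    completed: List[str] = []
    for stage in stages:
        state = train_stage(stage, state)
        completed.append(stage.name)
    return state, completed


def toy_train_stage(stage: Stage, state: State) -> State:
    """Stand-in for real training: only records the data source used."""
    return {"trained_on": state["trained_on"] + [stage.data]}


final_state, order = run_schedule(STAGES, toy_train_stage, {"trained_on": []})
```

In a real pipeline `train_stage` would load the previous checkpoint and optimize on the stage's dataset; the point here is only the ordering and the hand-off between stages.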

Efficient T2I Transformer: The architecture adapts the Diffusion Transformer (DiT) by adding cross-attention layers to inject text conditions, re-parameterizing so that ImageNet-pretrained weights can be reused, and replacing the per-block class-condition branch with adaLN-single, which reduces parameter count and computational cost while maintaining model performance.
High-Informative Data: To improve training efficiency, the authors use the LLaVA vision-language model to auto-label images with dense captions, producing text-image pairs with rich semantic content and addressing the caption-quality limitations of existing datasets.
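
To give a feel for why adaLN-single saves parameters, here is a rough count under stated assumptions. It assumes DiT-XL/2-like dimensions (hidden size 1152, 28 blocks), a standard adaLN layer that maps the conditioning vector to six modulation vectors per block with one linear layer, and an adaLN-single variant that shares one such layer across all blocks and adds a small per-block embedding. None of these numbers come from the summary above; the function names are illustrative.

```python
def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of a single dense layer."""
    return d_in * d_out + d_out


def adaln_per_block_params(hidden: int, num_blocks: int, n_mod: int = 6) -> int:
    """Assumed baseline: every block owns its own modulation layer."""
    return num_blocks * linear_params(hidden, n_mod * hidden)


def adaln_single_params(hidden: int, num_blocks: int, n_mod: int = 6) -> int:
    """Assumed adaLN-single: one shared layer plus per-block embeddings."""
    shared = linear_params(hidden, n_mod * hidden)
    per_block_embeddings = num_blocks * n_mod * hidden
    return shared + per_block_embeddings


# Assumed DiT-XL/2-like configuration (not stated in the summary).
HIDDEN, BLOCKS = 1152, 28

per_block = adaln_per_block_params(HIDDEN, BLOCKS)
single = adaln_single_params(HIDDEN, BLOCKS)
saved = per_block - single

if __name__ == "__main__":
    print(f"per-block adaLN: {per_block:,} parameters")
    print(f"adaLN-single:    {single:,} parameters")
    print(f"difference:      {saved:,} parameters")
```

Under these assumptions the modulation parameters drop from roughly 223 million to roughly 8 million, a saving on the order of 200 million parameters.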

Experimental Analysis

The model performs competitively on several benchmarks:

Fidelity and Alignment: Achieves a zero-shot FID score of 7.32 on the COCO dataset, performing robustly compared to other top models.
Compositional Capabilities: Excels in T2I-CompBench metrics including attribute binding and object relationships, underscoring effective text-image alignment capabilities.
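
For reference, the FID cited above is the standard Fréchet Inception Distance, stated here from its usual definition rather than from the paper. With $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denoting the mean and covariance of Inception features for real and generated images respectively:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)
$$

Lower values indicate that the generated feature distribution is closer to the real one; "zero-shot" means the model was not trained on COCO.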

Despite using a smaller dataset and a streamlined training process, the model's synthesis quality is corroborated by user studies, in which participants preferred its outputs over those of established models such as SDXL, particularly for semantic alignment with prompts.

Technical Implications and Future Work

PixArt-$\alpha$ serves as a significant step in balancing the trade-off between resource-heavy model training and image generation quality, highlighting the potential of architectural and training innovations to improve efficiency. The demonstrated reduction in both financial and environmental costs extends an invitation to further explore similar advancements in generative modeling, suggesting a broader industry shift towards sustainable AI development.

Future research might focus on enhancing specific capabilities of the model, such as handling detailed object interactions and generating distinct textual elements, areas which the current paper acknowledges as limitations. The opportunity also lies in exploring further integrations of PixArt-$\alpha$ within customized generation frameworks, exemplified by DreamBooth and ControlNet enhancements, which could broaden its applicability across diverse visual domains.

In conclusion, PixArt-$\alpha$ not only introduces a competitive generative model in terms of performance and efficiency but also paves the way for responsible AI research and development that aligns with environmental sustainability goals. This work is seminal in its illustration of how strategic design innovations in model architecture and training methodologies can produce impactful advancements in AI with reduced resource expenditure.