- The paper introduces HDiT, a diffusion model that uses a hierarchical hourglass design to synthesize high-resolution images with linear scaling.
- The model outperforms comparable pixel-space diffusion approaches on benchmarks such as FFHQ-1024² and ImageNet-256², demonstrating strong quality and efficiency.
- The approach eliminates the need for complex multiscale architectures, opening opportunities for applications in video generation and super-resolution.
Introduction
The paper introduces the Hourglass Diffusion Transformer (HDiT), which tackles the challenge of generating high-quality, high-resolution images directly in pixel space using diffusion models. This work stands out due to its subquadratic scaling with pixel count, leveraging the hierarchical structure commonly seen in U-Nets while embracing the scalability of transformer architectures. As diffusion models have become pivotal in state-of-the-art image generation tasks, HDiT offers a compelling approach that negates the need for cumbersome techniques like multiscale architectures or latent autoencoders, which are typically required for training at high resolutions.
Methodology
HDiT builds on the transformer architecture, adapting it to the computational demands of high-resolution image synthesis. By organizing the transformer in a hierarchical, hourglass fashion, HDiT handles the growing cost of larger pixel counts, achieving linear computational complexity, O(n), in the number of pixels rather than the quadratic complexity, O(n²), of traditional transformer-based diffusion approaches.
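As a rough illustration of why this matters, the back-of-envelope sketch below (our own, not from the paper; the level count and attention window size are made-up parameters) compares the token-pair interactions of full global self-attention with an hourglass that uses fixed-window local attention at high resolutions and global attention only at a downsampled bottleneck:

```python
def global_attention_cost(n_tokens: int) -> int:
    # Self-attention over n tokens costs O(n^2) token-pair interactions.
    return n_tokens * n_tokens

def hourglass_cost(n_tokens: int, levels: int = 3, window: int = 49) -> int:
    """Windowed local attention at each high-resolution level, global
    attention only at the small bottleneck. `levels` and `window` are
    illustrative values, not the paper's configuration."""
    cost = 0
    n = n_tokens
    for _ in range(levels):
        cost += n * window  # each token attends to a fixed-size window: O(n)
        n //= 4             # 2x2 downsampling quarters the token count
    cost += n * n           # global attention on the small bottleneck grid
    return cost

# Global attention grows 16x when the resolution doubles (4x the tokens),
# while the hourglass cost grows roughly 4x, i.e. linearly in token count.
for res in (64, 128, 256):
    n = res * res
    print(res, global_attention_cost(n) // hourglass_cost(n))
```

The growing ratio printed per resolution shows the quadratic-vs-linear gap widening as the image gets larger.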
Hierarchical Structure
The architecture mirrors a U-Net: token resolution is progressively reduced toward a low-resolution bottleneck and then restored, with skip connections between matching levels. High-resolution stages restrict attention to local neighborhoods, reserving expensive global self-attention for the small bottleneck grid.
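A minimal, shapes-only sketch of this hourglass data flow (our own illustration with no learned layers; `merge_2x2`, `split_2x2`, and `hourglass` are hypothetical names, and real HDiT blocks interleave attention layers and learned skip merging):

```python
import numpy as np

def merge_2x2(x: np.ndarray) -> np.ndarray:
    # Patch merging: fold each 2x2 spatial block into the channel dimension,
    # halving height and width and quartering the token count.
    h, w, c = x.shape
    return (x.reshape(h // 2, 2, w // 2, 2, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h // 2, w // 2, 4 * c))

def split_2x2(x: np.ndarray) -> np.ndarray:
    # Inverse of merge_2x2: unfold channels back into a 2x2 spatial block.
    h, w, c = x.shape
    return (x.reshape(h, w, 2, 2, c // 4)
             .transpose(0, 2, 1, 3, 4)
             .reshape(2 * h, 2 * w, c // 4))

def hourglass(x: np.ndarray, levels: int = 2) -> np.ndarray:
    skips = []
    for _ in range(levels):       # encoder: local attention would act here
        skips.append(x)
        x = merge_2x2(x)
    # bottleneck: global self-attention would act on the small grid here
    for skip in reversed(skips):  # decoder: restore resolution level by level
        x = split_2x2(x)
        x = x + skip              # skip connection (HDiT uses learned merging)
    return x

x = np.random.randn(16, 16, 8)
print(hourglass(x).shape)  # (16, 16, 8): resolution is fully restored
```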
Implementation Details
Figure 1 provides visual evidence of the high-quality outputs achieved by the HDiT model. This is significant given the reduced computational demands compared to existing models.
- Diffusion Process: The model adopts a Gaussian noising process with a denoising neural network, using EDM's preconditioning techniques together with an adapted (soft) variant of Min-SNR loss weighting for robust training across varying noise levels.
- Scaling and Efficiency: HDiT's transformer layers are designed to scale linearly with the number of tokens, ensuring that even at megapixel resolutions, the model remains computationally feasible. This efficiency is a result of strategically limiting the use of computationally expensive global self-attention.
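The loss-weighting idea above can be sketched as follows (an illustrative reconstruction, not the paper's code: it assumes unit-variance data and the γ = 5 default from the original Min-SNR work, and the smoothed variant shown is one common "soft-min" form that may differ from HDiT's exact formula):

```python
def snr(sigma: float) -> float:
    """Signal-to-noise ratio of a sample noised as x + sigma * eps,
    assuming unit-variance data (an assumption of this sketch)."""
    return 1.0 / (sigma * sigma)

def min_snr_weight(sigma: float, gamma: float = 5.0) -> float:
    # Plain Min-SNR: clamp the weight so nearly-clean (low-noise) samples
    # cannot dominate the training loss.
    return min(snr(sigma), gamma)

def soft_min_snr_weight(sigma: float, gamma: float = 5.0) -> float:
    # Smoothed variant: a harmonic-mean-style blend of SNR and gamma that
    # approaches min(snr, gamma) at the extremes but is differentiable
    # everywhere. Hypothetical form for illustration.
    s = snr(sigma)
    return s * gamma / (s + gamma)
```

At high noise (large σ) both weights approach the raw SNR, while at low noise they saturate near γ instead of blowing up.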
Results and Evaluation
The model's performance was evaluated on several benchmarks, including FFHQ-1024² for megapixel-scale synthesis and ImageNet-256² for class-conditional generation.
Conclusion
HDiT represents a significant advancement in high-resolution image synthesis, demonstrating that diffusion models can achieve exceptional quality directly in pixel space without extensive architectural modifications or additional training tricks. By capitalizing on the hierarchical hourglass design, HDiT efficiently manages the challenges of scaling in both computational and expressive terms. Future work could explore integrating HDiT with task-specific adaptations such as self-conditioning, potentially improving performance further in specialized applications like video generation or super-resolution. The model opens up new possibilities for scalable, high-quality image synthesis directly in pixel space.