- The paper introduces HDiT, a diffusion model that uses a hierarchical hourglass design to synthesize high-resolution images with linear scaling.
- The model outperforms comparable pixel-space diffusion approaches on benchmarks such as FFHQ-1024² and ImageNet-256², demonstrating strong quality and efficiency.
- The approach eliminates the need for complex multiscale architectures, opening opportunities for applications in video generation and super-resolution.
Introduction
The paper introduces the Hourglass Diffusion Transformer (HDiT), which tackles the challenge of generating high-quality, high-resolution images directly in pixel space using diffusion models. This work stands out due to its subquadratic scaling with pixel count, leveraging the hierarchical structure commonly seen in U-Nets while embracing the scalability of transformer architectures. As diffusion models have become pivotal in state-of-the-art image generation tasks, HDiT offers a compelling approach that negates the need for cumbersome techniques like multiscale architectures or latent autoencoders, which are typically required for training at high resolutions.
Methodology
HDiT builds on the transformer architecture, adapting it to the computational demands of high-resolution image synthesis. By organizing the transformer in a hierarchical, hourglass fashion, HDiT handles the growing cost of larger pixel counts, achieving linear computational complexity, O(n), in the number of pixels rather than the quadratic complexity, O(n²), of traditional transformer-based diffusion approaches.
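As a rough illustration of why this matters, the back-of-envelope sketch below (our own, not from the paper; the level count and attention window size are made-up parameters) compares the token-pair interactions of full global self-attention with an hourglass that uses fixed-window local attention at high resolutions and global attention only at a downsampled bottleneck:

```python
def global_attention_cost(n_tokens: int) -> int:
    # Self-attention over n tokens costs O(n^2) token-pair interactions.
    return n_tokens * n_tokens

def hourglass_cost(n_tokens: int, levels: int = 3, window: int = 49) -> int:
    """Windowed local attention at each high-resolution level, global
    attention only at the small bottleneck. `levels` and `window` are
    illustrative values, not the paper's configuration."""
    cost = 0
    n = n_tokens
    for _ in range(levels):
        cost += n * window  # each token attends to a fixed-size window: O(n)
        n //= 4             # 2x2 downsampling quarters the token count
    cost += n * n           # global attention on the small bottleneck grid
    return cost

# Global attention grows 16x when the resolution doubles (4x the tokens),
# while the hourglass cost grows roughly 4x, i.e. linearly in token count.
for res in (64, 128, 256):
    n = res * res
    print(res, global_attention_cost(n) // hourglass_cost(n))
```

The growing ratio printed per resolution shows the quadratic-vs-linear gap widening as the image gets larger.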
Hierarchical Structure
The architecture mirrors a U-Net: token resolution is progressively reduced toward a low-resolution bottleneck and then restored, with skip connections between matching levels. High-resolution stages restrict attention to local neighborhoods, reserving expensive global self-attention for the small bottleneck grid.
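A minimal, shapes-only sketch of this hourglass data flow (our own illustration with no learned layers; `merge_2x2`, `split_2x2`, and `hourglass` are hypothetical names, and real HDiT blocks interleave attention layers and learned skip merging):

```python
import numpy as np

def merge_2x2(x: np.ndarray) -> np.ndarray:
    # Patch merging: fold each 2x2 spatial block into the channel dimension,
    # halving height and width and quartering the token count.
    h, w, c = x.shape
    return (x.reshape(h // 2, 2, w // 2, 2, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h // 2, w // 2, 4 * c))

def split_2x2(x: np.ndarray) -> np.ndarray:
    # Inverse of merge_2x2: unfold channels back into a 2x2 spatial block.
    h, w, c = x.shape
    return (x.reshape(h, w, 2, 2, c // 4)
             .transpose(0, 2, 1, 3, 4)
             .reshape(2 * h, 2 * w, c // 4))

def hourglass(x: np.ndarray, levels: int = 2) -> np.ndarray:
    skips = []
    for _ in range(levels):       # encoder: local attention would act here
        skips.append(x)
        x = merge_2x2(x)
    # bottleneck: global self-attention would act on the small grid here
    for skip in reversed(skips):  # decoder: restore resolution level by level
        x = split_2x2(x)
        x = x + skip              # skip connection (HDiT uses learned merging)
    return x

x = np.random.randn(16, 16, 8)
print(hourglass(x).shape)  # (16, 16, 8): resolution is fully restored
```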
Implementation Details
Figure 1 provides visual evidence of the high-quality outputs achieved by the HDiT model. This is significant given the reduced computational demands compared to existing models.
- Diffusion Process: The model adopts a Gaussian noising process with a denoising neural network, using EDM's preconditioning techniques together with an adapted (soft) variant of Min-SNR loss weighting for robust training across varying noise levels.
- Scaling and Efficiency: HDiT's transformer layers are designed to scale linearly with the number of tokens, ensuring that even at megapixel resolutions, the model remains computationally feasible. This efficiency is a result of strategically limiting the use of computationally expensive global self-attention.
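The loss-weighting idea above can be sketched as follows (an illustrative reconstruction, not the paper's code: it assumes unit-variance data and the γ = 5 default from the original Min-SNR work, and the smoothed variant shown is one common "soft-min" form that may differ from HDiT's exact formula):

```python
def snr(sigma: float) -> float:
    """Signal-to-noise ratio of a sample noised as x + sigma * eps,
    assuming unit-variance data (an assumption of this sketch)."""
    return 1.0 / (sigma * sigma)

def min_snr_weight(sigma: float, gamma: float = 5.0) -> float:
    # Plain Min-SNR: clamp the weight so nearly-clean (low-noise) samples
    # cannot dominate the training loss.
    return min(snr(sigma), gamma)

def soft_min_snr_weight(sigma: float, gamma: float = 5.0) -> float:
    # Smoothed variant: a harmonic-mean-style blend of SNR and gamma that
    # approaches min(snr, gamma) at the extremes but is differentiable
    # everywhere. Hypothetical form for illustration.
    s = snr(sigma)
    return s * gamma / (s + gamma)
```

At high noise (large σ) both weights approach the raw SNR, while at low noise they saturate near γ instead of blowing up.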
Results and Evaluation
The model's performance was evaluated on several benchmarks, including FFHQ-1024² for megapixel-scale synthesis and ImageNet-256² for class-conditional generation.
Conclusion
HDiT represents a significant advancement in high-resolution image synthesis, demonstrating that diffusion models can achieve exceptional quality directly in pixel space without extensive architectural modifications or additional training tricks. By capitalizing on the hierarchical hourglass design, HDiT efficiently manages the challenges of scaling in both computational and expressive terms. Future work could explore integrating HDiT with task-specific adaptations such as self-conditioning, potentially improving performance further in specialized applications like video generation or super-resolution. The model opens up new possibilities for scalable, high-quality image synthesis directly in pixel space.