
StyleSwin: Transformer-based GAN for High-resolution Image Generation

Published 20 Dec 2021 in cs.CV (arXiv:2112.10762v2)

Abstract: Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated on-par ability with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts Swin transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality. The proposed StyleSwin is scalable to high resolutions, with both the coarse geometry and fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing the local attention in a block-wise manner may break the spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially on high resolutions, e.g., 1024x1024. The StyleSwin, without complex training strategies, excels over StyleGAN on CelebA-HQ 1024, and achieves on-par performance on FFHQ-1024, proving the promise of using transformers for high-resolution image generation. The code and models will be available at https://github.com/microsoft/StyleSwin.


Summary

  • The paper introduces a transformer-based GAN that leverages Swin transformers and a double attention mechanism to efficiently generate high-resolution images.
  • It incorporates a style-based architecture with local-global positional encoding and a wavelet discriminator to produce artifact-free outputs.
  • Results demonstrate that StyleSwin surpasses StyleGAN on the CelebA-HQ dataset at 1024x1024 resolution with an FID of 4.43, highlighting its robustness.

StyleSwin: Transformer-Based GAN for High-Resolution Image Generation

Overview

The paper "StyleSwin: Transformer-based GAN for High-resolution Image Generation" introduces a novel approach leveraging pure transformers to build a generative adversarial network (GAN) for high-resolution image synthesis. Traditionally, convolutional networks (ConvNets) have dominated image generative modeling, particularly through architectures like StyleGAN. This paper explores the potential of transformers, particularly the Swin transformer, to match that performance in high-resolution image generation by overcoming the computational challenges typically associated with attention.

Methodology

The authors present several key innovations within the StyleSwin model:

  1. Local Attention with Swin Transformer: The use of Swin transformers allows for window-based local attention, balancing computational efficiency with modeling capacity. This approach mitigates the quadratic cost of global self-attention, permitting scalability to resolutions as high as 1024x1024 pixels.
  2. Double Attention Mechanism: To capture a wider context without excessive computation, a double attention mechanism is employed. It processes both local and shifted windows, expanding the transformer’s receptive field efficiently.
  3. Style-Based Architecture: Inspired by StyleGAN, StyleSwin incorporates a style-based architecture, using the latent style space W to modulate feature maps effectively, significantly enhancing generation capacity.
  4. Local-Global Positional Encoding: The authors address limitations in positional awareness by integrating sinusoidal positional encoding alongside relative positional encoding, helping the model leverage global position information effectively.
  5. Wavelet Discriminator: To suppress blocking artifacts in high-resolution synthesis, a wavelet discriminator examines the spectral discrepancies, effectively guiding the generator to produce artifact-free outputs.
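The double attention idea above can be sketched in a few lines: half of the channels attend within regular windows and the other half within shifted windows, then the halves are concatenated. This is a simplified illustration (single head, no projections, no reverse window merge), not the paper's implementation; all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, ws, shift=0):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows,
    optionally cyclically shifted as in Swin."""
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, ws * ws, C)  # (num_windows, tokens_per_window, C)

def attention(q, k, v):
    # Scaled dot-product self-attention within each window.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def double_attention(x, ws=4):
    """One half of the channels attends in regular windows, the other half
    in shifted windows; both contexts are seen in a single layer."""
    C = x.shape[-1]
    halves = (x[..., : C // 2], x[..., C // 2 :])
    out = []
    for part, shift in zip(halves, (0, ws // 2)):
        w = window_partition(part, ws, shift)
        out.append(attention(w, w, w))
    return np.concatenate(out, axis=-1)

x = np.random.randn(8, 8, 16).astype(np.float32)
y = double_attention(x, ws=4)
print(y.shape)  # (4, 16, 16): num_windows, tokens, channels
```

A real block would additionally merge the windows back into a feature map and undo the cyclic shift; the point here is only that two window layouts are attended to at the same cost as one global-attention head would be at much smaller resolution.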
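The absolute positional information mentioned in item 4 can be supplied by a standard 2-D sinusoidal encoding, where half of the channels encode the row index and half the column index. This is a generic sketch of such an encoding, not the paper's exact scheme:

```python
import numpy as np

def sinusoidal_pe_2d(H, W, C):
    """Per-pixel absolute position encoding: half the channels encode the
    row index, half the column index, with standard sin/cos frequencies."""
    def pe_1d(pos, dim):
        i = np.arange(dim // 2)
        freq = 1.0 / (10000 ** (2 * i / dim))
        angles = pos[:, None] * freq[None, :]
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    row = pe_1d(np.arange(H), C // 2)   # (H, C/2)
    col = pe_1d(np.arange(W), C // 2)   # (W, C/2)
    pe = np.concatenate(
        [np.repeat(row[:, None, :], W, axis=1),
         np.repeat(col[None, :, :], H, axis=0)], axis=-1)
    return pe  # (H, W, C), added to the feature map before attention

pe = sinusoidal_pe_2d(8, 8, 16)
print(pe.shape)  # (8, 8, 16)
```

Because the encoding depends only on absolute coordinates, it restores the global position cue that window-based relative encodings discard.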
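The wavelet discriminator in item 5 rests on decomposing an image into frequency bands; blocking artifacts show up as energy in the high-frequency detail bands. A minimal sketch of one level of the 2-D Haar transform (the simplest such decomposition; the discriminator architecture itself is not reproduced here):

```python
import numpy as np

def haar_dwt(img):
    """One level of the 2-D Haar transform: returns the low-frequency
    approximation plus horizontal, vertical, and diagonal detail bands,
    each at half the input resolution."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4   # approximation (low-low)
    lh = (a - b + c - d) / 4   # horizontal detail
    hl = (a + b - c - d) / 4   # vertical detail
    hh = (a - b - c + d) / 4   # diagonal detail
    return ll, lh, hl, hh

img = np.random.rand(64, 64)
ll, lh, hl, hh = haar_dwt(img)
print(ll.shape)  # (32, 32)
```

A discriminator branch that scores the detail bands (lh, hl, hh) sees exactly the spectral discrepancies that window-wise attention introduces, which is why this signal suppresses blocking artifacts: a perfectly smooth image has all-zero detail bands, while block boundaries produce sharp responses there.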

Results

StyleSwin achieves competitive results compared to state-of-the-art GANs, particularly on high-resolution datasets. Notably, on the CelebA-HQ dataset at 1024x1024 resolution, StyleSwin surpasses StyleGAN with an FID of 4.43. The method performs competently on other datasets like FFHQ and LSUN Church, demonstrating its robustness across varied data.

Implications and Future Research

The implications of this research are significant for advancing high-resolution image generation with transformers. By integrating transformers into the generator's architecture, the model demonstrates improved expressivity and an ability to capture complex dependencies over large image scales. These findings open interesting avenues for further research, especially in exploring transformer capabilities for other generative tasks and optimizations for increased efficiency.

Future research could enhance the attention mechanisms for better locality and global coherence, and pursue further architectural refinements to streamline transformer operations within GANs. Additionally, tackling the challenges in training dynamics and the data requirements of transformers relative to ConvNets presents a valuable direction for exploration.

The integration of more sophisticated discriminators to guide artifact-free synthesis while maintaining computational feasibility could further enhance the applicability and effectiveness of transformers in generative modeling.

In summary, "StyleSwin" presents a compelling step forward in leveraging transformer architectures for high-resolution image generation, offering promising results and laying the groundwork for ongoing advancements in the field of generative adversarial networks.
