- The paper presents a novel scale-wise autoregressive transformer that generates images progressively, aligning with human visual perception.
- It integrates a hierarchical VAE, non-causal self-attention, and classifier-free guidance adjustments to enhance efficiency and training stability.
- Empirical results demonstrate competitive image quality with sampling up to seven times faster than state-of-the-art diffusion models.
Analysis of "Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis"
The paper "Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis" presents a new methodology for text-to-image (T2I) generation. The authors introduce Switti, a scale-wise autoregressive transformer that substantially accelerates T2I generation while remaining competitive with state-of-the-art diffusion models. The work advances the ongoing exploration of autoregressive models in visual content generation, addressing limitations of previous approaches and offering concrete solutions for improving efficiency and quality.
Technical Overview
Switti is built on scale-wise autoregressive modeling, in which images are generated progressively scale by scale rather than by conventional next-token or masked autoregressive prediction. The model employs a hierarchical VAE, namely RQ-VAE, which transforms an image into a hierarchical sequence of scales, allowing the transformer to predict latent tokens at increasing resolutions starting from a low-resolution map. This coarse-to-fine synthesis aligns with the inductive biases of human visual perception.
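To make the scale decomposition concrete, here is a minimal, hypothetical sketch of the residual multi-scale idea behind hierarchical tokenizers such as RQ-VAE. It uses plain NumPy arrays with nearest-neighbor resampling in place of learned codebooks and quantization, so all function names and details are illustrative, not the paper's implementation:

```python
import numpy as np

def to_scales(latent, scale_sizes):
    """Decompose a square latent map into coarse-to-fine residual scales.

    Illustrative sketch: each scale encodes only what the previous,
    coarser scales failed to reconstruct (no learned codebook here).
    """
    residual = latent.copy()
    scales = []
    for s in scale_sizes:
        # Downsample the residual to the current scale (strided pick,
        # standing in for a proper pooling/quantization step).
        step = latent.shape[0] // s
        coarse = residual[::step, ::step]
        scales.append(coarse)
        # Upsample back and subtract: finer scales model only the remainder.
        upsampled = np.repeat(np.repeat(coarse, step, axis=0), step, axis=1)
        residual = residual - upsampled
    return scales

def from_scales(scales, full_size):
    """Reconstruct by summing the upsampled scales, coarse to fine."""
    recon = np.zeros((full_size, full_size))
    for coarse in scales:
        step = full_size // coarse.shape[0]
        recon += np.repeat(np.repeat(coarse, step, axis=0), step, axis=1)
    return recon
```

Because the finest scale absorbs the entire remaining residual, summing all upsampled scales recovers the original latent; the transformer's job is to predict each scale's tokens conditioned on the coarser ones.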
The authors also introduce several architectural innovations:
- Non-Causal Self-Attention: By removing causality from the attention mechanism when processing a scale, Switti reduces the computational cost of sampling. Because the model no longer needs to store the extensive key-value caches required by traditional causal transformers, both memory usage and inference time improve.
- Classifier-Free Guidance Adjustment: While classifier-free guidance (CFG) is pivotal for aligning generated images with text prompts, its application at higher resolutions often proves redundant and diminishes the quality of fine details. Switti disables CFG at the highest scales, resulting in a notable acceleration in sampling speed without compromising image quality.
- Transformer Advances: Modifications such as additional normalization layers and the replacement of GELU activations with SwiGLU stabilize training and give finer control over activation dynamics.
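The classifier-free guidance adjustment can be illustrated with a short, hypothetical sampling loop in which the unconditional forward pass is simply skipped at the finest scales, halving per-scale cost there. The `model` callable, the `cfg_cutoff` parameter, and the greedy decoding are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sample_scales(model, prompt_emb, null_emb, scale_sizes,
                  cfg_scale=6.0, cfg_cutoff=2):
    """Sketch of scale-wise sampling with classifier-free guidance (CFG)
    switched off at the finest scales.

    `model(tokens_so_far, emb, size)` is a hypothetical callable returning
    logits of shape (size * size, vocab). The last `cfg_cutoff` scales skip
    the unconditional forward pass entirely.
    """
    tokens = []
    for i, size in enumerate(scale_sizes):
        cond = model(tokens, prompt_emb, size)
        if i < len(scale_sizes) - cfg_cutoff:
            # Guided logits: push toward the text-conditioned prediction.
            uncond = model(tokens, null_emb, size)
            logits = uncond + cfg_scale * (cond - uncond)
        else:
            # Finest scales: one conditional pass instead of two.
            logits = cond
        tokens.append(np.argmax(logits, axis=-1))  # greedy, for illustration
    return tokens
```

Since the finest scales carry most of the tokens, dropping their unconditional pass removes a large fraction of total forward-pass compute while, per the paper, preserving text alignment established at coarser scales.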
Empirical Evaluation
Switti is evaluated against both autoregressive and diffusion-based models and shows substantial efficiency gains: sampling is up to seven times faster than current diffusion counterparts at comparable visual quality and text alignment. This makes Switti particularly suitable for applications requiring real-time or high-volume image generation.
Human evaluation studies further substantiate Switti's ability to produce aesthetically pleasing images with fewer defects than alternative models. Extensive automated metrics corroborate these qualitative findings, demonstrating robustness across a range of image quality benchmarks.
Implications and Future Directions
Switti's contributions underscore the potential of autoregressive models in visual domains, challenging the dominance of diffusion models by delivering similar quality more efficiently. This opens avenues for further refinement of scale-wise autoregressive frameworks, especially the exploration of improved image tokenizers or hybrid models that integrate continuous diffusion priors.
The authors acknowledge current limitations in supported output resolution and hierarchical VAE performance, suggesting potential improvements through richer training data or more sophisticated architectures. More effective hierarchical tokenizers could substantially enhance fidelity and detail in generated images, further closing the gap with state-of-the-art diffusion techniques.
Future research may pursue these directions, with broader implications for scale-wise modeling across domains such as video generation and 3D content creation. Switti not only demonstrates the efficacy of its design but also sets a benchmark for subsequent advances in T2I synthesis.