High-Resolution Image Synthesis via Next-Token Prediction

Published 22 Nov 2024 in cs.CV, cs.AI, and cs.LG (arXiv:2411.14808v2)

Abstract: Recently, autoregressive models have demonstrated remarkable performance in class-conditional image generation. However, the application of next-token prediction to high-resolution text-to-image generation remains largely unexplored. In this paper, we introduce D-JEPA·T2I, an autoregressive model based on continuous tokens that incorporates innovations in both architecture and training strategy to generate high-quality, photorealistic images at arbitrary resolutions, up to 4K. Architecturally, we adopt the denoising joint embedding predictive architecture (D-JEPA) while leveraging a multimodal visual transformer to effectively integrate textual and visual features. Additionally, we introduce flow matching loss alongside the proposed Visual Rotary Positional Embedding (VoPE) to enable continuous resolution learning. In terms of training strategy, we propose a data feedback mechanism that dynamically adjusts the sampling procedure based on statistical analysis and an online learning critic model. This encourages the model to move beyond its comfort zone, reducing redundant training on well-mastered scenarios and compelling it to address more challenging cases with suboptimal generation quality. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction.

Summary

  • The paper introduces D-JEPA·T2I, an autoregressive framework that leverages next-token prediction, VoPE, and flow matching loss for 4K high-resolution image synthesis.
  • It integrates a multimodal visual transformer to combine textual and visual features, enhancing image coherence and detail.
  • Experimental validations show that D-JEPA·T2I outperforms prior models on established benchmarks, demonstrating its efficiency and robustness in generating high-fidelity images.

High-Resolution Image Synthesis via Next-Token Prediction: An Overview

The paper presents a significant advancement in high-resolution image synthesis, focusing on the application of next-token prediction. It introduces D-JEPA·T2I, an autoregressive model extending the D-JEPA framework to high-resolution text-to-image (T2I) synthesis. The main architectural innovations are a flow matching loss and a multimodal visual transformer (MVT) combined with Visual Rotary Positional Embedding (VoPE), which together address the challenges autoregressive models face in generating high-resolution images.
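The paper's exact VoPE formulation is not reproduced here; as a rough illustration of the family of methods it builds on, the sketch below applies a generic 2D rotary embedding, rotating half the feature channels by row index and half by column index. All function names are illustrative, and normalizing positions for resolution consistency is the motivation the summary attributes to VoPE rather than its precise mechanism.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Apply a 1-D rotary embedding to features x of shape (n, d), d even.

    Each feature pair (x[2i], x[2i+1]) is rotated by an angle proportional
    to the token's position -- an illustrative sketch, not the paper's
    exact VoPE formulation.
    """
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # (half,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]     # (n, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2-D variant: rotate half the channels by row index, half by column index."""
    d = x.shape[1]
    h = d // 2
    out = x.copy()
    out[:, :h] = rope_1d(x[:, :h], rows)
    out[:, h:] = rope_1d(x[:, h:], cols)
    return out

# A 2x2 token grid with 8-dim features.
rows = np.array([0, 0, 1, 1], dtype=float)
cols = np.array([0, 1, 0, 1], dtype=float)
x = np.ones((4, 8))
y = rope_2d(x, rows, cols)
# Rotations preserve the norm of each feature vector,
# and the token at position (0, 0) is left unchanged.
assert np.allclose(np.linalg.norm(y, axis=1), np.linalg.norm(x, axis=1))
assert np.allclose(y[0], x[0])
```

Because rotations only mix pairs of channels, attention scores between rotated features depend on relative rather than absolute positions, which is the property that makes rotary schemes attractive for variable-resolution inputs.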

Key Innovations

  1. Multimodal Integration: The paper innovatively employs a multimodal visual transformer to integrate textual and visual features, enhancing the generative capacity for textual prompts. This is critical in maintaining the coherence and integrity of high-resolution synthesized images.
  2. Visual Rotary Positional Embedding (VoPE): VoPE is specifically designed for vision models, improving continuous resolution learning by addressing challenges associated with varying image scales and aspect ratios. This positional encoding mechanism avoids issues found in sinusoidal positional encodings, which struggle with positional information consistency when images undergo operations like cropping or scaling.
  3. Flow Matching Loss: The introduction of a flow matching loss as an alternative to traditional diffusion losses accelerates model convergence and enhances image quality significantly. By facilitating more efficient distribution modeling of tokenized image data, the proposed flow matching loss becomes pivotal in achieving high fidelity in generated images.
  4. Data Feedback Mechanism: Enhancing data utilization through real-time feedback effectively addresses data bias in large-scale datasets. By continuously adjusting data distributions according to training performance, D-JEPA·T2I leverages evolving data distributions, optimizing the iterative training process and improving the model's robustness in high-resolution image generation.
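The flow matching loss in item 3 can be sketched with a common instantiation (a rectified-flow style linear interpolation path); this is a generic illustration under that assumption, not the paper's exact implementation, and the zero-velocity "model" is a hypothetical stand-in for the transformer's prediction head.

```python
import numpy as np

def flow_matching_loss(model, x1, rng):
    """Rectified-flow style flow matching loss (one common instantiation).

    Draw noise x0 ~ N(0, I) and a time t ~ U(0, 1), form the linear
    interpolant x_t = (1 - t) * x0 + t * x1, and regress the model's
    predicted velocity onto the constant target v = x1 - x0.
    A generic sketch, not the paper's exact loss.
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # per-example time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1              # linear interpolant
    v_target = x1 - x0                        # target velocity field
    v_pred = model(xt, t)
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)

# A trivial stand-in "model" that predicts zero velocity everywhere.
zero_model = lambda xt, t: np.zeros_like(xt)

batch = rng.standard_normal((16, 8))          # 16 continuous tokens, dim 8
loss = flow_matching_loss(zero_model, batch, rng)
assert loss > 0.0                             # nonzero regression error
```

Compared to a denoising-diffusion objective, the regression target here is a simple constant velocity along a straight path, which is one reason flow matching objectives are often reported to converge faster.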

Experimental Validation

The authors conducted thorough evaluations validating D-JEPA·T2I’s performance against established benchmarks like T2I-CompBench and GenEval, in addition to detailed human preference tests. The model notably outperformed previous frameworks in generating complex high-resolution imagery, with empirical results showcasing capabilities extending to 4K resolution.

Implications and Future Directions

This paper's methodology presents a compelling direction for enhancing the efficiency and effectiveness of autoregressive models in image synthesis, rivalling traditional diffusion models. Practically, leveraging an autoregressive architecture for T2I synthesis could lead to more resource-efficient training and higher throughput, potentially scaling to even larger, more complex datasets and image resolutions.

Theoretical advancements proposed by the paper, particularly VoPE and the flow matching loss, have broader implications for both representation learning and cross-modal applications in machine learning. Moving forward, exploration into integrating these methodologies in unified multimodal frameworks could unveil new potentials in video generation and real-time interactive applications.

In summary, "High-Resolution Image Synthesis via Next-Token Prediction" provides essential insights and tools that not only push the boundaries of image synthesis technologies but also lay down foundational strategies for future research and application in AI-driven image generation.
