D-AR: Diffusion via Autoregressive Models

Published 29 May 2025 in cs.CV | (2505.23660v1)

Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing the tokenizer that converts images into sequences of discrete tokens, where tokens in different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. Therefore, we apply standard next-token prediction on these tokens, without modifying any underlying designs (either causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in the streaming manner. Our pipeline naturally reveals several intriguing properties, for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures of visual synthesis, especially with LLMs. Code and models will be available at https://github.com/showlab/D-AR

Abstract PDF Upgrade to Chat

Summary

The paper introduces D-AR, reinterpreting diffusion as an autoregressive token prediction task to streamline visual generation.
It employs a sequential diffusion tokenizer that converts images into discrete, coarse-to-fine token sequences, aligning with standard AR models.
Experiments on ImageNet 256×256 show competitive FID scores and efficient streaming previews, underscoring its practical impact.

Diffusion via Autoregressive Models: A Technical Overview

Introduction

The paper "D-AR: Diffusion via Autoregressive Models" presents an innovative approach to visual generation by integrating the strengths of diffusion models and autoregressive models (AR). Autoregressive models, particularly LLMs, have proven effective in sequential data processing, mainly demonstrated in NLP tasks. However, adapting the standard next-token prediction paradigm of AR models to visual synthesis has been challenging due to the lack of inherent linear structure in images. In contrast, diffusion models, which originate from noise to refine outputs through iterative denoising, excel in creating high-quality images but suffer from inherently slow sampling processes due to their dense operational nature.

To bridge these paradigms, the authors propose the D-AR framework, which reinterprets the image diffusion process as an autoregressive token prediction task. This approach employs a sequential diffusion tokenizer that converts the visual information into a sequence of discrete tokens. Each token in the sequence corresponds to different steps in the diffusion process. By ensuring these tokens follow a coarse-to-fine ordering, the method naturally aligns with the AR paradigm, enabling efficient visual generation while leveraging the advantages of diffusion models.

Methodology

Sequential Diffusion Tokenizer

The cornerstone of the D-AR framework is the sequential diffusion tokenizer. This component encodes images into discrete token sequences, which reflect progressive denoising steps. The tokenizer operates in two main phases: encoding and decoding.

During encoding, images are transformed into a sequence of discrete tokens via a transformer-based encoder followed by vector quantization. The decoding phase utilizes a diffusion transformer model to process these tokens, resembling the denoising pathway inherent in diffusion models. Importantly, the ordering of tokens in the sequence mirrors the progression from coarse, global semantics to fine, local details.

Figure 1: Sequential diffusion tokenizer structure. When training the tokenizer, the pixel diffusion transformer in the tokenizer decoder calculates the velocity loss with the selected group of tokens, $\mathbf{c}(t)$ , as conditioning tokens.

This linearized token sequence allows standard autoregressive models to predict upcoming tokens without alterations to standard autoregressive mechanisms. Thus, the D-AR methodology retains a vanilla AR architecture, simplifying integration with existing LLM infrastructures.

Autoregressive Modeling

Once tokens are arranged by the sequential diffusion tokenizer, they are processed using a decoder-only transformer model adhering strictly to the autoregressive paradigm. This process generates images by predicting tokens sequentially, performing diffusion denoising steps simultaneously as new tokens are produced.

Figure 2: Different paradigms incorporating diffusion and autoregressive models for vision generation. (a) uses patch-level diffusion during every single autoregressive step to tackle continuous outputs~\cite{mar,causalfusion,diffusionforcing}; ours use vanilla AR models, which can train with discrete inputs/outputs by simply cross entropy, and sequentially decode output tokens with our diffusion tokenizer.

The critical advantage of this setup is that it maintains a Markovian diffusion sampling procedure managed by AR-induced token conditioning, which enables several sophisticated properties including fast inference compatible with KV cache, streaming pixel decoding, and zero-shot layout-controlled synthesis by prefix token conditioning.

Experimental Results

The D-AR framework was assessed on the ImageNet $256\times256$ benchmark, showcasing its competitive performance. The authors utilized models with 343M and 775M parameters, achieving FID scores of 2.44 and 2.09 respectively, which are competitive against existing state-of-the-art methods.

Figure 3: Uncurated generated samples from D-AR-XL with 256\times256 resolutions (CFG=4.0).

The methodology demonstrated superior efficiency in generating consistent previews from partial token sequences, highlighting its suitability for practical applications in AI-driven image synthesis.

Figure 4: Consistent previews as generation trajectories for every increment of 32 tokens (a group). Note that these previews can be generated in a streaming manner with AR tokens partially generated.

Moreover, the framework's ability to condition specific layout tokens in a zero-shot manner underscores its versatility in generating controlled image layouts without additional training, promoting its use in adaptive and creative tasks.

Figure 5: Zero-shot layout-controlled synthesis.

Conclusion

The introduction of D-AR represents a pivotal step in advancing AI-driven visual synthesis by leveraging the intrinsic strengths of both diffusion models and autoregressive frameworks. By transforming diffusion processes into token sequences, this method ensures alignment with the AR paradigm while optimizing efficiency and capability in visual generation tasks. The framework not only excels in generating high-quality images but also simplifies integration with modern LLM architectures, potentially inspiring further research into unified multimodal systems. Future research could explore extending D-AR to encompass native text-to-image generation capabilities, broadening its applicability across AI disciplines.

Markdown Report Issue