Multi-Reference Autoregression in Image Generation
- MRAR is an autoregressive conditioning scheme introduced in TransDiff, a framework that combines an AR Transformer with a diffusion model to generate high-quality, diverse images.
- It conditions its predictions on multiple previously generated image latents, enriching context and boosting synthesis performance.
- Empirical results show reduced FID and enhanced diversity on ImageNet benchmarks, highlighting its significance in advanced image generation.
Multi-Reference Autoregression (MRAR) is a paradigm introduced for autoregressive image generation within the TransDiff framework, which combines an autoregressive (AR) Transformer with a diffusion model. MRAR addresses the challenge of generating high-quality, diverse images by allowing the model to condition its predictions on multiple previously generated image latents, thereby enriching contextual information for each generation step. Within the hybrid architecture of TransDiff, MRAR has demonstrated significant improvements in both quantitative and qualitative metrics for image synthesis, particularly on the ImageNet 256×256 benchmark (Zhen et al., 11 Jun 2025).
1. Problem Statement and Formalism
Given a discrete class label $y$ (or global condition) and a sequence of previously generated image latents $x_1, \dots, x_{i-1}$, the objective at generation step $i$ is to autoregressively produce the next image latent $x_i$. Each reference latent resides in $\mathbb{R}^{(H/f) \times (W/f) \times D}$, with $H$, $W$ denoting image height and width, $f$ the downsampling factor, and $D$ the feature width.
The autoregressive factorization, central to MRAR, is formalized as:

$$p(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}, y)$$
In MRAR, each conditional is realized by composing an AR Transformer for contextual encoding, followed by a diffusion decoder:
- $c_i = \mathrm{ARTransformer}(y, x_1, \dots, x_{i-1})$, where $c_i$ is the contextual feature for step $i$
- $x_i = \mathrm{DiffusionDecoder}(c_i, \varepsilon)$, with $\varepsilon \sim \mathcal{N}(0, I)$.
At inference, the input sequence contains all prior latents as references plus a "Mask" embedding standing in for the unknown target.
2. System Architecture
Input Embeddings
- Class embedding: a learned embedding of the class label $y$, prepended to the sequence.
- Mask embedding: a learned placeholder token standing in for the unknown target latent.
- Reference latents: the previously generated latents $x_1, \dots, x_{i-1}$.
AR Transformer
- Stack Depth / Width: 24–40 layers, width 768–1280.
- Attention Mask: Causal (lower-triangular) over the sequence $[\text{class}; \text{mask}; x_1; \dots; x_{i-1}]$.
- Output: contextual features $c_i$ used for diffusion conditioning.
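The causal mask described above can be sketched in a few lines; the sequence layout `[class; mask; x_1; …]` is taken from the text, and the boolean-matrix representation is one common convention, not necessarily the exact one used in TransDiff:

```python
# Minimal sketch of a causal (lower-triangular) attention mask: query
# position q may attend only to key positions k <= q. Row layout follows
# the assumed sequence [class; mask; x_1; ...; x_{i-1}].

def causal_mask(seq_len):
    # mask[q][k] is True where attention from position q to k is allowed.
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

mask = causal_mask(4)  # e.g. a 4-token sequence [C; Mask; x_1; x_2]
```

The first token attends only to itself, while the last token sees the full prefix, which is what lets later generation steps condition on every prior latent.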
Diffusion Decoder
- DiT-style Transformer: Trained with rectified-flow (Flow Matching) objectives.
- Inputs: Concatenated conditioning features $c$ and Gaussian noise $\varepsilon \sim \mathcal{N}(0, I)$.
- Sampling: Single-step or few-step Euler–Maruyama with two learned scale factors.
Variants
- 1-Step AR: Feeds only the class and mask embeddings, with bidirectional attention and no past-image references.
- MRAR: Concatenates and causally attends over all prior image latents, providing richer cross-iteration context.
3. Training Objectives and Loss Functions
AR Feature Loss
A simple per-token feature reconstruction loss is used, penalizing the discrepancy between each predicted feature and its target (e.g., an $\ell_2$ distance).
Diffusion (Flow-Matching) Loss
Rectified-flow loss is adopted: for $t \sim \mathcal{U}(0, 1)$, data latent $x_0$, and noise $\varepsilon \sim \mathcal{N}(0, I)$, with straight-line paths $x_t = (1 - t)\,x_0 + t\,\varepsilon$:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, x_0,\, \varepsilon} \left[ \lVert v_\theta(x_t, t, c) - (\varepsilon - x_0) \rVert_2^2 \right]$$
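The flow-matching objective can be sketched with scalar stand-ins; the "model" here is a lambda, not the TransDiff decoder:

```python
# Rectified-flow (flow-matching) training target: form x_t on the straight
# line between data x0 and noise eps, then regress the model's velocity
# prediction onto the constant path velocity (eps - x0).

def flow_matching_loss(model, x0, eps, t):
    x_t = (1.0 - t) * x0 + t * eps    # straight-line interpolation
    target = eps - x0                 # constant velocity of the path
    return (model(x_t, t) - target) ** 2

x0, eps = 0.5, -1.0
oracle = lambda x_t, t: eps - x0      # perfect predictor: zero loss
loss = flow_matching_loss(oracle, x0, eps, t=0.3)
```

In training, the model would also receive the conditioning features $c$ from the AR Transformer; they are omitted here for brevity.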
Joint Loss (1-Step AR regime)
In the 1-Step AR regime, training combines the AR feature loss and the flow-matching loss.
MRAR Fine-Tuning
No additional loss terms are introduced for MRAR; training continues with the same objective, but the conditioning features $c_i$ become richer due to the multiple references seen by the AR Transformer.
4. Generation Algorithm
The MRAR inference process is as follows:
```
Inputs:  class token C, mask embedding Mask_img, number of steps n
Outputs: generated image R_img

Initialize empty list X_imgs = []
For i in 0…n:
    if i == 0:
        input_seq = [C; Mask_img]
    else:
        input_seq = [C; Mask_img; X_imgs[0]; …; X_imgs[i-1]]
    c_i = ARTransformer(input_seq)   # output features
    X_imgs.append(c_i)               # treated as "latent image"
c = Concat([c_0; c_1; …; c_n])
O = DiffusionDecoder(c, ε)           # O = [o_0; …; o_n]
R_img = o_n                          # final generated image
return R_img
```
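The generation loop can be rendered as a minimal executable sketch; `ar_transformer` and `diffusion_decoder` below are arithmetic stand-in stubs (the real components are neural networks), so only the control flow matches the algorithm:

```python
# Executable rendering of the MRAR generation loop with stub models.

def ar_transformer(input_seq):
    # Stub: summarize the whole sequence into one context feature.
    return sum(input_seq)

def diffusion_decoder(c, eps):
    # Stub: "denoise" each conditioning feature independently.
    return [ci + eps for ci in c]

def generate(C, mask_img, n, eps=0.0):
    X_imgs = []
    for i in range(n + 1):
        input_seq = [C, mask_img] + X_imgs   # class, mask, all prior latents
        c_i = ar_transformer(input_seq)      # AR features for this step
        X_imgs.append(c_i)                   # treated as a latent image
    O = diffusion_decoder(X_imgs, eps)       # decode all steps jointly
    return O[-1]                             # final generated image

img = generate(C=1, mask_img=0, n=2)
```

Note how each iteration's input sequence grows by one latent, which is exactly the multi-reference context that distinguishes MRAR from 1-Step AR.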
This regime contrasts with single-step AR (which lacks multiple references), enabling explicit causal conditioning and richer intermediate context.
5. Quantitative and Qualitative Results
Model Performance Table
| Model | FID ↓ | IS ↑ | Diversity Measure ↓ | Time ↓ |
|---|---|---|---|---|
| DiT-XL/2 (Diffusion only) | 2.27 | 278.2 | — | 45s |
| RAR-L (AR only) | 1.70 | 290.5 | — | 15s |
| TransDiff-L, 1-Step AR (Hybrid) | 1.69 | 282.0 | 0.44 | 0.2s |
| TransDiff-L, MRAR (Hybrid+MRAR) | 1.49 | 282.2 | 0.39 | 0.8s |
| TransDiff-H, MRAR (Hybrid+MRAR) | 1.42 | 301.2 | — | 1.6s |
- Diversity Measure: a norm of the pairwise cosine-similarity matrix of generated features; lower implies greater output diversity.
- MRAR reduces FID by 0.20 compared to 1-Step AR and by 1.29 over Scale-Level AR.
- Human evaluations: 60 raters judged 200 images; MRAR outperformed Token-AR and Scale-AR in subject diversity, background diversity, and overall visual quality.
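The diversity measure can be sketched as follows; the Frobenius norm used here is an assumption, since the source does not name which norm is taken over the cosine-similarity matrix:

```python
import math

# Diversity measure sketch: norm of the pairwise cosine-similarity matrix
# of generated feature vectors. Lower score = more diverse outputs.
# (Frobenius norm is an assumption, not confirmed by the source.)

def diversity_score(features):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    sims = [cos(a, b) for a in features for b in features]
    return math.sqrt(sum(s * s for s in sims))   # Frobenius norm

identical  = [(1.0, 0.0), (1.0, 0.0)]   # collapsed outputs: high score
orthogonal = [(1.0, 0.0), (0.0, 1.0)]   # maximally diverse pair: low score
```

Collapsed outputs score 2.0 here, versus √2 for an orthogonal pair, matching the "lower is more diverse" reading of the table.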
Ablation and Diversity
- As the number of references increases, FID decreases until approximately four references, after which gains plateau.
- Feature fusion by mixing semantic features from distinct classes yields outputs blending those semantics, demonstrating AR feature representations encode high-level class structure.
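The feature-fusion observation amounts to taking a convex combination of AR features from two classes; the vectors and mixing weight below are illustrative stand-ins, not values from the paper:

```python
# Feature-fusion sketch: linearly mixing semantic AR features from two
# distinct classes yields a conditioning vector between them, which the
# decoder renders as a semantic blend.

def fuse(c_a, c_b, alpha=0.5):
    # Convex combination of two semantic feature vectors.
    return [alpha * a + (1 - alpha) * b for a, b in zip(c_a, c_b)]

mixed = fuse([1.0, 0.0], [0.0, 1.0])   # blend of two class semantics
```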
6. Mechanisms and Intuitions
MRAR enhances the AR-diffusion pipeline by allowing the Transformer to causally attend over a sequence of fully-decoded image latents, providing a comprehensive context for subsequent image generation. This approach yields:
- Increased Diversity: AR features derived from multiple references promote diverse activations, as measured by reduced cosine similarity.
- Improved Class Coverage: Multi-modal conditioning enables the model to span a wider portion of the class manifold.
- Augmented Diffusion Decoding: Conditioning vectors aggregate multi-faceted semantic context, facilitating lower FID and higher IS. Empirical findings indicate that greater diversity in the AR features correlates with improved FID, establishing MRAR as a key factor in maximizing output diversity and quality.
7. Impact and Future Directions
The introduction of MRAR within TransDiff establishes a new paradigm for autoregressive image generation, demonstrating superior FID, semantic diversity, and inference speed compared to standard AR Transformer or diffusion-only architectures. MRAR's integration requires only a minimal architectural extension, suggesting applicability to other AR-diffusion hybrid models. Potential future research may explore scaling MRAR to even larger datasets, varying the depth of autoregressive references, and applying the approach to non-image modalities (Zhen et al., 11 Jun 2025).