
Multi-Reference Autoregression in Image Generation

Updated 5 December 2025
  • MRAR is a paradigm that combines AR transformers with diffusion models to generate high-quality and diverse images.
  • It conditions its predictions on multiple previously generated image latents, enriching context and boosting synthesis performance.
  • Empirical results show reduced FID and enhanced diversity on ImageNet benchmarks, highlighting its significance in advanced image generation.

Multi-Reference Autoregression (MRAR) is a paradigm introduced for autoregressive image generation within the TransDiff framework, which combines an autoregressive (AR) Transformer with a diffusion model. MRAR addresses the challenge of generating high-quality, diverse images by allowing the model to condition its predictions on multiple previously generated image latents, thereby enriching contextual information for each generation step. Within the hybrid architecture of TransDiff, MRAR has demonstrated significant improvements in both quantitative and qualitative metrics for image synthesis, particularly on the ImageNet 256×256 benchmark (Zhen et al., 11 Jun 2025).

1. Problem Statement and Formalism

Given a discrete class label $C$ (or global condition) and a sequence of $k$ previously generated image latents $x_{\text{img}_0}, \dots, x_{\text{img}_{k-1}}$, the objective at generation step $k$ is to autoregressively produce the next image latent $x_{\text{img}_k}$. Each reference latent $x_{\text{img}_j}$ resides in $\mathbb{R}^{(h/f \times w/f) \times (d \cdot f^2)}$, with $h$, $w$ denoting image height and width, $f$ the downsampling factor, and $d$ the feature width.
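As a concrete check of the latent shape, a minimal sketch (the values of $f$ and $d$ here are illustrative, not the paper's exact configuration):

```python
# Hypothetical sketch of a single reference latent's shape for 256x256 images.
# f (downsampling factor) and d (feature width) are illustrative values only.
h, w = 256, 256
f, d = 16, 4

tokens = (h // f) * (w // f)   # spatial positions after downsampling
channels = d * f * f           # d * f^2 features per position

latent_shape = (tokens, channels)
print(latent_shape)  # (256, 1024)
```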

The autoregressive factorization, central to MRAR, is formalized as:

$$p(x_{\text{img}_0}, \dots, x_{\text{img}_{K-1}} \mid C) = \prod_{i=0}^{K-1} p(x_{\text{img}_i} \mid C, x_{\text{img}_0}, \dots, x_{\text{img}_{i-1}})$$

In MRAR, each conditional $p(x_{\text{img}_i} \mid C, x_{\text{img}_0}, \dots, x_{\text{img}_{i-1}})$ is realized by composing an AR Transformer for contextual encoding with a diffusion decoder:

  • $x_{\text{img}_i} \sim \text{DiffusionDecoder}(c_i, \epsilon)$, where
  • $c_i = \text{ARTransformer}([C; \text{Mask}; x_{\text{img}_0}; \dots; x_{\text{img}_{i-1}}])$.

At inference, the input includes all $i$ prior latents as references and a "Mask" embedding standing in for the unknown target.

2. System Architecture

Input Embeddings

  • Class embedding: $C \in \mathbb{R}^{64 \times d}$.
  • Mask embedding: $\text{Mask} \in \mathbb{R}^{(h/f \times w/f) \times (d \cdot f^2)}$.
  • Reference latents: $x_{\text{img}_j} \in \mathbb{R}^{(h/f \times w/f) \times (d \cdot f^2)}$ for $j < i$.

AR Transformer

  • Stack Depth / Width: 24–40 layers, width 768–1280.
  • Attention Mask: Causal (lower-triangular) over $[C, \text{Mask}, x_{\text{img}_0}, \dots, x_{\text{img}_{i-1}}]$.
  • Output: $c_i$, used for diffusion conditioning.

Diffusion Decoder

  • DiT-style Transformer: Trained with rectified-flow (Flow Matching) objectives.
  • Inputs: Concatenated $[c_0; \dots; c_{k-1}]$ and Gaussian noise $\epsilon$.
  • Sampling: Single-step or few-step Euler–Maruyama with two learned scale factors ($s_1$, $s_2$).
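Few-step Euler integration of the rectified-flow ODE can be sketched as follows; the constant velocity field, the step count, and the omission of the learned scale factors $s_1$, $s_2$ are simplifying assumptions for illustration:

```python
import numpy as np

def euler_sample(velocity, x_noise, num_steps=4):
    """Few-step Euler integration of a rectified-flow ODE, from noise (t=1)
    toward data (t=0).

    `velocity` approximates psi_theta(x_t, t) ~ (eps - x); integrating
    dx/dt = velocity as t goes 1 -> 0 moves x from noise toward data.
    """
    x = x_noise
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity(x, t_cur)
        x = x + (t_next - t_cur) * v   # t_next < t_cur, so the step is negative
    return x

# Toy check: with straight-line paths x_t = (1-t)x + t*eps, the true velocity
# field is the constant (eps - x), so a single Euler step is exact.
x_data = np.array([1.0, -2.0])
eps = np.array([0.5, 0.5])
recovered = euler_sample(lambda x, t: eps - x_data, x_noise=eps, num_steps=1)
print(np.allclose(recovered, x_data))  # True
```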

Variants

  • 1-Step AR: Feeds only $[C; \text{Mask}]$ with bidirectional attention—no past images.
  • MRAR: Concatenates and causally attends over all prior image latents, providing richer cross-iteration context.
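The two attention patterns can be sketched as boolean masks; for illustration, $C$, Mask, and each reference latent occupy one position each, whereas in the real model each spans many tokens:

```python
import numpy as np

def mrar_mask(num_refs):
    """Causal (lower-triangular) mask over [C; Mask; x_0; ...; x_{num_refs-1}].

    One position per element for illustration only. True = attention allowed.
    """
    n = 2 + num_refs  # C and Mask, then the reference latents
    return np.tril(np.ones((n, n), dtype=bool))

def one_step_ar_mask():
    """Bidirectional attention over [C; Mask] only -- no past images."""
    return np.ones((2, 2), dtype=bool)

print(mrar_mask(num_refs=2).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```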

3. Training Objectives and Loss Functions

AR Feature Loss

A simple feature reconstruction loss is used for each $x_{\text{img}}$ token:

$$L_{\text{AR}} = \frac{1}{N}\sum_{n=0}^{N-1} LF\!\left(x_n, \text{ARTransformer}(x_0, \dots, x_{n-1})\right)$$
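A minimal sketch of this averaged feature loss, taking $LF$ to be mean-squared error (an illustrative choice; the paper leaves the exact feature loss $LF$ abstract):

```python
import numpy as np

def ar_feature_loss(targets, predictions):
    """Mean feature-reconstruction loss over N token positions.

    LF is assumed to be MSE here for illustration; targets[n] is x_n and
    predictions[n] stands in for ARTransformer(x_0, ..., x_{n-1}).
    """
    n = len(targets)
    return sum(np.mean((t - p) ** 2) for t, p in zip(targets, predictions)) / n

targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
preds   = [np.array([1.0, 0.0]), np.array([0.0, 0.0])]
print(ar_feature_loss(targets, preds))  # 0.25
```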

Diffusion (Flow-Matching) Loss

Rectified-flow loss is adopted: for $x \sim p_{\text{data}}$, $\epsilon \sim \mathcal{N}(0, I)$, $t \sim \text{Uniform}(0,1)$, with straight-line paths $x^t = (1-t)x + t\epsilon$:

$$L_{\text{diff}} = \mathbb{E}_{t, x, \epsilon, C} \left\| (\epsilon - x) - \psi_\theta(x^t, t, c) \right\|^2$$
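A per-sample sketch of this objective (a single Monte Carlo term rather than the full expectation, with conditioning folded into the velocity function for brevity):

```python
import numpy as np

def flow_matching_loss(psi, x, eps, t):
    """Rectified-flow loss ||(eps - x) - psi(x_t, t)||^2 for one sample.

    x_t follows the straight-line path x_t = (1 - t) * x + t * eps.
    """
    x_t = (1.0 - t) * x + t * eps
    target = eps - x
    return float(np.sum((target - psi(x_t, t)) ** 2))

# With the exact velocity field, the loss is zero for any x, eps, t.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
eps = rng.normal(size=4)
exact_psi = lambda x_t, t: eps - x
print(flow_matching_loss(exact_psi, x, eps, t=0.3))  # 0.0
```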

Joint Loss (1-Step AR regime)

$$L_{\text{total}} = L_{\text{AR}} + L_{\text{diff}}$$

MRAR Fine-Tuning

No additional loss terms are introduced for MRAR; training continues with the same $L_{\text{diff}}$, but with richer conditioning $c$ due to multiple references in the AR Transformer.

4. Generation Algorithm

The MRAR inference process is as follows:

```
Inputs: class token C, mask embedding Mask_img, number of steps n
Outputs: generated image R_img

Initialize empty list X_imgs = []
For i in 0 … n:
    if i == 0:
        input_seq = [C; Mask_img]
    else:
        input_seq = [C; Mask_img; X_imgs[0]; …; X_imgs[i-1]]
    c_i = ARTransformer(input_seq)     # output features
    X_imgs.append(c_i)                 # treated as a "latent image"

c = Concat([c_0; c_1; …; c_n])
O = DiffusionDecoder(c, ε)             # O = [o_0; …; o_n]
R_img = o_n                            # final generated image
return R_img
```

This regime contrasts with single-step AR (which lacks multiple references), enabling explicit causal conditioning and richer intermediate context.
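The inference procedure can be sketched in runnable form with stub models; `ar_transformer` and `diffusion_decoder` below are shape-compatible placeholders for the trained networks, not the paper's implementation:

```python
import numpy as np

def generate_mrar(ar_transformer, diffusion_decoder, class_emb, mask_emb,
                  n_steps, rng):
    """MRAR inference sketch: each step conditions on all prior AR features."""
    x_imgs = []
    for _ in range(n_steps):
        # [C; Mask; x_0; ...; x_{i-1}] -- the reference list grows each step
        input_seq = [class_emb, mask_emb] + x_imgs
        c_i = ar_transformer(input_seq)
        x_imgs.append(c_i)  # treated as a "latent image" reference

    c = np.concatenate(x_imgs, axis=0)          # [c_0; ...; c_{n-1}]
    noise = rng.normal(size=c.shape)
    outputs = diffusion_decoder(c, noise)       # [o_0; ...; o_{n-1}]
    return outputs[-1]                          # final latent = generated image

# Toy stubs with matching shapes (illustrative only).
rng = np.random.default_rng(1)
d = 8
ar = lambda seq: np.stack(seq).mean(axis=0)     # (1, d) pooled features
dec = lambda c, eps: c + 0.0 * eps              # identity "decoder"
img = generate_mrar(ar, dec, rng.normal(size=(1, d)), rng.normal(size=(1, d)),
                    n_steps=3, rng=rng)
print(img.shape)  # (8,)
```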

5. Quantitative and Qualitative Results

Model Performance Table

| Model | FID ↓ | IS ↑ | Diversity Measure ↓ | Time ↓ |
|---|---|---|---|---|
| DiT-XL/2 (Diffusion only) | 2.27 | 278.2 | — | 45 s |
| RAR-L (AR only) | 1.70 | 290.5 | — | 15 s |
| TransDiff-L, 1-Step AR (Hybrid) | 1.69 | 282.0 | 0.44 | 0.2 s |
| TransDiff-L, MRAR (Hybrid+MRAR) | 1.49 | 282.2 | 0.39 | 0.8 s |
| TransDiff-H, MRAR (Hybrid+MRAR) | 1.42 | 301.2 | — | 1.6 s |
  • Diversity Measure: $L_1$ norm of the cosine similarity matrix; lower implies greater output diversity.
  • MRAR reduces FID by 0.20 compared to 1-Step AR and by 1.29 over Scale-Level AR.
  • Human evaluations: 60 raters judged 200 images; MRAR outperformed Token-AR and Scale-AR in subject diversity, background diversity, and overall visual quality.
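The diversity measure can be sketched as follows; the normalization (averaging over matrix entries) is an illustrative choice, and the paper's exact convention may differ:

```python
import numpy as np

def diversity_measure(features):
    """L1 norm of the pairwise cosine-similarity matrix, averaged over its
    entries. Lower implies more diverse feature vectors."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    return np.abs(sim).sum() / sim.size

same = np.ones((4, 3))   # identical feature vectors -> maximal similarity
ortho = np.eye(3)        # mutually orthogonal vectors -> only the diagonal survives
print(diversity_measure(same))   # 1.0
print(diversity_measure(ortho))  # ~0.333
```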

Ablation and Diversity

  • As the number of references increases from $0 \rightarrow 4 \rightarrow 16 \rightarrow 64$, FID decreases until approximately 4 references, after which gains plateau.
  • Feature fusion by mixing semantic features from distinct classes yields outputs blending those semantics, demonstrating AR feature representations encode high-level class structure.

6. Mechanisms and Intuitions

MRAR enhances the AR-diffusion pipeline by allowing the Transformer to causally attend over a sequence of fully-decoded image latents, providing a comprehensive context for subsequent image generation. This approach yields:

  • Increased Diversity: AR features derived from multiple references promote diverse activations, as measured by reduced cosine similarity.
  • Improved Class Coverage: Multi-modal conditioning enables the model to span a wider portion of the class manifold.
  • Augmented Diffusion Decoding: Conditioning vectors $c$ aggregate multi-faceted semantic context, facilitating lower FID and higher IS. Empirical findings indicate that greater diversity in $c$ correlates with improved FID, establishing MRAR as a key factor in maximizing output diversity and quality.

7. Impact and Future Directions

The introduction of MRAR within TransDiff establishes a new paradigm for autoregressive image generation, demonstrating superior FID, semantic diversity, and inference speed compared to standard AR Transformer or diffusion-only architectures. MRAR's integration requires only a minimal architectural extension, suggesting applicability to other AR-diffusion hybrid models. Potential future research may explore scaling MRAR to even larger datasets, varying the depth of autoregressive references, and applying the approach to non-image modalities (Zhen et al., 11 Jun 2025).

