Multi-Reference Autoregression in Image Generation
- MRAR is an autoregressive conditioning scheme introduced in TransDiff, a framework that combines an AR Transformer with a diffusion model to generate high-quality, diverse images.
- It conditions its predictions on multiple previously generated image latents, enriching context and boosting synthesis performance.
- Empirical results show reduced FID and enhanced diversity on ImageNet benchmarks, highlighting its significance in advanced image generation.
Multi-Reference Autoregression (MRAR) is a paradigm introduced for autoregressive image generation within the TransDiff framework, which combines an autoregressive (AR) Transformer with a diffusion model. MRAR addresses the challenge of generating high-quality, diverse images by allowing the model to condition its predictions on multiple previously generated image latents, thereby enriching contextual information for each generation step. Within the hybrid architecture of TransDiff, MRAR has demonstrated significant improvements in both quantitative and qualitative metrics for image synthesis, particularly on the ImageNet 256×256 benchmark (Zhen et al., 11 Jun 2025).
1. Problem Statement and Formalism
Given a discrete class label $y$ (or global condition) and a sequence of previously generated image latents $x_1, \dots, x_{i-1}$, the objective at generation step $i$ is to autoregressively produce the next image latent $x_i$. Each reference latent resides in $\mathbb{R}^{(H/f) \times (W/f) \times D}$, with $H$, $W$ denoting image height and width, $f$ the downsampling factor, and $D$ the feature width.
The autoregressive factorization, central to MRAR, is formalized as:

$$p(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}, y)$$
In MRAR, each conditional is realized by composing an AR Transformer for contextual encoding, followed by a diffusion decoder:
- $c_i = \mathrm{ARTransformer}(y, x_1, \dots, x_{i-1})$, where $c_i$ is the contextual feature for step $i$
- $x_i = \mathrm{DiffusionDecoder}(c_i, \varepsilon)$, with $\varepsilon \sim \mathcal{N}(0, I)$.
At inference, the input sequence contains all prior latents as references plus a "Mask" embedding standing in for the unknown target.
2. System Architecture
Input Embeddings
- Class embedding: a learned embedding of the class label $y$, prepended to the sequence.
- Mask embedding: a learned placeholder token standing in for the unknown target latent.
- Reference latents: the previously generated latents $x_1, \dots, x_{i-1}$.
AR Transformer
- Stack Depth / Width: 24–40 layers, width 768–1280.
- Attention Mask: Causal (lower-triangular) over the sequence $[\text{class}; \text{mask}; x_1; \dots; x_{i-1}]$.
- Output: contextual features $c_i$ used for diffusion conditioning.
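The causal mask described above can be sketched in a few lines; the sequence layout `[class; mask; x_1; …]` is taken from the text, and the boolean-matrix representation is one common convention, not necessarily the exact one used in TransDiff:

```python
# Minimal sketch of a causal (lower-triangular) attention mask: query
# position q may attend only to key positions k <= q. Row layout follows
# the assumed sequence [class; mask; x_1; ...; x_{i-1}].

def causal_mask(seq_len):
    # mask[q][k] is True where attention from position q to k is allowed.
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

mask = causal_mask(4)  # e.g. a 4-token sequence [C; Mask; x_1; x_2]
```

The first token attends only to itself, while the last token sees the full prefix, which is what lets later generation steps condition on every prior latent.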
Diffusion Decoder
- DiT-style Transformer: Trained with rectified-flow (Flow Matching) objectives.
- Inputs: Concatenated conditioning features $c$ and Gaussian noise $\varepsilon \sim \mathcal{N}(0, I)$.
- Sampling: Single-step or few-step Euler–Maruyama with two learned scale factors.
Variants
- 1-Step AR: Feeds only the class and mask embeddings, with bidirectional attention and no past-image references.
- MRAR: Concatenates and causally attends over all prior image latents, providing richer cross-iteration context.
3. Training Objectives and Loss Functions
AR Feature Loss
A simple per-token feature reconstruction loss is used, penalizing the discrepancy between each predicted feature and its target (e.g., an $\ell_2$ distance).
Diffusion (Flow-Matching) Loss
Rectified-flow loss is adopted: for $t \sim \mathcal{U}(0, 1)$, data latent $x_0$, and noise $\varepsilon \sim \mathcal{N}(0, I)$, with straight-line paths $x_t = (1 - t)\,x_0 + t\,\varepsilon$:

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, x_0,\, \varepsilon} \left[ \lVert v_\theta(x_t, t, c) - (\varepsilon - x_0) \rVert_2^2 \right]$$
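The flow-matching objective can be sketched with scalar stand-ins; the "model" here is a lambda, not the TransDiff decoder:

```python
# Rectified-flow (flow-matching) training target: form x_t on the straight
# line between data x0 and noise eps, then regress the model's velocity
# prediction onto the constant path velocity (eps - x0).

def flow_matching_loss(model, x0, eps, t):
    x_t = (1.0 - t) * x0 + t * eps    # straight-line interpolation
    target = eps - x0                 # constant velocity of the path
    return (model(x_t, t) - target) ** 2

x0, eps = 0.5, -1.0
oracle = lambda x_t, t: eps - x0      # perfect predictor: zero loss
loss = flow_matching_loss(oracle, x0, eps, t=0.3)
```

In training, the model would also receive the conditioning features $c$ from the AR Transformer; they are omitted here for brevity.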
Joint Loss (1-Step AR regime)
In the 1-Step AR regime, training combines the AR feature loss and the flow-matching loss.
MRAR Fine-Tuning
No additional loss terms are introduced for MRAR; training continues with the same objective, but the conditioning features $c_i$ become richer due to the multiple references seen by the AR Transformer.
4. Generation Algorithm
The MRAR inference process is as follows:
```
Inputs:  class token C, mask embedding Mask_img, number of steps n
Outputs: generated image R_img

Initialize empty list X_imgs = []
For i in 0…n:
    if i == 0:
        input_seq = [C; Mask_img]
    else:
        input_seq = [C; Mask_img; X_imgs[0]; …; X_imgs[i-1]]
    c_i = ARTransformer(input_seq)   # output features
    X_imgs.append(c_i)               # treated as "latent image"
c = Concat([c_0; c_1; …; c_n])
O = DiffusionDecoder(c, ε)           # O = [o_0; …; o_n]
R_img = o_n                          # final generated image
return R_img
```
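The generation loop can be rendered as a minimal executable sketch; `ar_transformer` and `diffusion_decoder` below are arithmetic stand-in stubs (the real components are neural networks), so only the control flow matches the algorithm:

```python
# Executable rendering of the MRAR generation loop with stub models.

def ar_transformer(input_seq):
    # Stub: summarize the whole sequence into one context feature.
    return sum(input_seq)

def diffusion_decoder(c, eps):
    # Stub: "denoise" each conditioning feature independently.
    return [ci + eps for ci in c]

def generate(C, mask_img, n, eps=0.0):
    X_imgs = []
    for i in range(n + 1):
        input_seq = [C, mask_img] + X_imgs   # class, mask, all prior latents
        c_i = ar_transformer(input_seq)      # AR features for this step
        X_imgs.append(c_i)                   # treated as a latent image
    O = diffusion_decoder(X_imgs, eps)       # decode all steps jointly
    return O[-1]                             # final generated image

img = generate(C=1, mask_img=0, n=2)
```

Note how each iteration's input sequence grows by one latent, which is exactly the multi-reference context that distinguishes MRAR from 1-Step AR.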
This regime contrasts with single-step AR (which lacks multiple references), enabling explicit causal conditioning and richer intermediate context.
5. Quantitative and Qualitative Results
Model Performance Table
| Model | FID ↓ | IS ↑ | Diversity Measure ↓ | Time ↓ |
|---|---|---|---|---|
| DiT-XL/2 (Diffusion only) | 2.27 | 278.2 | — | 45s |
| RAR-L (AR only) | 1.70 | 290.5 | — | 15s |
| TransDiff-L, 1-Step AR (Hybrid) | 1.69 | 282.0 | 0.44 | 0.2s |
| TransDiff-L, MRAR (Hybrid+MRAR) | 1.49 | 282.2 | 0.39 | 0.8s |
| TransDiff-H, MRAR (Hybrid+MRAR) | 1.42 | 301.2 | — | 1.6s |
- Diversity Measure: a norm of the pairwise cosine-similarity matrix of generated features; lower implies greater output diversity.
- MRAR reduces FID by 0.20 compared to 1-Step AR and by 1.29 over Scale-Level AR.
- Human evaluations: 60 raters judged 200 images; MRAR outperformed Token-AR and Scale-AR in subject diversity, background diversity, and overall visual quality.
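The diversity measure can be sketched as follows; the Frobenius norm used here is an assumption, since the source does not name which norm is taken over the cosine-similarity matrix:

```python
import math

# Diversity measure sketch: norm of the pairwise cosine-similarity matrix
# of generated feature vectors. Lower score = more diverse outputs.
# (Frobenius norm is an assumption, not confirmed by the source.)

def diversity_score(features):
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    sims = [cos(a, b) for a in features for b in features]
    return math.sqrt(sum(s * s for s in sims))   # Frobenius norm

identical  = [(1.0, 0.0), (1.0, 0.0)]   # collapsed outputs: high score
orthogonal = [(1.0, 0.0), (0.0, 1.0)]   # maximally diverse pair: low score
```

Collapsed outputs score 2.0 here, versus √2 for an orthogonal pair, matching the "lower is more diverse" reading of the table.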
Ablation and Diversity
- As the number of references increases, FID decreases until approximately four references, after which gains plateau.
- Feature fusion by mixing semantic features from distinct classes yields outputs blending those semantics, demonstrating AR feature representations encode high-level class structure.
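The feature-fusion observation amounts to taking a convex combination of AR features from two classes; the vectors and mixing weight below are illustrative stand-ins, not values from the paper:

```python
# Feature-fusion sketch: linearly mixing semantic AR features from two
# distinct classes yields a conditioning vector between them, which the
# decoder renders as a semantic blend.

def fuse(c_a, c_b, alpha=0.5):
    # Convex combination of two semantic feature vectors.
    return [alpha * a + (1 - alpha) * b for a, b in zip(c_a, c_b)]

mixed = fuse([1.0, 0.0], [0.0, 1.0])   # blend of two class semantics
```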
6. Mechanisms and Intuitions
MRAR enhances the AR-diffusion pipeline by allowing the Transformer to causally attend over a sequence of fully-decoded image latents, providing a comprehensive context for subsequent image generation. This approach yields:
- Increased Diversity: AR features derived from multiple references promote diverse activations, as measured by reduced cosine similarity.
- Improved Class Coverage: Multi-modal conditioning enables the model to span a wider portion of the class manifold.
- Augmented Diffusion Decoding: Conditioning vectors aggregate multi-faceted semantic context, facilitating lower FID and higher IS. Empirical findings indicate that greater diversity in the AR features correlates with improved FID, establishing MRAR as a key factor in maximizing output diversity and quality.
7. Impact and Future Directions
The introduction of MRAR within TransDiff establishes a new paradigm for autoregressive image generation, demonstrating superior FID, semantic diversity, and inference speed compared to standard AR Transformer or diffusion-only architectures. MRAR's integration requires only a minimal architectural extension, suggesting applicability to other AR-diffusion hybrid models. Potential future research may explore scaling MRAR to even larger datasets, varying the depth of autoregressive references, and applying the approach to non-image modalities (Zhen et al., 11 Jun 2025).