
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Published 22 Jan 2026 in cs.CV | (2601.16208v1)

Abstract: Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this simplified framework, we conduct a controlled comparison of RAE against the state-of-the-art FLUX VAE across diffusion transformer scales from 0.5B to 9.8B parameters. RAEs consistently outperform VAEs during pretraining across all model scales. Further, during finetuning on high-quality datasets, VAE-based models catastrophically overfit after 64 epochs, while RAE models remain stable through 256 epochs and achieve consistently better performance. Across all experiments, RAE-based diffusion models demonstrate faster convergence and better generation quality, establishing RAEs as a simpler and stronger foundation than VAEs for large-scale T2I generation. Additionally, because both visual understanding and generation can operate in a shared representation space, the multimodal model can directly reason over generated latents, opening new possibilities for unified models.

Summary

  • The paper demonstrates that RAEs outperform VAEs by converging faster and delivering improved text-to-image generation performance.
  • The research employs high-dimensional latent spaces with frozen encoders and specialized decoder training across diverse datasets to boost model generalization.
  • The study reveals that scaling-induced architectural simplifications streamline diffusion transformers, paving the way for unified multimodal models.

The paper "Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders" (2601.16208) explores the implementation and advantages of Representation Autoencoders (RAEs) in the context of text-to-image (T2I) generation using diffusion transformers. This study primarily aims to investigate whether RAEs, which have shown promise in ImageNet-based diffusion modeling, can be successfully scaled for T2I tasks that encompass large-scale, diverse, and freeform data such as those found in web sources and text-rendering scenarios.

The research evaluates the relative performance of RAEs against traditional Variational Autoencoders (VAEs), explores the scalability of RAEs, and demonstrates modifications in model architecture and training protocols that facilitate improved performance in T2I tasks.

Methodology and Approach

The study builds on high-dimensional semantic latent spaces: a frozen representation encoder (SigLIP-2) produces the latents, and a decoder is trained to map them back to pixels. Keeping the encoder frozen preserves its semantic representations, while training the decoder on data beyond ImageNet improves domain adaptability.
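The frozen-encoder / trainable-decoder split can be illustrated with a minimal linear sketch. This is an illustrative toy, not the paper's implementation: the fixed random projection stands in for SigLIP-2, the plain gradient loop stands in for decoder training on a reconstruction loss, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 64-d "pixel" space and a 32-d latent space
# (real RAEs use image patches and a high-dimensional semantic latent).
D_PIX, D_LAT = 64, 32

# Frozen encoder: a fixed random projection standing in for SigLIP-2.
W_enc = rng.normal(size=(D_PIX, D_LAT)) / np.sqrt(D_PIX)
encode = lambda x: x @ W_enc  # frozen: never updated during training

# Trainable decoder: learned by gradient descent on a reconstruction
# loss, mirroring the RAE setup where only the decoder is trained.
W_dec = np.zeros((D_LAT, D_PIX))
lr = 0.1
X = rng.normal(size=(512, D_PIX))  # stand-in "training images"
for _ in range(200):
    Z = encode(X)
    X_hat = Z @ W_dec
    grad = Z.T @ (X_hat - X) / len(X)  # dL/dW for 0.5*||X_hat - X||^2
    W_dec -= lr * grad

err = float(np.mean((encode(X) @ W_dec - X) ** 2))
```

Because the encoder is lossy (64 dimensions compressed to 32), the decoder can only recover the component of the input that the frozen encoder preserves; the reconstruction error drops well below the no-decoder baseline but not to zero.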

**The research follows key steps:**

  • Decoder Training Beyond ImageNet: Expanding the decoder's training data to web, synthetic, and text-rendering sources improves general fidelity, with targeted data composition proving essential for specific domains such as rendered text.
  • Simplification at Scale: Scaling inherently simplifies the framework, rendering architectural complexities from the original ImageNet recipe (wide diffusion heads, noise-augmented decoding) largely redundant, while dimension-dependent noise scheduling remains critical.
  • Pretraining and Finetuning: A controlled comparison between RAEs and VAEs across diffusion transformer scales from 0.5B to 9.8B parameters shows that RAEs consistently outperform VAEs in convergence speed and final performance.

    Figure 1: RAE converges faster than VAE in text-to-image pretraining.
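The one component the study finds indispensable at scale is dimension-dependent noise scheduling. The exact schedule is not specified in this summary; a common way to implement such a shift (an assumption here, modeled on the resolution-dependent timestep shifts used in flow-matching models) is to push timesteps toward higher noise as the latent dimensionality grows:

```python
import math

def shifted_timestep(t: float, dim: int, base_dim: int = 16) -> float:
    """Shift a timestep t in [0, 1] toward higher noise for
    higher-dimensional latents.

    The sqrt(dim / base_dim) shift factor is a hypothetical choice,
    borrowed from resolution-dependent schedules; base_dim is the
    reference dimensionality at which no shift is applied.
    """
    s = math.sqrt(dim / base_dim)
    return s * t / (1 + (s - 1) * t)
```

At `dim == base_dim` the schedule is unchanged; larger latent dimensions map every intermediate timestep to a noisier one while keeping the endpoints 0 and 1 fixed.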

Key Results and Findings

The empirical evaluation within the research reveals several crucial insights and outcomes:

  • Convergence and Performance: RAEs converge substantially faster than their VAE counterparts, reaching better GenEval and DPG-Bench scores at matched training budgets.
  • Data Generalization: By including web and synthetic data, the RAE models exhibit improved capacity to reconstruct text and produce high-quality image outputs beyond the performance of models trained solely on ImageNet.
  • Architectural Simplifications: With scaling, previously important architectural components such as wide diffusion heads and noise-augmented decoding provide diminishing returns, suggesting they were more crucial for smaller-scale applications than large-scale T2I.
  • Robustness to Overfitting: Whereas VAE-based models catastrophically overfit after 64 epochs of finetuning on high-quality data, RAE-based models remain stable through 256 epochs and sustain consistently better performance.

    Figure 2: RAE decoders trained on more data generalize across domains, improving text reconstruction fidelity.

Implications for Unified Models

The shared representation space in RAEs supports both understanding and generation, allowing a multimodal model to reason directly over generated latents without first mapping them back to pixels. This opens the door to unified models in which a single architecture handles both tasks.

Figure 3: Test-time scaling in latent space, showcasing the RAE framework's capability to evaluate and select generation results within the latent space.
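The test-time scaling idea of Figure 3 amounts to best-of-n selection carried out entirely in latent space. The sketch below is a schematic with toy stand-ins: `generate` and `score` are hypothetical callables (in the paper's setting the scorer would share the RAE's representation space), and the 4-d vectors stand in for generated latents.

```python
import random

random.seed(0)

def best_of_n(generate, score, n: int = 8):
    """Sample n candidate latents and keep the highest-scoring one,
    without ever decoding to pixels."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: latents are 4-d vectors, and the "scorer" rewards
# alignment with a target direction (a stand-in for a reward model
# operating in the shared representation space).
target = [1.0, 0.0, 0.0, 0.0]
gen = lambda: [random.gauss(0, 1) for _ in range(4)]
score = lambda z: sum(a * b for a, b in zip(z, target))

best = best_of_n(gen, score, n=16)
```

Because understanding and generation share one representation space, the scoring step needs no decoder pass, which is what makes this kind of latent-space search cheap.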

Conclusion

Overall, this research positions RAEs as a robust and efficient alternative to traditional VAEs for large-scale text-to-image diffusion models. They offer faster convergence and better generation quality, and their semantically rich latent space opens new avenues for unified multimodal models. By discarding architectural components that become unnecessary at scale while retaining critical ones such as dimension-dependent noise scheduling, RAEs provide a simpler, more scalable, and more versatile foundation for T2I generation.
