Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

Published 11 Nov 2025 in cs.CV and cs.LG | (2511.08704v1)

Abstract: This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fr'echet Distance. First, optimal scaling strategy is critically task-dependent. At a fixed 32x32 resolution alone, the optimal scaling properties for image classification and image generation diverge, where generation optimal setup requires the data size grow three to five times faster than for the classification optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.

Abstract PDF Upgrade to Chat

Summary

The paper demonstrates that optimal next-pixel prediction in vision requires a significantly higher token-to-parameter ratio and compute resource, needing up to 10-20 times more data than language models.
It shows that image generation quality benefits from a steeper dataset growth than image classification, highlighting distinct scaling dependencies for different tasks.
It reveals that higher image resolutions demand accelerated model size growth relative to data volume, indicating that larger models become more effective than simply increasing data.

Scaling the Next-Pixel Prediction Paradigm: An Analytical Perspective

Introduction to Next-Pixel Prediction

The paper "Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?" (2511.08704) provides a comprehensive analysis of the scaling properties of autoregressive next-pixel prediction in vision models. By training Transformer models on 32 $\times$ 32 images, the study investigates the optimal scaling strategies across three target metrics: next-pixel prediction loss, ImageNet classification accuracy, and Fréchet Distance for image generation quality.

This analysis is motivated by contrasting the scaling trends of visual modeling with those of well-established LLMs. Learning from raw pixels poses unique challenges due to their low semantic content and complex spatial dependencies. Despite these challenges, this research identifies scaling trends consistent with those found in natural language processing.

Figure 1: Key scaling properties of next-pixel prediction, showing a higher optimal token-to-parameter ratio for pixels compared to language tokens.

Optimal Scaling and Its Dependencies

The research identifies critical dependencies of optimal scaling strategies on specific tasks. Remarkably, image generation quality requires a larger dataset growth rate than image classification, underscoring the divergent resource allocations necessary for distinct tasks within computer vision.

The results depict that training on raw pixels requires a significantly higher optimal token-to-parameter ratio than language token learning—up to 10-20 times more data. This necessitates a higher compute resource allocation to achieve parity with LLMs, delineating a key bottleneck in visual model scalability.

Figure 2: Predicted scaling properties of next-pixel prediction across key metrics and resolutions, emphasizing divergence in data and model scaling needs.

Effects of Image Resolution

The paper also explores how scaling strategies must adapt as image resolution increases. Detailed comparisons reveal that higher resolution demands an accelerated model size growth relative to data size. This shift indicates that, as image resolution increases, larger models become more effective than merely increasing data volume.

The complexity inherent to higher resolutions results in enriched data content per image, thereby modifying the balance between data and model scaling. This aligns with the intuition that high-resolution images encapsulate rich and intricate structures, necessitating more sophisticated model architectures.

Figure 3: Changes in optimal model and data scaling predictions versus FLOPs across different resolutions, highlighting the shift towards larger models at higher resolutions.

Feasibility of Next-Pixel Modeling

The projection within the study intimates that, while the computational demands are currently prohibitive, next-pixel prediction could reach competitive performance metrics in ImageNet classification and image generation within forthcoming years. This is contingent on expected annual compute growth rates of 4-5 times, signaling a future readiness for scaling raw pixel-based models.

The study thereby offers a forward-looking stance on the feasibility of achieving advanced capabilities in vision models using end-to-end raw pixel learning, predicated on continued growth in available computational resources.

Conclusion

This paper provides critical insights into the computational intricacies of scaling next-pixel prediction models. It underscores the remarkably higher data requirements relative to LLMs and proposes task-dependent, resolution-sensitive scaling strategies as a solution. The broader implication is a pivotal shift, identifying compute as a bottleneck rather than data, framing pixel-based autoregressive modeling as a viable frontier in visual generative model research. Future work can extend this foundation by exploring efficiency improvements and integrating alternative training objectives to harmonize performance across diverse visual tasks.