
Next-Scale Prediction Paradigm

Updated 20 January 2026
  • Next-Scale Prediction Paradigm is a generalization of autoregressive modeling that predicts entire representations at multiple resolutions using a coarse-to-fine factorization.
  • It employs dense intra-scale and causal inter-scale dependencies to overcome inefficiencies of next-token prediction in structured data like images, graphs, and 3D point clouds.
  • Applications span image and video generation, graph and point cloud synthesis, and efficient forecasting in LLMs, with significant speedup and fidelity improvements.

The next-scale prediction paradigm is a generalization of autoregressive modeling in which the model predicts entire representations at progressively finer scales rather than the next atomic token. This approach was introduced to address inefficiencies and limitations of standard next-token prediction in domains where data exhibit inherent multi-scale structure, such as images, graphs, 3D point sets, audio, and more. By leveraging multi-scale hierarchies, next-scale prediction enables efficient training and sampling, improved permutation invariance, and enhanced fidelity for both global and local structure.

1. Mathematical Foundations and General Formalism

At the core of next-scale prediction is the coarse-to-fine factorization of the joint data distribution. Let $\{r_1, r_2, \dots, r_K\}$ denote a sequence of representations at increasing spatial, semantic, or temporal resolutions ("scales"), with $r_K$ being the finest. The joint likelihood is factorized as

$$p(r_1, r_2, \ldots, r_K) = \prod_{k=1}^K p(r_k \mid r_{<k}),$$

where $r_{<k} = (r_1, \dots, r_{k-1})$ is the collection of all coarser-scale states. In practical implementations, each $r_k$ may correspond to an entire quantized image, a subgraph, a motif partition, an occupancy grid, or a point cloud at resolution $k$ (Tian et al., 2024, Meng et al., 7 Oct 2025, Belkadi et al., 30 Mar 2025).
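The factorization can be sketched numerically: assuming the model has already produced logits for every position of each scale map (conditioned on all coarser scales), the joint log-likelihood is just the sum of per-scale conditional terms. Names and shapes here are illustrative, not from any cited implementation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def joint_log_likelihood(per_scale_logits, per_scale_tokens):
    """log p(r_1, ..., r_K) = sum_k log p(r_k | r_<k).

    per_scale_logits[k]: (n_k, V) logits for the n_k tokens of scale k,
    already conditioned on all coarser scales r_<k by the model.
    per_scale_tokens[k]: (n_k,) ground-truth code indices for scale k.
    """
    total = 0.0
    for logits, tokens in zip(per_scale_logits, per_scale_tokens):
        lp = log_softmax(logits)                              # (n_k, V)
        total += lp[np.arange(len(tokens)), tokens].sum()     # pick true codes
    return total
```

With uniform logits over a vocabulary of size $V$, each token contributes $\log(1/V)$, so the total is just (number of tokens across all scales) times $\log(1/V)$.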

Within each scale, dense bidirectional attention is permitted among the elements (e.g., image patches or graph nodes) of $r_k$, while only unidirectional (causal) dependencies are imposed across scales. This design preserves structural locality and enables fast, permutation-invariant modeling of sets, grids, or other non-sequential data.
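The dense-intra-scale / causal-inter-scale design corresponds to a block-structured attention mask. A minimal sketch, assuming tokens are laid out coarse-to-fine:

```python
import numpy as np

def next_scale_attention_mask(scale_sizes):
    """Build a boolean attention mask for next-scale prediction.

    Tokens within the same scale attend to each other bidirectionally (dense),
    while across scales attention is causal: a token at scale k may attend to
    all tokens at scales <= k, never to finer scales.

    scale_sizes: token counts per scale, coarse to fine, e.g. [1, 4, 16].
    Returns an (N, N) mask with N = sum(scale_sizes); True = attention allowed.
    """
    scale_id = np.concatenate(
        [np.full(n, k) for k, n in enumerate(scale_sizes)]
    )
    # Query at scale k may attend to any key whose scale index is <= k.
    return scale_id[:, None] >= scale_id[None, :]
```

For `scale_sizes=[1, 2]`, the single coarse token sees only itself, while both fine tokens see each other and the coarse token, giving the block-lower-triangular pattern described above.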

2. Model Architectures and Tokenization Strategies

Visual and 3D Applications

In Visual AutoRegressive models (VAR), next-scale prediction is operationalized via a multi-scale vector-quantized VAE tokenizer that encodes the input into a hierarchy of discrete maps $\{r_1, \dots, r_K\}$ at resolutions $\{h_k \times w_k\}$ (Tian et al., 2024). The transformer decoder is conditioned on all coarser-scale codes, generating each finer-resolution map in parallel, with AdaLN or similar normalization.
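A toy sketch of the multi-scale residual quantization idea behind such tokenizers, with a nearest-neighbour codebook lookup standing in for a learned VQ-VAE; all names and shapes are illustrative, not VAR's exact procedure:

```python
import numpy as np

def multi_scale_quantize(feat, codebook, sizes):
    """Toy multi-scale residual quantization in the spirit of VAR's tokenizer.

    feat: (H, W, C) continuous latent map; codebook: (V, C) code vectors;
    sizes: side lengths per scale, coarse to fine, e.g. [1, 2, 4], H == sizes[-1].
    At each scale the residual is downsampled (mean pooling), quantized by
    nearest codebook entry, upsampled back (nearest neighbour), and subtracted.
    Returns one (s, s) index map per scale.
    """
    H, W, C = feat.shape
    residual, index_maps = feat.copy(), []
    for s in sizes:
        # Mean-pool the residual down to s x s.
        pooled = residual.reshape(s, H // s, s, W // s, C).mean(axis=(1, 3))
        # Nearest-neighbour codebook lookup per position.
        d = ((pooled[..., None, :] - codebook) ** 2).sum(-1)   # (s, s, V)
        idx = d.argmin(-1)
        index_maps.append(idx)
        # Upsample the quantized map and subtract it from the residual.
        up = np.repeat(np.repeat(codebook[idx], H // s, 0), W // s, 1)
        residual -= up
    return index_maps
```

Because each scale quantizes the residual left by coarser scales, the coarse maps capture global structure and the fine maps only need to encode what remains.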

For 3D point clouds and hypergraphs, a multi-resolution coarsening algorithm constructs a hierarchy of representations (e.g., Level-of-Detail (LoD) via farthest point sampling for point clouds), enabling each scale $X_k$ (or hypergraph $B^{(k)}$) to be generated by attending to the coarser structure (Meng et al., 7 Oct 2025, Gailhard et al., 2 Jun 2025).
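Farthest point sampling itself is straightforward; a minimal sketch of the greedy variant (seed point chosen arbitrarily) showing how an LoD hierarchy could be built:

```python
import numpy as np

def farthest_point_sampling(points, k, seed_idx=0):
    """Greedy farthest point sampling (FPS) for a Level-of-Detail hierarchy.

    Selects k points such that each new point maximizes its distance to the
    already-selected set, so coarser LoDs are prefixes of finer ones.
    points: (N, 3) array; returns indices of the k sampled points.
    """
    selected = [seed_idx]
    # Distance from every point to the nearest selected point so far.
    dist = np.linalg.norm(points - points[seed_idx], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())          # farthest from the selected set
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)

# A hypothetical 3-level LoD hierarchy over a point cloud `pc`:
# lods = [farthest_point_sampling(pc, m) for m in (64, 256, 1024)]
```

The prefix property is what makes FPS a natural fit here: the scale-$k$ point set is literally contained in the scale-$(k{+}1)$ set, matching the causal inter-scale conditioning.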

Graph and Language Domains

For graphs, as in MAG, the input is tokenized into a sequence of discrete adjacency/feature tensors at different scales via graph coarsening and quantization, with a decoder-only transformer autoregressively generating each map (Belkadi et al., 30 Mar 2025). In natural language, next-semantic-scale paradigms such as HDLM introduce a hierarchical vocabulary, diffusing tokens into increasingly coarser semantic representations before refining them back in the decoding process (Zhou et al., 8 Oct 2025).
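One coarsening step of the kind such pipelines rely on can be sketched as pooling the adjacency matrix under a node-to-cluster assignment; the clustering itself (spectral, motif-based, etc.) is left abstract, and this is an illustration rather than MAG's exact procedure:

```python
import numpy as np

def coarsen_adjacency(adj, assignment, n_clusters):
    """One graph-coarsening step used to build a scale hierarchy.

    adj: (N, N) adjacency matrix; assignment: (N,) cluster id per node.
    Returns the (n_clusters, n_clusters) coarse adjacency whose entry (i, j)
    sums the edge weight between clusters i and j (self-loops dropped).
    """
    C = np.zeros((len(adj), n_clusters))
    C[np.arange(len(adj)), assignment] = 1.0   # one-hot cluster membership
    coarse = C.T @ adj @ C                     # pool edge weights per pair
    np.fill_diagonal(coarse, 0.0)              # drop intra-cluster self-loops
    return coarse
```

Applied recursively, this yields a sequence of progressively smaller adjacency tensors, which are then quantized into the scale-wise token maps the decoder generates.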

3. Computational Efficiency and Scaling Properties

Next-scale prediction achieves marked improvements in both computational complexity and empirical inference time relative to both next-token autoregression and diffusion-based models.

  • Complexity comparison:
    • Next-token AR over $n \times n$ images: $O(n^6)$ (a flattened raster scan requires $n^2$ sequential steps with quadratic attention at each step).
    • Next-scale VAR: $O(n^4)$ (a logarithmic number of scale transitions, with each scale predicted in parallel).
    • Graphs: $O(N^3)$ for node-wise AR vs. $O(N^2 \log N)$ for MAG with $K = O(\log N)$ scales (Belkadi et al., 30 Mar 2025).
    • 3D point clouds: scale-wise parallel prediction preserves permutation invariance and avoids $O(N^2)$ step complexity.
    • Inference times: 10–1000× faster than diffusion and node-wise AR, e.g., 0.19 s for 50 graphs in MAG vs. 38 s for DiGress (Belkadi et al., 30 Mar 2025).
  • Scaling Laws: Next-scale prediction preserves clean power-law scaling for loss as a function of model size and compute, similar to those observed in LLMs:

    $$L_{\rm last} = (2.0 N)^{-0.23}, \quad L_{\rm avg} = (2.5 N)^{-0.20}$$

    with Pearson $\rho \approx -0.998$ on log–log axes (Tian et al., 2024). As model and compute scale up, inference remains practical and quality (FID/IS) improves predictably.
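Such power-law fits reduce to linear regression on log–log axes; a sketch that recovers the exponent and correlation from synthetic data generated by a VAR-style law (the data here is synthetic, not the paper's measurements):

```python
import numpy as np

def fit_power_law(N, L):
    """Fit L = (a * N)^b by linear regression on log-log axes.

    log L = b * log N + b * log a, so the slope is b and the intercept
    is b * log a. Returns (a, b) and the Pearson correlation of the
    log-log points.
    """
    x, y = np.log(N), np.log(L)
    b, c = np.polyfit(x, y, 1)          # slope, intercept
    a = np.exp(c / b)
    rho = np.corrcoef(x, y)[0, 1]
    return a, b, rho

# Synthetic check against a law of the form L = (2.0 N)^-0.23.
N = np.array([1e7, 1e8, 1e9, 2e9])
L = (2.0 * N) ** -0.23
a, b, rho = fit_power_law(N, L)
```

On noiseless synthetic data the fit recovers $a = 2.0$, $b = -0.23$, and $\rho = -1$ exactly; on real loss measurements the reported $\rho \approx -0.998$ indicates a nearly perfect log–log linear trend.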

4. Paradigm Extensions and Applications

The next-scale prediction paradigm has been instantiated and extended across a variety of domains, including image and video generation, graph and point cloud synthesis, audio, medical segmentation, and language modeling; representative results are summarized in Section 6.

5. Advantages, Limitations, and Theoretical Insights

Advantages:

  • Permutation invariance: Modeling is aligned with natural object structure (e.g., point sets, graphs, images) and avoids artifacts from artificial orderings.
  • Computational efficiency: Parallel prediction within scales, a logarithmic number of scales, and avoidance of diffusion's iterative denoising yield substantially higher throughput than token-level AR or diffusion baselines.
  • Hierarchical inductive bias: Coarse-to-fine schedules enable coordinated global structure capture before modeling fine details, simplifying learning.
  • Zero-shot generalization: Fully autoregressive at the map or scale level, next-scale models readily support in/out-painting, extension, and editing without retraining (Tian et al., 2024).
  • Power-law scalability: Empirically observed robust scaling laws enable effective extrapolation of model performance under compute constraints (Tian et al., 2024, Yao et al., 2023, Yan et al., 11 Nov 2025).

Limitations:

  • Tokenization and codebook learning: Requires robust and expressive multi-scale quantization (e.g., VQ-VAE, motif mining), which can be resource-intensive for large or irregular data types (Meng et al., 7 Oct 2025, Gailhard et al., 2 Jun 2025).
  • Parameter choices: The number of scales, codebook size, and latent channel dimension must be tuned to optimize the tradeoff between efficiency and reconstruction fidelity (Belkadi et al., 30 Mar 2025).
  • Exposure bias and error accumulation: While some implementations, such as xAR and Markov-VAR, tackle exposure bias, naive teacher-forcing can still lead to error propagation, particularly in deep hierarchies (Ren et al., 27 Feb 2025, Zhang et al., 28 Nov 2025).
  • Applicability factors: The approach assumes a natural multi-scale decomposition is available or can be learned. For domains lacking such structure, benefits may be limited.

6. Empirical Outcomes and Comparative Performance

| Domain | Model/Paradigm | Quality Metric | Speedup/Advantage |
|---|---|---|---|
| Image generation | VAR (Tian et al., 2024) | FID = 1.97 @ 2B params | ~20× faster than diffusion, SOTA FID |
| Visual generation | Markov-VAR (Zhang et al., 28 Nov 2025) | FID ↓10.5% vs VAR | Peak memory ↓83.8% at 1024×1024 |
| Graph generation | MAG (Belkadi et al., 30 Mar 2025) | High MMD matching | Up to ~1000× faster |
| Point cloud generation | PointNSP (Meng et al., 7 Oct 2025) | SOTA CD/EMD | 5–10× faster than diffusion at 8k–80k points |
| Video generation | VideoAR (Ji et al., 9 Jan 2026) | gFVD = 88.6 | ~13× faster than prior AR, matches diffusion |
| Medical segmentation | AR-Seg (Chen et al., 28 Feb 2025) | SOTA on mmSeg datasets | Explicit coarse-to-fine visualization |
| Denoising/SR | NSP (Shan et al., 24 Dec 2025) | SOTA PSNR/SSIM | Unifies denoising and SR without retraining |
| Language modeling | HDLM (Zhou et al., 8 Oct 2025) | Valid. PPL 19.22 | Outperforms MDLM/GIDD+, matches AR baseline |

All results as reported in primary sources referenced above.

7. Outlook and Generalizations

The next-scale prediction paradigm is broadly extensible to any domain with inherent multiscale structure. Variants have been instantiated with discrete and continuous tokenizations, flow-matching and diffusion objectives, consensus aggregation for robustness, and both causal and Markov attention across scale transitions; extending these variants to new structured domains and objectives remains a promising direction.

Next-scale prediction thus unifies and generalizes autoregressive modeling, offering an efficient, permutation-aligned, and scalable approach for structured data generation, denoising, and beyond.
