Compute-Optimal Scaling Laws
- Compute-optimal scaling laws are quantitative formulas that prescribe the best allocation of compute among model size, data, and training steps to minimize error.
- The methodology leverages power-law relationships and Pareto optimality to derive closed-form allocations for optimal neural network performance.
- Empirical results show that pretraining, fine-tuning, and adaptive settings each deviate from a single universal law, so optimal outcomes require task-specific compute allocations.
Compute-optimal scaling laws quantitatively describe how to allocate a fixed compute budget among neural network model size, training steps (or data), and other architectural and optimization degrees of freedom to minimize generalization error or, equivalently, achieve maximal model performance. These laws constrain the feasible trade-offs on the Pareto-efficient “frontier” of training strategies, providing explicit allocation exponents and prescriptions—usually in closed-form—that specify the optimal division of compute as models and datasets are scaled. Theoretical derivations and empirical validation in recent literature have established a core universal structure for these scaling laws, elucidated their task-/data-dependence, and identified important deviations in pretraining, fine-tuning, and adaptive or multi-task settings.
1. Theoretical Foundations: Power-Law Parametrization and Pareto Optimality
At the core of compute-optimal scaling laws is the empirical observation that generalization error or held-out loss can be decomposed into leading bottleneck terms, each decaying as a power law in a limiting resource (model width/parameter count $N$, data size $D$, or training time $T$):

$$L(N, D) \approx E + A N^{-\alpha} + B D^{-\beta},$$

or, more generally,

$$L \approx E + \sum_i A_i x_i^{-\alpha_i}, \qquad x_i \in \{N, D, T, \dots\},$$

where $\alpha$ is the parameter-limited exponent and $\beta$ the data- or time-limited exponent (Bordelon et al., 2024, Sengupta et al., 17 Feb 2025, Roberts et al., 13 Mar 2025). Setting a fixed compute budget, often $C \propto N D$ or $C \propto N T$, the problem reduces to a constrained minimization yielding allocation exponents for optimal $N$ and $D$ (or $T$):

$$N^* \propto C^{\beta/(\alpha+\beta)}, \qquad D^* \propto C^{\alpha/(\alpha+\beta)}.$$

This yields a compute-optimal error scaling of the form

$$L^*(C) - E \propto C^{-\alpha\beta/(\alpha+\beta)}.$$
Analogous derivations hold for continuous architectural parameters and generalization to multiple task inputs (Bordelon et al., 2024, Sengupta et al., 17 Feb 2025).
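The constrained minimization above has a closed form that can be checked numerically; the sketch below assumes the budget factorization $C = N \cdot D$ and unit fit constants purely for illustration:

```python
import numpy as np

def allocation_exponents(alpha, beta):
    """Closed-form allocation for L = E + A*N**-alpha + B*D**-beta
    under a budget C = N*D: N* ~ C**a, D* ~ C**b, L* - E ~ C**-gamma."""
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    gamma = alpha * beta / (alpha + beta)
    return a, b, gamma

def optimal_N_numeric(alpha, beta, C, A=1.0, B=1.0):
    """Brute-force check of the closed form: grid-search N with D = C/N."""
    N = np.logspace(1, 11, 20001)
    loss = A * N**-alpha + B * (C / N)**-beta
    return N[np.argmin(loss)]

a, b, gamma = allocation_exponents(0.34, 0.28)  # Chinchilla-style exponents
```

Comparing `optimal_N_numeric` at two budgets recovers the analytic exponent $a$, confirming the power-law allocation.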
2. Universal Formulas, Exponents, and Their Task Dependence
Across modern deep learning, this framework is instantiated with problem-specific exponents and fit constants. For language modeling (“Chinchilla”-style), the typical empirical form—cross-entropy loss in terms of model parameters $N$ and training tokens $D$—is (Sengupta et al., 17 Feb 2025):

$$L(N, D) = E + A N^{-\alpha} + B D^{-\beta}, \qquad \alpha \approx 0.34, \; \beta \approx 0.28.$$

The compute-optimal model and data sizes are

$$N^*(C) \propto C^{a}, \qquad D^*(C) \propto C^{b}, \qquad a = \frac{\beta}{\alpha+\beta} \approx 0.45, \; b = \frac{\alpha}{\alpha+\beta} \approx 0.55,$$
with exponents set by the scaling laws. In some domains (e.g. reinforcement learning, vision, symbolic regression), exponents differ but the form is invariant (Neumann et al., 2022, Alabdulmohsin et al., 2023, Otte et al., 30 Oct 2025). Tabulated exponents across modalities:
| Domain | $\alpha$ (params) | $\beta$ (data) | $a$ | $b$ |
|---|---|---|---|---|
| LLMs (Chinchilla) | 0.34 | 0.28 | 0.45 | 0.55 |
| Vision (ViT) | ~0.20–0.45 | ~0.22–0.60 | varies per dimension | varies per dimension |
| RL (AlphaZero) | 0.88 | 0.55 | 0.62 | 0.38 (implied) |
| SR (Symbolic Reg.) | — | — | ~0.20 (loss) | ~0.36 (solved rate) |
These exponents determine whether the optimum is “capacity-hungry” (large $a$, more parameters) or “data-hungry” (large $b$, more data/steps). Task-specific skill groupings (as in code-generation vs. knowledge QA) yield systematically shifted frontiers (Roberts et al., 13 Mar 2025).
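To make the capacity-hungry vs. data-hungry distinction concrete, the sketch below compares two equal-compute training plans under a Chinchilla-style parametric loss; the constants are the published Hoffmann et al. (2022) fit and should be treated as illustrative, not definitive:

```python
# Chinchilla-style parametric loss; constants are the published fit of
# Hoffmann et al. (2022), used here purely for illustration.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A * N**-alpha + B * D**-beta

C = 1e23                                  # FLOP budget, using C ~ 6*N*D
for name, N in {"70B params": 70e9, "30B params": 30e9}.items():
    D = C / (6 * N)                       # tokens affordable at this size
    print(f"{name}: D = {D:.2e} tokens, predicted loss = {loss(N, D):.3f}")
```

At this budget the smaller, longer-trained plan is predicted to reach lower loss, illustrating why naive parameter scaling overshoots the frontier.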
3. Data, Architecture, and Skill-Dependent Deviations
Compute-optimal frontiers are sensitive to task, data distribution, and model architecture:
- Data complexity: As training data becomes harder to compress (higher gzip-entropy), both $\alpha$ and $\beta$ decrease, but $\beta$ (the data exponent) falls faster, shifting the frontier toward favoring more data rather than parameters (Pandey, 2024).
- Skill specificity: Code-generation tasks display larger data-allocation exponents than text QA, requiring more data at fixed compute, whereas knowledge QA is more parameter-limited (Roberts et al., 13 Mar 2025).
- Mixture-of-experts, retrieval-augmentation, and multitask models: These architectures may not obey the classical exponents, displaying sublinear expert-count returns or shifting optimal token/parameter ratios (Sengupta et al., 17 Feb 2025).
- Validation set composition: Compute-optimal recommendations can shift by up to 50% depending on whether the validation set reflects the mix of downstream skills (Roberts et al., 13 Mar 2025).
These findings indicate that any “universal” scaling law must be further conditioned on dataset complexity, skill composition, and architecture class (Sengupta et al., 17 Feb 2025, Pandey, 2024).
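The dataset-complexity conditioning above can be estimated cheaply; below is a sketch of a gzip-compressibility proxy in the spirit of Pandey (2024), where the compressed/raw ratio is an assumed stand-in for the paper's exact metric:

```python
import gzip
import random

def gzip_complexity(data: bytes) -> float:
    """Compressed-size / raw-size ratio as a crude data-complexity proxy:
    values near 1 indicate high-entropy, hard-to-compress data."""
    return len(gzip.compress(data)) / len(data)

rng = random.Random(0)
repetitive = b"the cat sat on the mat. " * 400                      # low complexity
noisy = bytes(rng.randrange(256) for _ in range(len(repetitive)))   # high complexity
```

A corpus scoring near 1 on this proxy would, per the finding above, argue for shifting the budget toward tokens rather than parameters.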
4. Adaptive and Dynamic Compute-Optimal Schedules
Moving beyond static allocation, compute-optimal laws have been generalized to adaptive schedules where model “shape” parameters (e.g. width, depth, patch size, context length) are increased during training (Anagnostidis et al., 2023, Alabdulmohsin et al., 2023):
- At each error threshold, select the architecture yielding the greatest marginal reduction in loss per additional compute, following the steepest local descent among all scaling laws.
- By piecewise following the lower envelope of per-shape scaling curves, the composite adaptive path achieves strictly lower loss or target error at a given compute—empirically saving 25–70% of training FLOP in practical scenarios.
- This principle is validated for Transformer width/depth, patch size, and context length.
Shape-adaptive compute-optimality extends to multimodal and multi-domain mixture ratios, where the optimal domain weights (mixture fractions) are solved via the scaling law parameterized loss surface (Shukor et al., 12 Jul 2025).
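The envelope-following rule above can be sketched numerically: given per-shape power-law fits (the curves below are made-up illustrations, not fitted values), pick at each compute level the shape whose predicted loss is lowest:

```python
import numpy as np

# Hypothetical per-shape scaling fits: loss_i(C) = a_i * C**-b_i + c_i.
shapes = {
    "small": (1e3, 0.20, 2.2),   # cheap early, worse asymptote
    "large": (1e5, 0.30, 1.8),   # expensive early, better asymptote
}
C = np.logspace(15, 23, 200)
losses = np.stack([a * C**-b + c for a, b, c in shapes.values()])
envelope = losses.min(axis=0)                          # composite adaptive path
best = np.array(list(shapes))[losses.argmin(axis=0)]   # which shape to run
```

`best` switches from the small to the large shape partway through the budget range: exactly the piecewise lower-envelope schedule described above.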
5. Optimization Procedures and Practical Model Design
The optimization of compute-optimal allocations is algebraically tractable. Consider minimizing

$$L(N, D) = E + A N^{-\alpha} + B D^{-\beta} \quad \text{subject to} \quad C = 6 N D.$$

This yields

$$N^* = G \left(\tfrac{C}{6}\right)^{\beta/(\alpha+\beta)}, \qquad D^* = \frac{1}{G} \left(\tfrac{C}{6}\right)^{\alpha/(\alpha+\beta)}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)},$$
with closed-form error at optimum. In the presence of additional constraints (such as time-to-train, batch size limits, or data caps), frontiers are further modified by the empirically fit scaling laws for optimal hyperparameters (batch size, learning rate, weight decay), which also scale as power laws in $N$ and $D$ (Bergsma et al., 19 May 2025, Porian et al., 2024). For high-variance, high-latency, or energy-constrained scenarios, additional optimization over inference compute versus training compute is required (Baniodeh et al., 9 Jun 2025).
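The closed form can be sketched directly; defaults below are the published Chinchilla fit constants (an illustrative assumption) together with the common approximation $C \approx 6ND$:

```python
def compute_optimal(C, A=406.4, B=410.7, alpha=0.34, beta=0.28, E=1.69):
    """Closed-form minimizer of E + A*N**-alpha + B*D**-beta s.t. C = 6*N*D.
    Default constants are the published Chinchilla fit; treat as illustrative."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N = G * (C / 6) ** (beta / (alpha + beta))   # N* ~ C**0.45
    D = C / (6 * N)                              # D* ~ C**0.55
    L = E + A * N**-alpha + B * D**-beta
    return N, D, L
```

Perturbing $N$ and $D$ along the budget constraint from the returned point only increases the predicted loss, confirming first-order optimality.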
Key recommendations:
- For language tasks, when $a \approx b \approx 0.5$ (as in Chinchilla), scale $N$ and $D$ roughly proportionally, on the order of 20 tokens per parameter; for code tasks, allocate more compute to data.
- In the presence of data scarcity, training for more epochs on existing data (up to roughly 4 epochs) is nearly as beneficial as collecting new data (Sengupta et al., 17 Feb 2025).
- For adaptive architecture schedules, initial pilot runs are essential to robustly fit shape-dependent exponents (Anagnostidis et al., 2023).
- Fine-tuning and transfer tasks demand modified “rectified” or multi-phase scaling laws, with pre-power (inefficient) and post-power (efficient) regimes (Sengupta et al., 17 Feb 2025).
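One simple way to encode the pre-power/post-power structure mentioned for fine-tuning is a piecewise law; the parameterization below is an illustrative assumption, not the cited papers' exact form:

```python
def rectified_loss(D, L0=3.0, D_t=1e8, beta=0.28):
    """Two-regime ("rectified") data-scaling sketch: below a transition
    scale D_t, transfer is inefficient and loss stays flat at L0; above
    it, loss decays as a power law. All constants are illustrative."""
    if D <= D_t:
        return L0                       # pre-power (inefficient) regime
    return L0 * (D / D_t) ** -beta      # post-power (efficient) regime
```

The practical consequence: compute spent below the transition scale $D_t$ buys essentially nothing, so fine-tuning budgets should clear it before the usual power-law accounting applies.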
6. Extensions, Limitations, and Phase Behavior
Compute-optimal scaling laws admit generalizations to multidimensional and multiphase regimes:
- The “4+3 phases” in random-feature models distinguish boundaries where scaling transitions from capacity-limited to feature-limited to optimizer-noise-limited, with each phase having distinct compute-optimal exponents (Paquette et al., 2024).
- In fully information-theoretic settings, optimal allocations depend on data latent complexity and input dimension, with higher complexity favoring greater parameter allocation (Jeon et al., 2022).
- Deviations from classical behavior arise in very high-complexity tasks, heterogeneous data distributions, and under heavy architectural bottlenecks, necessitating empirical re-fitting or theoretical extension.
- Diminishing returns are quantifiable; past a data/model-size knee, further compute yields negligible improvement (Liu et al., 6 Jan 2026).
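The diminishing-returns knee can be made precise for a single power-law fit; in the sketch below, the "gain per extra decade of compute" threshold is an assumed criterion, not the cited paper's definition:

```python
def decade_gain(C, k, gamma):
    """Loss improvement from one more decade of compute, for L = E + k*C**-gamma."""
    return k * C**-gamma * (1 - 10**-gamma)

def knee_compute(k, gamma, eps):
    """Smallest budget C at which an extra decade of compute improves loss
    by less than eps (solves decade_gain(C, k, gamma) = eps analytically)."""
    return (k * (1 - 10**-gamma) / eps) ** (1 / gamma)
```

Past `knee_compute(k, gamma, eps)`, each further decade of compute buys less than `eps` in loss, quantifying the "negligible improvement" regime above.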
The compute-optimal framework also forms the analytical foundation for optimal planning in emerging domains such as motion planning, power systems, symbolic regression, and generative reasoning evaluations, with appropriate adjustment of exponents based on observed scaling fits (Baniodeh et al., 9 Jun 2025, Liu et al., 6 Jan 2026, Otte et al., 30 Oct 2025, Schaeffer et al., 28 Sep 2025).
7. Empirical Procedures and Application Guidelines
Practical implementation of compute-optimal scaling laws involves the following standard steps:
- Select relevant task, skill, and data regime; fit the two-dimensional or multidimensional power-law scaling law to a sweep of small models/training budgets.
- Solve the analytic constrained minimization to determine the scaling exponents and closed-form allocations for model and data size.
- For data-dependent regimes, estimate dataset complexity (e.g., via gzip-compressibility) and adjust the scaling law coefficients and exponents accordingly (Pandey, 2024).
- For adaptive or multi-domain tasks, fit shape-/mixture-aware scaling laws and optimize either sequentially (per dimension) or jointly.
- Monitor for phase transitions or saturation points indicating deviation from power-law scaling; validate frontier predictions via pilot full-scale runs (Anagnostidis et al., 2023, Alabdulmohsin et al., 2023, Sengupta et al., 17 Feb 2025).
- Use scaling-law-derived hyperparameter recipes for batch size, learning rate, and weight decay, which themselves scale as power laws in $N$ and $D$ (Bergsma et al., 19 May 2025, Porian et al., 2024, Otte et al., 30 Oct 2025).
- Maintain alignment between the desired downstream skill mix and the composition of validation sets and data—misalignment can shift optimally allocated model size by tens of percent (Roberts et al., 13 Mar 2025).
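The first fitting step above reduces to log-log linear regression when the irreducible term $E$ is negligible or has been subtracted; a sketch on synthetic pilot measurements:

```python
import numpy as np

# Fit L = A * N**-alpha from a sweep of pilot runs via log-log least squares.
# (With a nonzero irreducible term E the fit needs nonlinear optimization;
# this sketch assumes E has been subtracted or is negligible.)
rng = np.random.default_rng(0)
N = np.logspace(6, 9, 8)                                     # pilot model sizes
L = 400.0 * N**-0.34 * np.exp(rng.normal(0, 0.01, N.size))   # noisy "measurements"

X = np.stack([np.ones_like(N), np.log(N)], axis=1)
coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
A_hat, alpha_hat = np.exp(coef[0]), -coef[1]
```

The recovered `alpha_hat` then feeds the analytic allocation formulas of the preceding sections.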
A summary table of canonical compute-optimal allocations in language modeling and vision:
| Scaling Law Type | Model Size ($N^* \propto$) | Data Size ($D^* \propto$) | Source |
|---|---|---|---|
| Chinchilla (LLM) | $C^{0.45}$ | $C^{0.55}$ | (Sengupta et al., 17 Feb 2025) |
| Vision Transformer | per-dimension (width) | per-dimension (depth, MLP) | (Alabdulmohsin et al., 2023) |
| RL (AlphaZero) | $C^{0.62}$ | — | (Neumann et al., 2022) |
| Symbolic Regression | $\sim 0.20$ (loss) | $\sim 0.36$ (solved rate) | (Otte et al., 30 Oct 2025) |
Compute-optimal scaling laws thus encode the quantitative structure underlying modern large-scale neural network training design, distilling empirical and theoretical understanding into closed-form, broadly applicable allocation rules, while accommodating critical deviations driven by task complexity, data heterogeneity, skill mixtures, and architecture. For further implementation details across specific domains and optimization regimes see (Bordelon et al., 2024, Sengupta et al., 17 Feb 2025, Porian et al., 2024, Pandey, 2024, Anagnostidis et al., 2023, Alabdulmohsin et al., 2023, Jeon et al., 2022, Roberts et al., 13 Mar 2025).