
Compute Optimal Curves: Scaling and Transfer Principles

Updated 16 January 2026
  • Compute Optimal Curves are methods that derive and validate scaling laws for transferring optimal hyperparameters from small proxy settings to large-scale models.
  • They ensure training dynamics remain nontrivial via maximal update parametrization, promoting robust feature learning and reduced tuning cost.
  • The framework extends to measure-theoretic MUD and structured optimization, supporting real-time adaptation and consistent performance across diverse architectures.

Compute Optimal Curves refer to methodologies and parametrizations designed to facilitate the direct computation or transfer of optimal hyperparameters, parameter estimates, or dynamic update rules as model scale or problem complexity increases. The concept extends across both deep learning—where optimal "training curves" (e.g., loss vs hyperparameter landscapes) should align across architectural variants under the right parameterization—and stochastic inverse problems—where the updated density constructed via maximal-update principles yields an optimal parameter estimate. The unifying theme is the derivation and empirical validation of principled scaling laws and transfer mechanisms, allowing optimal configurations identified in computationally tractable settings to be reliably applied in regimes where direct tuning or search is infeasible.

1. Maximal Update Parametrization in Deep Learning

Maximal Update Parametrization (μP) is a scaling prescription for initialization variances and per-parameter learning rates such that, as the model width or key complexity parameter grows, both forward activations and parameter updates remain $\Theta(1)$ at each layer, preventing vanishing or exploding signals. In standard (NTK-style) scaling, initialization is set to $1/\sqrt{N}$ with learning rate $O(1)$, which yields a "lazy" training regime dominated by kernel dynamics and little feature evolution. μP, by contrast, enables nontrivial feature learning at infinite width, and makes the optimal hyperparameters invariant under width scaling.
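The forward-pass half of this invariance can be checked numerically: under $1/\sqrt{\text{fan-in}}$ initialization, preactivation magnitudes stay $\Theta(1)$ no matter the width. The pure-Python sketch below is illustrative only (identity input vector, no training step), so it demonstrates the activation-scale claim but not the update dynamics:

```python
import math
import random

def activation_rms(width, seed=0):
    """RMS of preactivations h = W x for x = (1, ..., 1), with W_ij drawn
    i.i.d. from N(0, 1/width), i.e. standard deviation 1/sqrt(fan-in).
    Under this scaling each h_i is approximately N(0, 1) at every width."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(width)
    h = [sum(rng.gauss(0.0, std) for _ in range(width)) for _ in range(width)]
    return math.sqrt(sum(v * v for v in h) / width)

for width in (64, 256, 1024):
    print(width, round(activation_rms(width), 2))  # RMS stays near 1.0
```

Increasing the width by 16x leaves the activation scale essentially unchanged, which is the invariance that makes hyperparameter transfer across widths plausible in the first place.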

For Fourier Neural Operators (FNOs), μP prescribes that the initialization variance and learning rate for spectral kernels (parameterized by $K$ Fourier modes) scale as $1/\sqrt{d\log K}$, where $d$ is the PDE's spatial dimension. The resulting training dynamics for both small ($K_{\text{proxy}}$) and large ($K^*$) operators are invariant up to a known reparameterization, allowing the transfer of optimal hyperparameters from proxy to target regimes with no additional tuning (Li et al., 24 Jun 2025). This produces "compute-optimal curves" in the sense that the loss–hyperparameter landscape remains unchanged, guaranteeing that the hyperparameter optima identified on the proxy remain valid at scale.

2. Hyperparameter Transfer and the Compute-Optimal Recipe

The practicality of compute optimal curves is manifest in zero-shot hyperparameter transfer algorithms:

  • Tune on a proxy model with small $K$ (for FNOs, the number of Fourier modes) or small width.
  • Apply the derived scaling laws for learning rate and initialization variance, typically $\eta_0 \mapsto \eta_0\cdot\sqrt{\frac{\log K_{\text{proxy}}}{\log K^*}}$ and $\mathrm{Var}_0 \mapsto \mathrm{Var}_0\cdot\frac{1/\sqrt{d\log K^*}}{1/\sqrt{d\log K_{\text{proxy}}}}$.
  • Train once at scale, with transferred hyperparameters optimal for the large model.
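The scaling step above can be sketched as a small helper (function and argument names are illustrative, not from the paper's released code). Note that the spatial dimension $d$ cancels in the variance ratio, so both quantities shrink by the same factor:

```python
import math

def transfer_fno_hparams(eta_proxy, var_proxy, k_proxy, k_target):
    """Apply the muP transfer rule for FNOs (a sketch): the learning rate
    and the initialization variance are both rescaled by
    sqrt(log K_proxy / log K_target) as the number of Fourier modes grows.
    The spatial dimension d cancels in the variance ratio."""
    ratio = math.sqrt(math.log(k_proxy) / math.log(k_target))
    return eta_proxy * ratio, var_proxy * ratio

# Tune on a small proxy (e.g. K = 16 modes), then transfer to K = 256.
eta_big, var_big = transfer_fno_hparams(eta_proxy=1e-3, var_proxy=0.02,
                                        k_proxy=16, k_target=256)
print(eta_big, var_big)  # both scaled by sqrt(log 16 / log 256) ~ 0.707
```

After this rescaling the large model is trained exactly once, with no further hyperparameter search.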

Empirical results show that this paradigm yields aligned loss–hyperparameter curves (cf. Figures 1 and 4 in (Li et al., 24 Jun 2025)), and end-to-end performance, such as final $L_2$ relative error, matches or surpasses direct tuning at scale, with dramatic reductions in computational cost. For example, μTransfer-FNO on 1B-parameter Navier–Stokes yields error comparable to directly tuned models at $0.30\times$ the compute cost.

3. Generalization to Sparse and Structured Architectures

Static unstructured sparsity breaks the invariances of μP because effective activation and gradient magnitudes shrink by $\sqrt{\rho}$, where $\rho$ is the nonzero density. Sparse Maximal Update Parameterization (SμPar) restores invariance by scaling the initialization and learning rate inversely with $\sqrt{d\rho}$ and $d\rho$, respectively, so that optimal hyperparameters again transfer across both width and sparsity (Dey et al., 2024). Empirically, SμPar enables compute-optimal loss curves across densities, yielding up to 8.2% relative loss improvement at 99.2% sparsity.
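The SμPar rescaling reduces to two one-line corrections; the sketch below uses illustrative names (`sigma_base`, `eta_base`) rather than the paper's notation:

```python
import math

def smupar_hparams(sigma_base, eta_base, d, rho):
    """SmuPar rescaling (sketch): the initialization standard deviation
    shrinks with sqrt(d * rho) and the learning rate with d * rho,
    compensating the sqrt(rho) shrinkage of activations and gradients
    so that updates stay Theta(1) at any nonzero density rho."""
    sigma = sigma_base / math.sqrt(d * rho)
    eta = eta_base / (d * rho)
    return sigma, eta

# Dense baseline (rho = 1) vs. 99.2% sparse (rho = 0.008) at fan-in d = 1024.
sigma_dense, eta_dense = smupar_hparams(0.02, 1e-3, d=1024, rho=1.0)
sigma_sparse, eta_sparse = smupar_hparams(0.02, 1e-3, d=1024, rho=0.008)
```

At 99.2% sparsity the learning rate grows by a factor of $1/\rho = 125$ relative to the dense baseline, while the initialization scale grows only by $1/\sqrt{\rho} \approx 11.2$, which is exactly the asymmetry the $\sqrt{d\rho}$ vs. $d\rho$ scalings encode.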

This transfer principle is also realized in second-order optimization, sharpness-aware minimization (SAM), and local learning algorithms, with each domain introducing appropriate scaling corrections to preserve nontrivial dynamics and hyperparameter transfer (see (Haas et al., 2024, Ishikawa et al., 2023, Ishikawa et al., 2024)).

4. Measure-Theoretic Compute Optimality: Maximal Updated Densities

Beyond deep learning, the "maximal update" paradigm is formalized for stochastic and epistemic inverse problems in the measure-theoretic maximal updated density (MUD) framework. Given a parameter-to-data map $Q$ and an initial density $\pi_0(\theta)$, the data-consistent updated density is constructed as

$$\pi_{\text{update}}(\theta) = \pi_0(\theta) \cdot \left[\frac{\pi_{\text{obs}}(Q(\theta))}{\pi_{\text{pred}}(Q(\theta))}\right],$$

where $\pi_{\text{pred}}$ is the push-forward density under $Q$ (Pilosov et al., 2022). The compute-optimal parameter estimate is the MUD point $\theta_{\text{MUD}} = \arg\max_\theta \pi_{\text{update}}(\theta)$, which maximizes the data-consistent density. Selective regularization ensures that only directions in parameter space not informed by data retain prior regularization, and MUD point theory provides existence and uniqueness under linear-Gaussian assumptions.
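A minimal sketch of the update, under assumptions not taken from the cited papers: a one-dimensional linear map $Q(\theta) = 2\theta$, a standard-normal initial density, and a closed-form push-forward (here $\mathcal{N}(0, 2^2)$, since $Q$ is linear). The MUD point is found by grid search over $\pi_{\text{update}}$:

```python
import math

def normal_pdf(x, mu, sigma):
    """Gaussian density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mud_point(Q, pi0, pi_obs, pi_pred, grid):
    """Grid-search argmax of the data-consistent updated density
    pi_update(theta) = pi0(theta) * pi_obs(Q(theta)) / pi_pred(Q(theta))."""
    def pi_update(theta):
        return pi0(theta) * pi_obs(Q(theta)) / pi_pred(Q(theta))
    return max(grid, key=pi_update)

# Toy linear-Gaussian example (illustrative, not from the cited papers):
Q = lambda t: 2.0 * t
pi0 = lambda t: normal_pdf(t, 0.0, 1.0)      # initial density N(0, 1)
pi_obs = lambda q: normal_pdf(q, 1.0, 0.2)   # observed-data density on Q-space
pi_pred = lambda q: normal_pdf(q, 0.0, 2.0)  # push-forward of pi0 under Q

grid = [i / 1000.0 for i in range(-3000, 3001)]
theta_mud = mud_point(Q, pi0, pi_obs, pi_pred, grid)
print(theta_mud)  # 0.5, the theta with Q(theta) at the observed mode
```

In this fully informed 1D example the prior factor cancels exactly against the push-forward in the ratio, so the MUD point lands at the parameter whose image matches the observed mode, illustrating the selective-regularization property described above.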

This framework further generalizes to sequential data arrival (sMUD), where compute-optimality is maintained via predictive-ratio diagnostics and automated sample reweighting or resampling, enabling real-time parameter drift detection and update (del-Castillo-Negrete et al., 2024). Applications spanning storm-surge models, heat conductivity PDEs, and change-point epidemiology validate the effectiveness and adaptivity of optimal curve computation in stochastic inverse modeling.

5. Algorithmic Implementation and Empirical Performance

The construction of compute-optimal curves universally involves:

  1. Derivation of scaling laws for initialization, learning rate, perturbation radius, or damping terms (as appropriate for the learning algorithm or optimizer).
  2. Empirical validation that either the optimal configuration landscape for the relevant metric (loss, accuracy, generalization error) is invariant to architectural scale, or that update dynamics remain $\Theta(1)$ across width, sparsity, or structural parameters.
  3. Algorithmic recipes enabling "tune once, transfer everywhere" methodology.

For FNOs, a summary recipe is:

| Step | μP Prescription | Empirical Justification |
|---|---|---|
| Spectral kernel initialization | Draw $R_\ell$ entries $\sim \mathcal{N}(0,\ 1/(d\log K))$ | Operator norm invariant |
| Adam learning rate | $\eta_0/\sqrt{d\log K}$ | Learning-rate optima align |
| Transfer of $\eta_0$ | $\eta_0^{\text{big}} = \eta_0^{\text{small}}\cdot\sqrt{\log K_{\text{small}}/\log K_{\text{big}}}$ | Loss–$\eta$ curve collapses |

For SμPar:

| Step | SμPar Prescription | Effect |
|---|---|---|
| Initialization | $\sigma_W = \sigma_{\text{base}} / \sqrt{d\rho}$ | Activation norm invariant |
| Learning rate | $\eta = \eta_{\text{base}} / (d\rho)$ | Gradient and update norm invariant |
| Hyperparameters | Tune once, apply scaling per sparsity | Loss curves align; improvement in ultra-sparse regime |

6. Unique Characteristics and Limitations

The salient property of compute-optimal curves under maximal-update parametrization is the reduction of hyperparameter tuning cost and the preservation of feature evolution, even as the model scale transitions from tractable (proxy) to extreme (billion-parameter, high-sparsity, large-$K$, or high-depth) regimes. The loss-optimal location in hyperparameter space does not drift, and empirical generalization may improve with scale under the correct parameterization.

However, maximal-update parametrizations depend critically on accurate derivation of, and adherence to, the underlying scaling laws. Deviations, such as fixed-radius SAM (Haas et al., 2024) or conventional NTK scaling, destroy transferability and induce kernel-like (lazy) regimes or unstable training. Additionally, extremely low density $\rho$ may violate large-fan-in assumptions and reintroduce signal degradation.

7. Connections and Future Directions

Compute-optimal curves and maximal-update principles now inform the design of optimization algorithms (second-order, sharpness-aware minimization), local learning (predictive coding, target propagation), and structured settings (sparse, spectral, multi-modal architectures). The foundational results extend to enabling robust training in low precision (u-μP (Blake et al., 2024)), measure-theoretic parameter estimation, and dynamic estimation pipelines (sequential MUD), all under a unifying mathematical framework of transfer-invariant scaling.

Ongoing research investigates the extension to dynamic sparsity, adaptive learning-rate schedules, cross-modal transfer, and higher-order correction to scaling laws, with theoretical developments (e.g., meta-principled scaling (Yaida, 2022)) mapping the landscape of effective feature-learning regimes in neural and stochastic computation.


The concept of compute optimal curves consolidates scalable and transferable optimality across machine learning and stochastic modeling, realized through rigorous scaling analysis, measure-theoretic update rules, and comprehensive empirical validation in large-scale, high-dimensional settings (Li et al., 24 Jun 2025, Dey et al., 2024, Pilosov et al., 2022, del-Castillo-Negrete et al., 2024).
