
Power-Law Training Dynamics

Updated 8 February 2026
  • Power-law training dynamics are defined by losses decaying as power-law functions of training time, model size, and data, reflecting scale-free behavior.
  • Recent research leverages operator theory and spectral dynamics to explain multi-regime learning curves and phase transitions across various deep learning models.
  • Understanding these dynamics informs optimal resource allocation and scheduling, enabling precise forecasting of training performance in large-scale neural networks.

Power-law training dynamics refer to the observation that learning curves—describing error, loss, or another relevant performance metric as a function of training time, data, model size, or compute—follow power-law decay over extensive ranges of parameter, data, or time scales. Such laws have been empirically verified across deep learning, kernel machines, diffusion models, reinforcement learning, and random feature settings, with growing theoretical understanding rooted in operator theory, spectral dynamics, and implicit bias arguments. The structure, universality, and mechanisms underlying these power-law behaviors form an essential theoretical toolkit for analyzing, optimizing, and forecasting model training at scale.

1. Mathematical Structures and Key Forms

Power-law training dynamics are formally stated as

L(t) \sim c\,t^{-\alpha} + L_{\mathrm{irr}}

or in more general multi-factor form,

L(t, N, P) \approx A\,t^{-\alpha_t} + B\,N^{-\alpha_N} + C\,P^{-\alpha_P} + L_{\mathrm{irr}}

where L is the loss (or test error), t is training time or steps, N the number of model parameters, P the data size, and the \alpha_\star are scaling exponents, often determined empirically and theoretically by properties of the data (spectral decay of the data covariance or Hessian), the architecture, or the optimization protocol (Bordelon et al., 2024, D'Amico et al., 19 May 2025, Zhang, 20 Dec 2025, Wang, 5 Mar 2025, Meir et al., 2022). In nonlinear or multi-phase settings, more complex forms show transitions, plateaus, or multiple distinct regimes (Worschech et al., 2024, Braun et al., 24 Nov 2025).
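As a concrete illustration, the single-factor form can be recovered from a synthetic loss curve by a log-log linear fit once the irreducible floor is subtracted. This is a minimal sketch: the curve, the floor L_irr = 0.1, and the exponent 0.5 are invented values, and in practice L_irr is unknown and must be fitted jointly (e.g. by nonlinear least squares).

```python
import numpy as np

# Synthetic loss curve L(t) = c * t^(-alpha) + L_irr with assumed
# c = 2.0, alpha = 0.5, L_irr = 0.1 (illustrative values only).
t = np.arange(1, 1001, dtype=float)
loss = 2.0 * t ** -0.5 + 0.1

# Subtract the (here known) irreducible floor, then fit a line in
# log-log space: log(L - L_irr) = log(c) - alpha * log(t).
slope, intercept = np.polyfit(np.log(t), np.log(loss - 0.1), 1)
alpha_hat, c_hat = -slope, np.exp(intercept)
print(alpha_hat, c_hat)  # recovers alpha ≈ 0.5, c ≈ 2.0
```

The same subtract-and-fit procedure extends to the multi-factor form by fitting each resource axis while the others are held at their largest available values.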

Originating in critical phenomena, these forms encode scale-free decay with no characteristic timescale or size threshold up to the regime limitations set by architectural bottlenecks or finite resource ceilings (Meir et al., 2022, Hutter, 2021).

2. Operator-Theoretic and Spectral Mechanisms

Recent theory unifies power-law training dynamics via spectral transport-dissipation PDEs derived from the evolution of the error in function space (Zhang, 11 Dec 2025, Zhang, 20 Dec 2025):

\partial_t g(\lambda, t) + \partial_\lambda\big(v(\lambda, t)\,g(\lambda, t)\big) = -\lambda\,g(\lambda, t) + S(\lambda, t)

where g(\lambda, t) represents the error amplitude in mode \lambda (an eigenvalue of a parameter-to-function kernel or Hessian), v(\lambda, t) is a drift velocity (often power-law in \lambda), and S encodes spectral mode-coupling. Under sufficient regularity and weak-coupling conditions, S is negligible, so the primary evolution combines drift and local relaxation. Learning rates across shells or resolution scales are governed by coarse-grained conservation laws, as in the Generalized Resolution-Shell Dynamics (GRSD) framework (Zhang, 20 Dec 2025), which requires graph-banded locality, incoherence, and log-shift invariance for genuine renormalizability and scaling-law emergence.

Self-similar solutions to these PDEs yield explicit scaling-law exponents,

L(t) \propto t^{-\gamma}, \quad \gamma = \frac{2b-1}{b-1},

where the "spectral drift" exponent b is determined by operator regularity and the eigenvalue decay, connecting the macroscopic loss exponents directly to microscopic architecture and data properties (Zhang, 11 Dec 2025, Zhang, 20 Dec 2025).
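The drift-to-loss exponent map is simple enough to evaluate directly; a quick numeric sketch of the relation above, with the values of b chosen arbitrarily for illustration:

```python
def loss_exponent(b: float) -> float:
    """Scaling-law exponent gamma = (2b - 1) / (b - 1) from the
    self-similar solution; valid away from the singularity at b = 1."""
    return (2.0 * b - 1.0) / (b - 1.0)

for b in (1.5, 2.0, 3.0, 10.0):
    # gamma decreases monotonically toward 2 as b grows,
    # and diverges as b approaches 1 from above.
    print(b, loss_exponent(b))
```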

3. Roots in Data and Model Spectra

The universality and values of the power-law exponents are intimately linked to the spectral decay of the data covariance or Hessian. For data with covariance eigenvalues \lambda_k \sim k^{-\beta}, learning-curve exponents are determined by spectral integrals and mode-dependent relaxation, giving L(t) \propto t^{-\beta/(1+\beta)} (Wang, 5 Mar 2025, Worschech et al., 2024); more generally, the time to learn a mode with variance \lambda_k scales as T_k \sim \lambda_k^{-1} (the "inverse variance law"), a phenomenon known as power-law spectral bias (Wang, 5 Mar 2025). In student–teacher and quadratic random feature models, similar power laws are derived, and phase transitions between fast and slow regimes are set by the spectrum's tail index (heavy vs. light) (Worschech et al., 2024, Arous et al., 5 Aug 2025).
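The inverse variance law can be checked directly with gradient descent on decoupled quadratic modes: the number of steps needed to shrink a mode's error below a fixed threshold scales as 1/\lambda. A minimal sketch, with the eigenvalues, learning rate, and threshold chosen arbitrarily for illustration:

```python
def steps_to_threshold(lam: float, lr: float = 1.0, eps: float = 1e-2) -> int:
    """Gradient descent on f(w) = lam * w^2 / 2 contracts the error
    by a factor (1 - lr * lam) per step; count steps until the
    relative error drops below eps."""
    err, steps = 1.0, 0
    while err >= eps:
        err *= 1.0 - lr * lam
        steps += 1
    return steps

t_fast = steps_to_threshold(0.01)   # higher-variance mode
t_slow = steps_to_threshold(0.001)  # eigenvalue 10x smaller
print(t_slow / t_fast)  # ≈ 10: learning time scales as 1/lambda
```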

In shallow memorization/dictionary models, the test error falls as n^{-\beta}, with \beta set by the decay of the label frequencies \theta_i \sim i^{-(1+\alpha)} via \beta = \alpha/(1+\alpha) (Hutter, 2021). This underscores that universality arises only when the data distribution is sufficiently "heavy-tailed".
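The memorization exponent can be recovered numerically from the expected mass of unseen labels after n i.i.d. draws. A deterministic sketch, assuming a Zipf tail with \alpha = 1 (so the predicted exponent is \alpha/(1+\alpha) = 1/2); the frequency cutoff and sample sizes are arbitrary illustrative choices:

```python
import numpy as np

# Label frequencies theta_i ∝ i^(-(1+alpha)) with alpha = 1,
# truncated at a large cutoff for a deterministic computation.
alpha = 1.0
i = np.arange(1, 1_000_001, dtype=float)
theta = i ** -(1.0 + alpha)
theta /= theta.sum()

def expected_error(n: int) -> float:
    """Expected mass of labels unseen after n i.i.d. draws:
    sum_i theta_i * (1 - theta_i)^n."""
    return float((theta * (1.0 - theta) ** n).sum())

e1, e2 = expected_error(1_000), expected_error(100_000)
slope = np.log(e1 / e2) / np.log(100)  # empirical decay exponent
print(slope)  # close to alpha / (1 + alpha) = 0.5
```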

4. Architectures, Optimization Protocols, and Universal Regimes

While precise exponents require detailed spectral knowledge, certain universality classes emerge due to architectural or algorithmic structure:

  • Superposition bottlenecks: when high-dimensional inputs are compressed via random projections or shared embedding spaces (as in LLMs), the bottleneck induces a universal L(t) \sim t^{-1} training law, independent of data statistics, accompanied by a dramatic acceleration over sequential learning (Chen et al., 1 Feb 2026).
  • Softmax + cross-entropy output layers in LLMs: these impose a universal time exponent of 1/3, so that L(t) \propto t^{-1/3}. This is attributed to the analytic structure of the softmax at low temperature, not to any data spectrum (Liu et al., 3 Feb 2026).
  • Dynamical implicit bias: even in the presence of unbounded norm growth, implicit maximization of spectral complexity by gradient descent yields a universal L(\lambda) \sim \lambda^{-\gamma_1} law relating test error to norm, and an optimal-norm scaling \lambda_{\mathrm{opt}}(P) \sim P^{\gamma_2}, whose product recovers the classic data-size scaling law at late times (D'Amico et al., 19 May 2025).

The phase portrait is often multi-regime, with initial plateaus, "escape" or symmetry-breaking phases, followed by slow spectral-tail convergence governed by the power-law exponents (Braun et al., 24 Nov 2025, Worschech et al., 2024).

5. Impact on Training Efficiency, Scheduling, and Optimization

Power-law dynamics are crucial for scheduling and optimal resource allocation:

  • Compute-optimal scaling laws (random feature and kernel regimes): to minimize loss at a fixed compute budget C = N t when the data eigenvalue spectrum is \lambda_k \sim k^{-b}, allocate N \sim C^{1/(b+1)} and t \sim C^{b/(b+1)}; flatter spectra (small b) favor spending compute on model size, while steeply decaying spectra favor longer training (Bordelon et al., 2024, Bordelon et al., 4 Feb 2026).
  • Multi-power law loss prediction: Loss curves across complex, non-monotonic learning-rate schedules are quantitatively predicted by multi-power laws, combining base S(t)^{-\alpha} scaling with loss-reduction terms at each LR decay, enabling the discovery of slightly superior schedules and offering a general framework for loss forecasting (Luo et al., 17 Mar 2025).
  • Learning rate and batch size schedules: In power-law random feature models, the theoretically optimal learning rate decays polynomially or follows warmup-plateau-decay (WSD) profiles depending on regime; batch size ramps and joint optimization are required for wall-clock time optimality (Bordelon et al., 4 Feb 2026).
  • Differentiation across supervised, RL, and generative modeling: Neural scaling laws extend robustly to single-agent reinforcement learning, where “intrinsic performance” scales as a power law in model size and interaction, with exponents similar to those in generative and supervised settings (Hilton et al., 2023, Maloney et al., 2022).
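The compute-optimal split in the first bullet above is easy to state in code. This is a sketch of the asymptotic rule only; the budget C and exponents b are illustrative, and real allocations also depend on constant prefactors that the asymptotic rule ignores:

```python
def compute_optimal_split(C: float, b: float) -> tuple[float, float]:
    """Allocate a compute budget C = N * t under an eigenvalue
    spectrum lambda_k ~ k^(-b): N ~ C^(1/(b+1)), t ~ C^(b/(b+1))."""
    N = C ** (1.0 / (b + 1.0))
    t = C ** (b / (b + 1.0))
    return N, t

for b in (0.5, 1.0, 3.0):
    N, t = compute_optimal_split(1e6, b)
    # N * t recovers C; larger b shifts the budget toward training time.
    print(b, N, t, N * t)
```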

6. Limitations, Breakdown, and Special Cases

Power-law regimes end when effective resources (model size, data, or compute) saturate latent structure:

  • Plateau and phase transition: When model/data exhaust the nontrivial spectrum (e.g., N \times P exceeds the "intrinsic dimension"), loss arrests at a noise floor or irreducible variance (Maloney et al., 2022).
  • Spectral transition points: Phase transitions or crossover in loss decay (from exponential to power-law) can occur upon transitioning from bulk to tail learning in the data spectrum (Worschech et al., 2024, Braun et al., 24 Nov 2025).
  • Architectural exceptions: Certain architectures (e.g., ReLU threshold-power-law RNNs) violate scale invariance such that coupling strength or hyperparameters must be explicitly tuned; non-ReLU threshold power-law networks, by contrast, are scale-invariant and their training accuracy is independent of coupling strength (Nicola, 30 Nov 2025).

Power-law dynamics can also be perturbed or destroyed by loss of functional regularity, excessive nonlocal spectral coupling, or schedule/architecture choices that violate shift-invariance and locality assumptions required for renormalizability (Zhang, 20 Dec 2025, Zhang, 11 Dec 2025, Nakazato, 2024).

7. Implications and Applications

Power-law training dynamics enable quantitative forecasting of loss as training time, data, and model size grow; compute-optimal allocation across parameters, data, and steps; principled learning-rate and batch-size scheduling; and early diagnosis of plateaus, phase transitions, and saturation.

The continued study of power-law training dynamics, including operator origins, phase transitions, and the impact of model structure on universality classes, remains central for both fundamental theory and the rapid empirical progress in large-scale deep learning.
