
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Published 17 Mar 2025 in cs.LG, cs.AI, cs.CL, and stat.ML | (2503.12811v1)

Abstract: Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we present an empirical law that describes how the pretraining loss of LLMs evolves under different learning rate schedules, such as constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect induced by learning rate decay. We extensively validate this law on various model sizes and architectures, and demonstrate that after fitting on a few learning rate schedules, the law accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (Hu et al., 2024) but achieves a slightly lower final loss. We believe these results could offer valuable insights for understanding the dynamics of pretraining and designing learning rate schedules to improve efficiency.

Summary

  • The paper introduces a Multi-Power Law framework that accurately predicts the entire loss curve of LLM pretraining using cumulative learning rate sums and a power-law decay term.
  • It employs a novel empirical approach that outperforms existing methods like the Chinchilla and Momentum Laws through effective schedule-aware loss prediction.
  • Optimized learning rate schedules derived from MPL demonstrate improved training efficiency and stability across various model sizes and non-monotonic scheduling scenarios.

Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

Motivation and Problem Statement

The paper introduces the Multi-Power Law (MPL), an empirical scaling law for predicting the entire pretraining loss curve of LLMs under arbitrary learning rate (LR) schedules. Existing scaling laws, such as the Chinchilla law, model final loss as a function of model size and training steps, but do not account for the LR schedule, which is a critical hyperparameter for optimization efficiency and stability. The MPL addresses the schedule-aware loss curve prediction problem: given a sequence of learning rates $E = \{\eta_1, \eta_2, \dots, \eta_T\}$, can one accurately predict the training loss $L(t)$ at every step?

Derivation and Formula of the Multi-Power Law

The MPL is derived through a bottom-up empirical approach, starting from constant LR schedules, then analyzing two-stage and multi-stage decaying schedules, and finally generalizing to arbitrary monotonic and non-monotonic schedules. The law is formulated as:

$$L(t) = L_0 + A \cdot (S_1(t) + S_W)^{-\alpha} - LD(t)$$

where $S_1(t) = \sum_{\tau=1}^t \eta_\tau$ is the cumulative LR sum post-warmup, $S_W$ is the warmup LR sum, and $LD(t)$ is a loss reduction term capturing the effect of LR decay:

$$LD(t) = B \sum_{k=1}^t (\eta_{k-1} - \eta_k) \cdot G\big(\eta_k^{-\gamma} S_k(t)\big), \quad S_k(t) = \sum_{\tau=k}^t \eta_\tau$$

with $G(x) = 1 - (Cx + 1)^{-\beta}$, where $L_0, A, B, C, \alpha, \beta, \gamma$ are fit from a small number of training runs.
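The law can be evaluated directly from an LR sequence. Below is a minimal sketch of how $L(t)$ could be computed for an arbitrary schedule; the parameter values in the usage comment are illustrative placeholders rather than fitted constants, and $\eta_0$ is taken equal to $\eta_1$ (an assumption made here so the first decay term vanishes; the paper's warmup handling is not reproduced).

```python
import numpy as np

def mpl_loss(etas, L0, A, B, C, alpha, beta, gamma, S_W=0.0):
    """Evaluate the Multi-Power Law L(t) at every step of a post-warmup
    LR sequence etas = (eta_1, ..., eta_T)."""
    etas = np.asarray(etas, dtype=float)
    T = len(etas)
    S1 = np.cumsum(etas)                                  # S_1(t) = sum_{tau<=t} eta_tau
    eta_prev = np.concatenate(([etas[0]], etas[:-1]))     # eta_0 := eta_1 (assumption)
    G = lambda x: 1.0 - (C * x + 1.0) ** (-beta)
    losses = np.empty(T)
    for t in range(T):
        # Suffix sums S_k(t) = S_1(t) - S_1(k-1) for k = 1..t
        Sk = S1[t] - np.concatenate(([0.0], S1[:t]))
        LD = B * np.sum((eta_prev[:t + 1] - etas[:t + 1])
                        * G(etas[:t + 1] ** (-gamma) * Sk))
        losses[t] = L0 + A * (S1[t] + S_W) ** (-alpha) - LD
    return losses
```

With a constant schedule the decay term vanishes (all $\eta_{k-1}-\eta_k = 0$), so the curve reduces to the Chinchilla-style first two terms; a decaying schedule with the same cumulative LR sum yields a strictly lower predicted final loss through $LD(t)$.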

The first two terms generalize the Chinchilla law by replacing the number of steps with the cumulative LR sum, while the $LD(t)$ term models the additional loss reduction from LR decay, empirically shown to follow a power law rather than an exponential form.

Empirical Validation

Extensive experiments validate the MPL across Llama-2, GPT-2, and OLMo architectures, with model sizes from 25M to 1B parameters and training horizons up to 72,000 steps. The law is fit using only two or three schedule-loss curve pairs and generalizes to unseen schedules, longer horizons, and non-monotonic schedules (e.g., cyclic, random-polyline).

Figure 1: MPL achieves higher $R^2$ and lower error metrics than the Momentum Law across 25M, 100M, and 400M models for unseen schedules.

Figure 2: MPL accurately predicts loss curves for long-horizon cyclic and random-polyline schedules, demonstrating generalization to non-monotonic LR schedules.

Figure 3: MPL fitting and prediction details for various model sizes, step lengths, and LR schedules, showing high accuracy across configurations.

Ablation studies show that omitting the $LD(t)$ term or replacing the power law with an exponential form degrades prediction accuracy. The law is robust to changes in peak LR, batch size, and random seed, with prediction errors approaching the lower bound set by seed variability.
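Fitting the law is a small nonlinear least-squares problem. As a self-contained illustration of one piece of it, consider the constant-LR case, where $LD(t) = 0$ and the law reduces to its first two terms; the data below is synthetic and the true parameter values are invented for the demo (real usage would fit logged loss curves, and the paper's exact fitting procedure is not reproduced here):

```python
import numpy as np
from scipy.optimize import curve_fit

def constant_lr_law(S1, L0, A, alpha):
    # With a constant LR there is no decay, so LD(t) = 0 and
    # L(t) = L0 + A * S_1(t)^(-alpha)   (warmup sum S_W omitted here).
    return L0 + A * S1 ** (-alpha)

# Synthetic stand-in for a measured constant-LR loss curve
# (eta = 1e-3 for 5000 steps, small observation noise).
steps = np.arange(1, 5001)
S1 = 1e-3 * steps
rng = np.random.default_rng(0)
observed = constant_lr_law(S1, 2.1, 0.5, 0.4) + rng.normal(0.0, 1e-3, steps.shape)

# Recover (L0, A, alpha) by nonlinear least squares.
(L0_hat, A_hat, alpha_hat), _ = curve_fit(constant_lr_law, S1, observed,
                                          p0=[2.0, 1.0, 0.5])
```

Fitting the full law, including $B, C, \beta, \gamma$, works the same way but regresses over two or three complete schedule-loss curve pairs at once.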

Theoretical Analysis

A theoretical justification is provided for quadratic loss functions with power-law spectra in the Hessian and noise covariance. Under these assumptions, the expected loss curve of SGD with arbitrary LR schedule matches the MPL form, with the loss reduction term saturating as a power law in the scaled LR sum. The analysis highlights the connection between spectral properties of the optimization landscape and the empirical scaling law, though the explicit dependence on ηk\eta_k in practice is not fully captured by the quadratic theory.

Schedule Optimization and Practical Implications

By minimizing the predicted final loss under the MPL, the authors optimize the LR schedule via gradient-based methods, subject to monotonicity constraints. The resulting schedules consistently outperform cosine and WSD schedules, both in final loss and downstream task performance.

Figure 4: The optimized LR schedule (Opt) outperforms cosine and WSD schedules for a 400M Llama-2 model, with a stable phase followed by a power-law decay.

Figure 5: Optimized schedules and their loss curves compared to cosine, WSD, and WSDLD schedules for 25M and 100M models.

Figure 6: Optimized and cosine LR schedules and corresponding loss curves for long-horizon training across model sizes.

The optimized schedules exhibit a Warmup-Stable-Decay (WSD) pattern, but with a decay phase following a power law rather than linear or exponential decay. The optimized schedules decay to near-zero LR, outperforming WSD variants even when their ending LR is tuned to zero.

Comparison to Baselines

MPL outperforms the Chinchilla law and the Momentum Law (MTL) in both sample efficiency and prediction accuracy. While Chinchilla requires multiple full training runs for fitting, MPL achieves superior accuracy with only a few curves. MTL, which models loss reduction as an exponential function of LR reduction, provably yields suboptimal collapsed schedules and fails to match the empirical power-law behavior observed in loss reduction.

Implications and Future Directions

The MPL provides a practical tool for schedule-aware loss curve prediction, enabling efficient hyperparameter tuning and schedule optimization for LLM pretraining. Its empirical accuracy and theoretical grounding suggest that power-law relationships govern not only model/data scaling but also the dynamics induced by LR schedules. The law's ability to generalize to arbitrary schedules and long horizons is particularly relevant for compute-constrained training and continual learning scenarios.

Future work should refine the theoretical analysis to account for non-quadratic loss landscapes and further investigate the dependence of MPL parameters on other hyperparameters. Extending MPL to capture the effects of adaptive optimizers, data mixing, and transfer learning remains an open challenge.

Conclusion

The Multi-Power Law establishes a robust, empirically validated framework for predicting the entire loss curve of LLM pretraining under arbitrary LR schedules. It enables efficient schedule optimization, surpasses existing scaling laws in accuracy and sample efficiency, and provides new insights into the interplay between optimization dynamics and hyperparameter schedules. The approach is extensible to diverse architectures and training regimes, with significant implications for both theoretical understanding and practical deployment of large-scale models.
