Optimal LR Decay Schedule
- An optimal learning rate decay schedule is a strategy that adapts the learning rate in gradient-based optimizers to balance convergence speed and loss minimization.
- It encompasses explicit decay functions and adaptive, meta-learned algorithms, tuned to problem convexity, nonstationarity, and resource constraints.
- Empirical evaluations show that schedules like linear, polynomial, cosine, and warmup-stable-decay can be optimally adjusted using data-dependent and theoretical insights.
An optimal learning rate decay schedule is a strategy that dynamically adapts the learning rate of gradient-based optimizers to maximize convergence speed and minimize training loss, under constraints determined by the statistical structure of the learning task and optimization dynamics. This encompasses both explicit decay functions and adaptive, feedback-driven, or meta-learned scheduling algorithms. Optimal schedules are highly context-dependent, varying by problem convexity, nonstationarity, regime (e.g., data-limited vs. compute-limited), and model spectrum. Recent advances include theoretically-justified closed-form schedules, empirical procedures adapted to non-convex and non-stationary problems, adaptive, dynamic, and meta-learning-driven approaches, and decay-free alternatives based on checkpoint/model averaging.
1. Theoretical Underpinnings of Optimal Learning Rate Decay
Early work on stochastic optimization established that for convex, Lipschitz losses, a learning rate of the form $\eta_t = \eta_0/\sqrt{t}$ or, more generally, a polynomial decay ($\eta_t = \eta_0\, t^{-\alpha}$), balances the bias–variance tradeoff and ensures convergence, with minimax optimality for $\alpha = 1/2$ in convex settings. In high-dimensional, nonconvex landscapes this paradigm fails: saddle proliferation and roughness require a slower decay, or even an initial phase of constant learning rate, to facilitate exploration (d'Ascoli et al., 2022). Analytical results in random landscape models establish that the optimal decay exponent is problem-dependent, recovering the classical $\eta_t \propto 1/t$ rate only for quadratic (locally convex) loss, and in multi-phase models a two-phase schedule (constant or slowly decaying during exploration, sharp decay upon entering a convex basin) is universally superior to monotonic decay.
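To make the bias–variance tradeoff concrete, here is a minimal, self-contained sketch (illustrative names and constants, not taken from any cited paper) of SGD with the polynomial decay $\eta_t = \eta_0/t^{\alpha}$ at $\alpha = 1/2$ on a noisy one-dimensional quadratic:

```python
import random

def sgd_poly_decay(eta0=0.5, alpha=0.5, steps=5000, noise=0.1, seed=0):
    """Minimize f(w) = w^2/2 from noisy gradients g_t = w + xi_t,
    using the polynomial decay eta_t = eta0 / t**alpha."""
    rng = random.Random(seed)
    w = 1.0
    for t in range(1, steps + 1):
        g = w + rng.gauss(0.0, noise)    # stochastic gradient of w^2/2
        w -= (eta0 / t ** alpha) * g     # eta_t -> 0 suppresses noise variance
    return w

print(abs(sgd_poly_decay()))  # small residual: bias and variance both decay
```

A larger $\alpha$ suppresses late-stage gradient noise faster but slows the decay of the initial bias; $\alpha = 1/2$ balances the two in the convex Lipschitz setting.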
The Functional Scaling Law (FSL) framework has recently unified these phenomena by connecting optimization dynamics to the capacity and source exponents of the target/model class. The FSL analysis reveals two regimes (Li et al., 23 Sep 2025, Li et al., 6 Feb 2026, Bordelon et al., 4 Feb 2026):
- Easy regime (target easy relative to the model spectrum): The optimal schedule is a power decay, with an appropriately shrinking peak rate.
- Hard regime (target hard relative to the model spectrum): The optimal schedule is "warmup-stable-decay" (WSD): hold the maximum stable learning rate for nearly the entire training run, then decay rapidly to zero over a vanishingly small fraction at the end.
This variational calculus approach yields power-law optimality for appropriate spectral conditions, and matches rates achieved by kernel regression lower bounds. In all cases, tuning the decay schedule to data-dependent exponents is critical for optimality (Li et al., 6 Feb 2026, Bordelon et al., 4 Feb 2026).
2. Canonical Schedules: Linear, Power, Step, and Cosine Decay
Standard learning rate decay schedules include:
- Linear decay: $\eta_t = \eta_0\,(1 - t/T)$; proven minimax optimal for the final iterate in convex settings, and typically matches or outperforms cosine in deep learning experiments (Defazio et al., 2023). Empirically, linearly decaying to zero outperforms cosine-to-10% or fixed decay in LLMs at compute-optimal scales (Bergsma et al., 21 Feb 2025).
- Polynomial decay: $\eta_t = \eta_0\,(1 + t)^{-\alpha}$; a higher $\alpha$ increases robustness to misspecification of $\eta_0$, with tuning-robust convergence (Attia et al., 12 Mar 2025).
- Cosine decay: $\eta_t = \tfrac{\eta_0}{2}\bigl(1 + \cos(\pi t/T)\bigr)$; effective, especially when further modified, but not minimax optimal in all settings.
- Step decay: Decay the learning rate by a constant factor every fixed number of epochs; shown to improve final-iterate minimax bounds over polynomial decay for least squares, albeit at the cost of extra logarithmic factors (Ge et al., 2019).
Advanced forms, including REX (Reflected Exponential) and k-decay, modulate the end-of-training decay shape to tune the tradeoff between rapid initial descent and aggressive late-stage error reduction, showing improved empirical performance across diverse tasks (Chen et al., 2021, Zhang et al., 2020).
| Schedule | Optimal Regime | Theoretical Guarantees |
|---|---|---|
| Linear | Convex, many deep nets, LLMs (at scale) | Minimax for last iterate |
| Power decay | Nonconvex, high-dim, "easy" FSL regime | Matches FSL-optimal rates (easy regime) |
| WSD | Nonconvex, "hard" FSL regime, LLM scaling | Optimal in the hard FSL regime |
| Step decay | Least squares, eigen-spectrum separation | Minimax up to log factors |
| Cosine | LLMs, vision (empirical), eigen-fat spectrum | Near-optimal in practice |
| REX, k-decay | Budget-uncertain/vision/finetune | Top-1 in budget-uncertain |
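The canonical schedules above can be written as step-indexed functions. The sketch below uses illustrative signatures and default warmup/decay fractions (assumptions for concreteness, not prescriptions from the cited papers):

```python
import math

def linear_decay(t, T, eta0):
    """Linear decay-to-zero: eta_t = eta0 * (1 - t/T)."""
    return eta0 * (1.0 - t / T)

def poly_decay(t, eta0, alpha=0.5):
    """Polynomial decay: eta_t = eta0 / (1 + t)**alpha."""
    return eta0 / (1.0 + t) ** alpha

def cosine_decay(t, T, eta0, eta_min=0.0):
    """Cosine annealing from eta0 at t=0 down to eta_min at t=T."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1.0 + math.cos(math.pi * t / T))

def step_decay(t, eta0, drop=0.1, every=30):
    """Multiply the rate by `drop` every `every` epochs."""
    return eta0 * drop ** (t // every)

def wsd(t, T, eta0, warmup_frac=0.02, decay_frac=0.1):
    """Warmup-stable-decay: linear warmup, long plateau, fast linear decay."""
    warmup, decay_start = warmup_frac * T, (1.0 - decay_frac) * T
    if t < warmup:
        return eta0 * t / warmup
    if t < decay_start:
        return eta0
    return eta0 * (T - t) / (T - decay_start)
```

For example, `wsd(9500, 10_000, 3e-4)` sits halfway down the final decay ramp and returns half the peak rate.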
3. Data-Dependent and Adaptive Schedules
Adaptive and meta-learning schedules have emerged to address the nonstationarity and unpredictability of complex loss landscapes:
- Dynamic Bandit-based Scheduling (LRRL): Casts the selection of learning rates as a multi-armed bandit problem, where each arm corresponds to a candidate LR, and rewards correspond to cumulative return (in RL) improvements over recent meta-steps. Weights for each arm are maintained and exponentially updated, with a softmax normalization to maintain exploration. The policy adapts by observing recent performance, shifting towards LRs that provide the steepest return gains. This technique outperforms well-tuned fixed and exponentially decaying baselines in nonstationary deep RL (Donâncio et al., 2024).
- Gradient-norm adaptive refinement: Schedules that reweight the LR trajectory by empirical inverse-squared gradient norms, smoothed over recent windows, enable algorithmic warmup and rapid late decay, yielding an additional 1–5% improvement over the best fixed polynomial/linear schedules (Defazio et al., 2023).
- GreedyLR: A loss-driven multiplicative adjustment in which the learning rate is increased (divided by a factor $\gamma \in (0,1)$) if the loss decreases and decreased (multiplied by $\gamma$) otherwise; closed-form results show the rule attains the optimal constant step size, and it is robust and adaptive, outperforming cosine/exponential schedules in convex and large-scale LLM settings (Subramanian et al., 16 Dec 2025).
- Meta-criteria based schedules: Approaches that monitor signals such as the weight norm or variational parameter trajectory SNR to trigger decay (ABEL, DLRD) have shown optimal timing of decay events in DNNs and stochastic variational inference (Lewkowycz, 2021, Dinkel et al., 2024).
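The bandit-based selection described above (LRRL) can be illustrated with a simplified exponential-weights/softmax controller. This is a sketch of the general mechanism only, not the published algorithm; the class and parameter names are invented for illustration:

```python
import math
import random

class BanditLRScheduler:
    """Simplified bandit over candidate learning rates: arm weights are
    updated exponentially from observed rewards (e.g. recent return or
    loss improvement) and turned into sampling probabilities by a softmax."""
    def __init__(self, candidate_lrs, temperature=1.0, seed=0):
        self.lrs = list(candidate_lrs)
        self.log_w = [0.0] * len(self.lrs)   # log arm weights
        self.temp = temperature
        self.rng = random.Random(seed)
        self.last = None                     # index of most recent arm

    def probs(self):
        m = max(self.log_w)                  # stabilized softmax
        exps = [math.exp((w - m) / self.temp) for w in self.log_w]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(self):
        r, acc = self.rng.random(), 0.0
        for i, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                self.last = i
                return self.lrs[i]
        self.last = len(self.lrs) - 1        # numerical edge case
        return self.lrs[-1]

    def update(self, reward):
        if self.last is None:
            return
        # importance-weighted exponential update of the sampled arm
        self.log_w[self.last] += reward / max(self.probs()[self.last], 1e-8)
```

A training loop would call `sample()` before each meta-step, apply the returned LR for a window of updates, then call `update()` with the observed improvement, so probability mass shifts toward rates that yield the steepest gains.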
| Method | Adaptivity Signal | Optimality Mode |
|---|---|---|
| LRRL | Bandit, RL returns | Nonstationary RL regimes |
| Gradient-norm | Global or per-dimension gradient norms | Any task, uses observed loss geometry |
| GreedyLR | Instantaneous loss | Proven and robust |
| ABEL | Weight norm bounce | DNN+WD, matched tuned step |
| DLRD | Empirical SNR (SVI) | Plateau-aware, SVI |
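Of the table entries, GreedyLR admits a particularly compact sketch. The controller below implements the multiplicative grow/shrink rule on the observed loss; the factor $\gamma = 0.7$, the stability cap, and the toy quadratic are illustrative assumptions rather than values from the cited paper:

```python
class GreedyLR:
    """Loss-driven multiplicative LR control: grow the rate (divide by
    gamma < 1) when the loss improves, shrink it (multiply by gamma)
    when it does not."""
    def __init__(self, eta0=0.05, gamma=0.7):
        assert 0.0 < gamma < 1.0
        self.eta, self.gamma = eta0, gamma
        self.prev_loss = float("inf")

    def step(self, loss):
        if loss < self.prev_loss:
            self.eta /= self.gamma   # improving: be more aggressive
        else:
            self.eta *= self.gamma   # worsening: back off
        self.prev_loss = loss
        return self.eta

# toy run on f(w) = (w - 3)^2 with exact gradients
ctrl, w = GreedyLR(), 0.0
for _ in range(100):
    eta = min(ctrl.step((w - 3.0) ** 2), 0.45)  # cap keeps the quadratic stable
    w -= eta * 2.0 * (w - 3.0)
print(round(w, 4))  # → 3.0
```

The rate ramps up automatically while the loss falls, giving an implicit warmup, and backs off on plateaus, giving an implicit decay.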
4. Theory-Guided Schedules under Scaling Laws and Practical Constraints
The FSL analysis for LLM and kernel regression training yields closed-form, resource-constrained formulas for peak rate and decay:
- In data-limited regimes, set the peak learning rate as a power of the data budget $D$ (tokens or samples), with the decay shape dictated by the spectrum and source exponent (Li et al., 23 Sep 2025).
- In compute-limited settings, optimally splitting the compute budget $C$ between model capacity and data yields power-law scaling of each with $C$ (Bordelon et al., 4 Feb 2026).
- WSD-like schedules: Hold a constant maximal LR for up to 90% of training, then decay to zero over the small final fraction; empirically superior in LLMs and large-scale kernel regression (Li et al., 6 Feb 2026, Li et al., 23 Sep 2025).
- Merge-based/decay-free schedules (WSM): Replace decay by merging the last few checkpoints (with weights derived to mimic classical decay) post-hoc, achieving or surpassing the efficiency of tuned cosine or linear-decay schedules and allowing “anytime” training without restarts or pre-specified horizons (Tian et al., 23 Jul 2025, Meterez et al., 3 Feb 2026, Luo et al., 24 Nov 2025).
These predictions have been validated up to 1–7B parameter LLMs, with D2Z linear decay, WSD, and merge-based/EMA-based approaches all surpassing classical warmup–cosine–to–10% scheduling under Chinchilla-scaling constraints (Bergsma et al., 21 Feb 2025, Meterez et al., 3 Feb 2026).
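A minimal sketch of the merge-based idea: average the last few checkpoints with recency-increasing weights instead of running a decay phase. The linear weights here are a simple stand-in; WSM derives its merge weights to mimic a chosen classical decay shape:

```python
import numpy as np

def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter vectors from the last k checkpoints,
    applied post-hoc in place of an explicit decay phase."""
    ckpts = np.stack(checkpoints)            # shape (k, dim)
    if weights is None:
        # recency-weighted stand-in: newer checkpoints count more
        weights = np.arange(1, len(checkpoints) + 1, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ ckpts

# toy use: average the tail of a constant-LR SGD run instead of decaying
rng = np.random.default_rng(0)
w, tail = np.array([1.0]), []
for t in range(500):
    w = w - 0.2 * (w + 0.3 * rng.standard_normal(1))  # noisy grad of w^2/2
    if t >= 400:
        tail.append(w.copy())
merged = merge_checkpoints(tail)
print(merged)  # averaging the tail suppresses the constant-LR noise floor
```

Because the merge happens outside the training loop, any prefix of the run can be merged at any time, which is what makes the approach horizon-agnostic.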
5. Non-Idealities: Distribution Shift, Curriculum Training, and Robustness
Learning rate schedule optimality is context-sensitive:
- Distribution shift: When data distributions drift, optimal policies require increasing (not only decaying) the learning rate at times of high shift (“catch up” phenomenon). Explicit closed-form adaptive schedules are derived for SGD in shifting linear, convex, and nonconvex losses, and tested in neural network online learning (Fahrbach et al., 2023).
- Curriculum learning: In LLM curriculum-based pretraining, severe LR decay before high-quality data arrive negates the benefit of data ordering. Moderating the final decay, using a nonzero final LR (e.g., a small constant fraction of the peak rate), or combining with checkpoint/model averaging is required to extract the curriculum signal (Luo et al., 24 Nov 2025).
- Tuning robustness: Polynomial decay, cosine decay, and WSD all exhibit milder degradation under misspecification of the peak learning rate $\eta_0$ than fixed-LR training, owing to the sublinear dependence of the final regret on grid-spaced tuning factors (Attia et al., 12 Mar 2025).
6. Empirical Guidelines and Implementation
Research over the last five years yields the following practical protocols for schedule selection and tuning:
- For convex or deep-learning settings, a brief linear warmup (2–10% of steps) followed by linear decay-to-zero (D2Z) matches or exceeds the best alternatives in most settings. The peak rate $\eta_0$ can be found on proxy models, then transferred (Defazio et al., 2023, Bergsma et al., 21 Feb 2025).
- In LLM or transformer pre-training, WSD or D2Z with the peak rate tuned on proxy data, and with the decay phase set to occupy just 10–20% of training, attains compute-optimal power-law loss scaling (Li et al., 23 Sep 2025, Bergsma et al., 21 Feb 2025, Li et al., 6 Feb 2026).
- In high-capacity or data-limited “hard tasks”, push decay as late as stability permits and monitor for noise-induced plateaus.
- For reinforcement learning or highly non-stationary tasks, use adaptive bandit or gradient-norm/loss-based schedules (Donâncio et al., 2024, Subramanian et al., 16 Dec 2025).
- For SVI, variational inference, or when oscillations/plateaus dominate, trigger decays on empirical signal-to-noise or weight norm bounce (Dinkel et al., 2024, Lewkowycz, 2021).
- For robust hyperparameter search, use polynomial or cosine annealing to reduce tuning sensitivity, or employ schedules whose regret depends only weakly on the tuned $\eta_0$ (Attia et al., 12 Mar 2025).
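As an illustration of signal-triggered decay, the sketch below drops the LR whenever the weight norm stops growing and "bounces", loosely in the spirit of ABEL; the published rule is more involved (it interacts with weight decay and includes a final decay), so treat this as a schematic with invented names:

```python
class NormBounceDecay:
    """Decay the LR by `drop` each time the parameter norm peaks and
    turns downward (a 'bounce'), rather than on a fixed timetable."""
    def __init__(self, eta0, drop=0.1):
        self.eta, self.drop = eta0, drop
        self.prev_norm, self.growing = None, True

    def step(self, param_norm):
        if self.prev_norm is not None:
            rising = param_norm > self.prev_norm
            if self.growing and not rising:   # norm just peaked: bounce
                self.eta *= self.drop
            self.growing = rising
        self.prev_norm = param_norm
        return self.eta

sched = NormBounceDecay(eta0=0.1)
print([sched.step(n) for n in [1.0, 2.0, 3.0, 2.0, 2.5]])
# the rate drops only at the 3.0 -> 2.0 bounce
```

The controller is fed the current parameter norm once per epoch; no horizon $T$ or decay shape needs to be chosen in advance.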
7. Limitations and Open Directions
Current research recognizes several intrinsic and practical caveats:
- All schedules depend on proper estimation or transfer of the initial learning rate $\eta_0$; underestimation may negate theoretical optimality.
- The optimal decay schedule is acutely sensitive to feature spectrum and the “phase” of the optimization problem; misidentification may yield sub-optimal rates even if the schedule shape is flexible (Li et al., 6 Feb 2026).
- Many theoretically optimal schedules require knowledge of the total training horizon $T$; “anytime” or horizon-free policies using weight/model averaging offer a consistent, horizon-agnostic substitute (Meterez et al., 3 Feb 2026).
- True optimality in distribution-shifted or adversarial regimes requires feedback-driven or meta-learning approaches and escapes the reach of monotonic decay formulas (Fahrbach et al., 2023, Donâncio et al., 2024).
- In some contexts such as curriculum learning, co-designing LRS and data ordering is necessary to avoid negating the potential gains of curricula (Luo et al., 24 Nov 2025).
- Joint optimization of learning rate and batch size, and the extension to momentum, further improves wall-clock efficiency and convergence, but depends on explicit solution of coupled optimal-control equations (Bordelon et al., 4 Feb 2026).
The problem of universally optimal decay schedule selection thus remains open-ended, but contemporary theoretical and empirical research delineates a precise and actionable menu of options for nearly all regimes encountered in modern large-scale and nonconvex optimization.