
Optimal LR Decay Schedule

Updated 18 February 2026
  • Optimal learning rate decay schedule is a strategy that adapts the learning rate in gradient-based optimizers to balance convergence speed and loss minimization.
  • It encompasses explicit decay functions and adaptive, meta-learned algorithms, tuned to problem convexity, nonstationarity, and resource constraints.
  • Empirical evaluations show that schedules like linear, polynomial, cosine, and warmup-stable-decay can be optimally adjusted using data-dependent and theoretical insights.

An optimal learning rate decay schedule is a strategy that dynamically adapts the learning rate of gradient-based optimizers to maximize convergence speed and minimize training loss, under constraints determined by the statistical structure of the learning task and optimization dynamics. This encompasses both explicit decay functions and adaptive, feedback-driven, or meta-learned scheduling algorithms. Optimal schedules are highly context-dependent, varying by problem convexity, nonstationarity, regime (e.g., data-limited vs. compute-limited), and model spectrum. Recent advances include theoretically-justified closed-form schedules, empirical procedures adapted to non-convex and non-stationary problems, adaptive, dynamic, and meta-learning-driven approaches, and decay-free alternatives based on checkpoint/model averaging.

1. Theoretical Underpinnings of Optimal Learning Rate Decay

Early work on stochastic optimization established that for convex, Lipschitz losses, a learning rate of the form $\eta_t = a/(b+t)$ or, more generally, a polynomial decay $\eta_t = \eta_0 t^{-\beta}$ ($\beta \in [0,1]$) balances the bias–variance tradeoff and ensures convergence, with minimax optimality for $\beta = 1$ in convex settings. In high-dimensional, nonconvex landscapes, this paradigm fails: saddle proliferation and roughness require a slower decay, or even an initial phase of constant learning rate, to facilitate exploration (d'Ascoli et al., 2022). Analytical results in random landscape models establish that the optimal decay exponent is $\beta^* < 1$, typically $\beta^* = 1/2$ for quadratic loss; in multi-phase models, a two-phase schedule (constant or slowly decaying during exploration, sharp decay upon entering a convex basin) is universally superior to monotonic decay.
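
These two families can be sketched in a few lines; the function names and default hyperparameters below are illustrative, not taken from the cited papers:

```python
def poly_decay(t, eta0=0.1, beta=0.5):
    """Polynomial decay eta_t = eta0 * (t+1)^(-beta); beta = 1/2 is the
    optimal exponent for quadratic losses in rough, high-dimensional
    landscapes (rather than the convex-optimal beta = 1)."""
    return eta0 * (t + 1) ** (-beta)

def two_phase_decay(t, t_switch, eta0=0.1, beta=1.0):
    """Two-phase schedule: constant LR during the exploration phase,
    then sharp polynomial decay once the iterate enters a convex basin."""
    if t < t_switch:
        return eta0
    return eta0 * (t - t_switch + 1) ** (-beta)
```

In practice `t_switch` would be chosen from a loss-plateau or gradient-norm signal rather than fixed in advance.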

The Functional Scaling Law (FSL) framework has recently unified these phenomena by connecting optimization dynamics to the capacity ($\beta$) and source ($s$) exponents of the target/model class. The FSL analysis reveals two regimes (Li et al., 23 Sep 2025, Li et al., 6 Feb 2026, Bordelon et al., 4 Feb 2026):

  • Easy regime ($s \ge 1 - 1/\beta$): The optimal schedule is a power decay, $\eta^*(z) \propto (1 - z/N)^{2\beta-1}$, with an appropriately shrinking peak rate.
  • Hard regime ($s < 1 - 1/\beta$): The optimal schedule is a "warmup-stable-decay" (WSD): hold the maximum stable learning rate for nearly the entire training run, then decay rapidly to zero over a vanishingly small fraction at the end.

This variational-calculus approach yields power-law optimality under appropriate spectral conditions and matches the rates implied by kernel regression lower bounds. In all cases, tuning the decay schedule to data-dependent exponents is critical for optimality (Li et al., 6 Feb 2026, Bordelon et al., 4 Feb 2026).
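
For the easy regime, the FSL-optimal power schedule can be written directly from the formula above (a sketch; $z$ is training progress in $[0, N]$ and the naming is ours):

```python
def fsl_power_schedule(z, N, eta_peak, beta):
    """Easy-regime FSL schedule: eta(z) proportional to (1 - z/N)^(2*beta - 1),
    scaled by an (appropriately shrunk) peak rate eta_peak."""
    return eta_peak * (1.0 - z / N) ** (2 * beta - 1)
```

Note that at $\beta = 1/2$ the exponent vanishes and the schedule degenerates to a constant, consistent with the hard-regime WSD limit of holding a flat rate for most of training.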

2. Canonical Schedules: Linear, Power, Step, and Cosine Decay

Standard learning rate decay schedules include:

  • Linear decay: $\eta_t = \eta_0(1 - t/T)$; proven minimax optimal for the final iterate in convex settings, and typically matches or outperforms cosine in deep learning experiments (Defazio et al., 2023). Empirically, linearly decaying to zero outperforms cosine-to-10% or fixed decay in LLMs at compute-optimal scales (Bergsma et al., 21 Feb 2025).
  • Polynomial decay: $\eta_t = \eta_0 (1 - t/T)^p$; higher $p$ increases robustness to misspecification, with $O(\rho^{1/(2p+1)}/\sqrt{T})$ tuning-robust convergence (Attia et al., 12 Mar 2025).
  • Cosine decay: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})[1 + \cos(\pi t/T)]$; effective, especially when further modified, but not minimax optimal in all settings.
  • Step decay: Decay the learning rate by a constant factor every fixed number of epochs; shown to improve final-iterate minimax bounds over polynomial decay for least squares, albeit at the cost of extra $\log T$ terms (Ge et al., 2019).

Advanced forms, including REX (Reflected Exponential) and k-decay, modulate the end-of-training decay shape to tune the tradeoff between rapid initial descent and aggressive late-stage error reduction, showing improved empirical performance across diverse tasks (Chen et al., 2021, Zhang et al., 2020).
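The four canonical schedules above can be sketched as simple functions of step $t$ and horizon $T$ (hyperparameter defaults are illustrative):

```python
import math

def linear_decay(t, T, eta0):
    """Linear decay to zero: eta_t = eta0 * (1 - t/T)."""
    return eta0 * (1 - t / T)

def polynomial_decay(t, T, eta0, p=2.0):
    """Polynomial decay eta_t = eta0 * (1 - t/T)^p; larger p is more
    robust to misspecification of eta0."""
    return eta0 * (1 - t / T) ** p

def cosine_decay(t, T, eta_max, eta_min=0.0):
    """Cosine annealing from eta_max down to eta_min."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

def step_decay(t, eta0, factor=0.1, every=30):
    """Step decay: multiply the LR by `factor` every `every` steps."""
    return eta0 * factor ** (t // every)
```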

| Schedule | Optimal Regime | Theoretical Guarantees |
| --- | --- | --- |
| Linear | Convex; many deep nets; LLMs (at scale) | Minimax for last iterate |
| Power decay | Nonconvex, high-dim; "easy" FSL regime | Matches FSL optimum for $s \ge 1 - 1/\beta$ |
| WSD | Nonconvex; "hard" FSL regime; LLM scaling | Optimal for $s < 1 - 1/\beta$ |
| Step decay | Least squares; eigen-spectrum separation | Minimax up to $\log T$ factors |
| Cosine | LLMs, vision (empirical); fat eigen-spectrum | Near-optimal in practice |
| REX, k-decay | Budget-uncertain; vision; finetuning | Top-1 in budget-uncertain settings |

3. Data-Dependent and Adaptive Schedules

Adaptive and meta-learning schedules have emerged to address the nonstationarity and unpredictability of complex loss landscapes:

  • Dynamic bandit-based scheduling (LRRL): Casts learning-rate selection as a multi-armed bandit problem in which each arm corresponds to a candidate LR and rewards correspond to improvements in cumulative return (in RL) over recent meta-steps. Arm weights are maintained and exponentially updated, with a softmax normalization to preserve exploration. The policy adapts by observing recent performance, shifting toward LRs that yield the steepest return gains. This technique outperforms well-tuned fixed and exponentially decaying baselines in nonstationary deep RL (Donâncio et al., 2024).
  • Gradient-norm adaptive refinement: Schedules that reweight the LR trajectory by empirical inverse-squared gradient norms, smoothed over recent windows, recover algorithmic warmup and rapid late decay, yielding an additional 1–5% improvement over the best fixed polynomial/linear schedules (Defazio et al., 2023).
  • GreedyLR: A loss-driven multiplicative adjustment in which $\eta_t$ is multiplied by a factor $F > 1$ when the loss decreases and divided by $F$ otherwise; closed-form analysis shows the optimal $F^*$ recovers the optimal $1/L_{\max}$ constant rate. Robust and adaptive, it outperforms cosine/exponential schedules in convex and large-scale LLM settings (Subramanian et al., 16 Dec 2025).
  • Meta-criteria based schedules: Approaches that monitor signals such as the weight norm or variational parameter trajectory SNR to trigger decay (ABEL, DLRD) have shown optimal timing of decay events in DNNs and stochastic variational inference (Lewkowycz, 2021, Dinkel et al., 2024).
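
As one concrete example, a GreedyLR-style update rule fits in a few lines (a sketch; the factor $F$ and its direction follow the description above, and the names are ours):

```python
def greedy_lr_update(eta, loss, prev_loss, F=1.05):
    """GreedyLR-style multiplicative adjustment: grow the LR when the
    loss improved relative to the previous step, shrink it otherwise
    (adjustment factor F > 1)."""
    return eta * F if loss < prev_loss else eta / F
```

In practice the loss signal would be smoothed and the LR clipped to a safe range before use.
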
| Method | Adaptivity Signal | Optimality Mode |
| --- | --- | --- |
| LRRL | Bandit over RL returns | Nonstationary RL regimes |
| Gradient-norm | $\|g_t\|$ or per-dimension norms | Any task; uses observed loss geometry |
| GreedyLR | Instantaneous loss $\ell_t$ | Proven $O(1/T)$ and robust |
| ABEL | Weight-norm bounce | DNN + weight decay; matches tuned step |
| DLRD | Empirical SNR (SVI) | Plateau-aware; SVI |

4. Theory-Guided Schedules under Scaling Laws and Practical Constraints

The FSL analysis for LLM and kernel regression training yields closed-form, resource-constrained formulas for peak rate and decay:

  • In data-limited regimes, set $\eta_{\mathrm{peak}} \sim D^{-s/(s+1)}$ (with $D$ = tokens or samples), with decay shape dictated by the spectrum and source exponent (Li et al., 23 Sep 2025).
  • In compute-limited settings, optimal splitting yields capacity scaling $M \sim C^{1/(a+1)}$ and data scaling $D \sim C^{a/(a+1)}$ (with compute $C = MD$) (Bordelon et al., 4 Feb 2026).
  • WSD-like schedules: Hold a constant maximal LR for up to 90% of training, then decay to zero over a final $O(1/\log D)$ fraction; empirically superior in LLMs and large-scale kernel regression (Li et al., 6 Feb 2026, Li et al., 23 Sep 2025).
  • Merge-based/decay-free schedules (WSM): Replace decay by merging the last $k$ checkpoints (with weights derived to mimic classical decay) post hoc, matching or surpassing the efficiency of tuned cosine or linear-decay schedules and allowing “anytime” training without restarts or pre-specified horizons (Tian et al., 23 Jul 2025, Meterez et al., 3 Feb 2026, Luo et al., 24 Nov 2025).
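
A WSD schedule with a short warmup, long plateau, and rapid terminal decay can be sketched as follows (the warmup and decay fractions are illustrative, not prescribed by the cited work):

```python
def wsd_schedule(t, T, eta_max, warmup_frac=0.01, decay_frac=0.05):
    """Warmup-stable-decay: linear warmup, constant plateau at eta_max
    for most of training, then rapid linear decay to zero at the end."""
    t_warm = warmup_frac * T
    t_decay = decay_frac * T
    if t < t_warm:
        return eta_max * t / t_warm        # linear warmup
    if t < T - t_decay:
        return eta_max                     # stable plateau
    return eta_max * (T - t) / t_decay     # terminal decay to zero
```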

These predictions have been validated up to 1–7B parameter LLMs, with D2Z linear decay, WSD, and merge-based/EMA-based approaches all surpassing classical warmup–cosine–to–10% scheduling under Chinchilla-scaling constraints (Bergsma et al., 21 Feb 2025, Meterez et al., 3 Feb 2026).
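
Under the compute-limited scaling stated above, the split of compute $C = MD$ follows directly (a sketch; $a$ denotes the capacity exponent from the cited analysis):

```python
def compute_optimal_split(C, a):
    """Split total compute C = M * D into model capacity M ~ C^(1/(a+1))
    and data D ~ C^(a/(a+1)), per the compute-limited scaling above."""
    M = C ** (1 / (a + 1))
    D = C ** (a / (a + 1))
    return M, D
```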

5. Non-Idealities: Distribution Shift, Curriculum Training, and Robustness

Learning rate schedule optimality is context-sensitive:

  • Distribution shift: When data distributions drift, optimal policies require increasing (not only decaying) the learning rate at times of high shift (“catch up” phenomenon). Explicit closed-form adaptive schedules are derived for SGD in shifting linear, convex, and nonconvex losses, and tested in neural network online learning (Fahrbach et al., 2023).
  • Curriculum learning: In LLM curriculum-based pretraining, severe LR decay before high-quality data arrive negates the benefit of data ordering. Moderating the final decay, using a nonzero final LR (e.g., $\eta_T = \frac{1}{3}\eta_{\mathrm{peak}}$), or combining with checkpoint/model averaging is required to extract the curriculum signal (Luo et al., 24 Nov 2025).
  • Tuning robustness: Polynomial decay, cosine decay, and WSD all exhibit milder degradation under misspecification of $\eta_0$ than a fixed LR, owing to the sublinear dependence of the final regret on grid-spaced tuning factors (Attia et al., 12 Mar 2025).

6. Empirical Guidelines and Implementation

Research over the last five years yields the following practical protocols for schedule selection and tuning:

  • At compute-optimal LLM scale, prefer linear decay to zero (D2Z) or WSD over warmup–cosine–to–10% baselines (Bergsma et al., 21 Feb 2025).
  • When the training horizon is unknown or may be extended, use merge-based/decay-free (WSM) or EMA-style checkpoint averaging rather than a pre-committed decay (Tian et al., 23 Jul 2025, Meterez et al., 3 Feb 2026).
  • Tune the peak learning rate first; polynomial, cosine, and WSD shapes degrade gracefully under misspecification of $\eta_0$ (Attia et al., 12 Mar 2025).
  • In curriculum-based pretraining, keep a nonzero final LR (e.g., $\eta_T = \frac{1}{3}\eta_{\mathrm{peak}}$) or combine with checkpoint averaging so late, high-quality data still contributes (Luo et al., 24 Nov 2025).
  • In nonstationary or shifting regimes, allow the LR to increase as well as decrease, via bandit-based or shift-aware adaptive rules (Fahrbach et al., 2023, Donâncio et al., 2024).

7. Limitations and Open Directions

Current research recognizes several intrinsic and practical caveats:

  • All schedules depend on proper estimation or transfer of the initial learning rate $\eta_0$; underestimation may negate theoretical optimality.
  • The optimal decay schedule is acutely sensitive to feature spectrum and the “phase” of the optimization problem; misidentification may yield sub-optimal rates even if the schedule shape is flexible (Li et al., 6 Feb 2026).
  • Many theoretically optimal schedules require knowledge of the total training horizon $T$; “anytime” or horizon-free policies using weight/model averaging offer a consistent, horizon-agnostic substitute (Meterez et al., 3 Feb 2026).
  • True optimality in distribution-shifted or adversarial regimes requires feedback-driven or meta-learning approaches and escapes the reach of monotonic decay formulas (Fahrbach et al., 2023, Donâncio et al., 2024).
  • In some contexts such as curriculum learning, co-designing LRS and data ordering is necessary to avoid negating the potential gains of curricula (Luo et al., 24 Nov 2025).
  • Joint optimization of learning rate and batch size, and the extension to momentum, further improves wall-clock efficiency and convergence, but depends on explicit solution of coupled optimal-control equations (Bordelon et al., 4 Feb 2026).

The problem of universally optimal decay schedule selection thus remains open-ended, but contemporary theoretical and empirical research delineates a precise and actionable menu of options for nearly all regimes encountered in modern large-scale and nonconvex optimization.
