
WSD Learning Rate Scheduling

Updated 31 January 2026
  • WSD learning rate scheduling is a three-phase approach integrating warmup, stable plateau, and decay for effective model training.
  • It combines robust theoretical optimization bounds with practical insights from loss landscape analysis and scaling laws.
  • Practical variants and tuning guidelines of WSD enhance convergence and data efficiency in transformers and large language models.

Warmup–Stable–Decay (WSD) learning rate scheduling is a three-phase learning rate schedule that has become a de facto standard for large-scale model training, particularly for transformers and LLMs. The WSD paradigm offers strong theoretical justification, empirical flexibility, and efficient optimization trajectories, and has been validated across architectures and application domains. It unifies initial stabilization (“warmup”), efficient exploration (“stable plateau”), and accelerated final convergence (“decay”), making it central to current best practices for model pretraining and fine-tuning.

1. Mathematical Formulation and Phases

The WSD learning rate schedule is defined by a piecewise function over training iterations $t \in [0, T]$, split into three contiguous phases: warmup, stable (plateau), and decay. For total step count $T$, warmup steps $T_w$, stable duration $T_s$, decay steps $T_d$ (with $T = T_w + T_s + T_d$), and peak learning rate $\eta_{\max}$, the schedule is:

\eta(t) = \begin{cases} \eta_{\max} \cdot \frac{t}{T_w} & 0 \leq t \leq T_w \quad \text{(warmup)} \\ \eta_{\max} & T_w < t \leq T_w + T_s \quad \text{(stable/plateau)} \\ \eta_{\max} \cdot \left(1 - \frac{t - (T_w + T_s)}{T_d}\right) & T_w + T_s < t \leq T \quad \text{(decay)} \end{cases}

Variants extend this basic scheme with different decay shapes (e.g., linear, cosine, "sqrt," lowered-linear), polynomial or exponential forms, and an adjustable minimum endpoint. The standard practice is to choose a stable fraction $\beta = T_s/T$ in the range $0.6$–$0.8$, a decay fraction of roughly $1-\beta$, and a brief linear warmup ($0.5$–$2\%$ of $T$) for early stabilization (Belloni et al., 13 Jan 2026, Dremov et al., 2 Aug 2025, Schaipp et al., 31 Jan 2025, Wen et al., 2024, Shen et al., 2024, Luo et al., 17 Mar 2025). This parameterization is robust to architecture, data, and model size.
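As a concrete illustration, the piecewise schedule above can be implemented in a few lines (a minimal sketch with linear decay; the step counts and peak rate in the example are placeholders, not recommendations from the cited papers):

```python
def wsd_lr(t, T, eta_max, warmup_frac=0.02, stable_frac=0.7):
    """Warmup-Stable-Decay learning rate at step t of a T-step run (linear decay)."""
    T_w = int(warmup_frac * T)            # brief linear warmup
    T_s = int(stable_frac * T)            # long constant plateau
    T_d = T - T_w - T_s                   # remaining steps decay to zero
    if t <= T_w:
        return eta_max * t / max(T_w, 1)  # warmup: 0 -> eta_max
    if t <= T_w + T_s:
        return eta_max                    # stable: constant at eta_max
    return eta_max * (1.0 - (t - T_w - T_s) / T_d)  # linear decay to 0

# Example: a 10k-step run peaks at 3e-4 during the plateau
print(wsd_lr(5_000, 10_000, 3e-4))  # -> 0.0003
```

The three branches map one-to-one onto the cases of the piecewise definition, so any decay-shape variant only changes the final branch.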

2. Theoretical Foundations and Optimization Bounds

WSD scheduling is grounded in both convex and non-convex optimization theory. For the convex Lipschitz case ($\|g_t\|^2 \leq G^2$), the final expected suboptimality of an arbitrary schedule $\{\eta_1, \ldots, \eta_T\}$ at base step size $\gamma > 0$ admits the bound (Schaipp et al., 31 Jan 2025):

\mathbb{E}[f(x_T) - f(x^*)] \leq \Omega_T(\gamma) := T_1 / \gamma + \gamma \cdot T_2

with

T_1 = \frac{D^2}{2 \sum_{t=1}^T \eta_t} \quad ; \quad T_2 = \frac{1}{2}\left[ \frac{\sum_{t=1}^T \eta_t^2 G^2}{\sum_{t=1}^T \eta_t} + \sum_{k=1}^{T-1} \frac{\eta_k}{\sum_{t=k+1}^T \eta_t} \cdot \frac{\sum_{t=k}^T \eta_t^2 G^2}{\sum_{t=k}^T \eta_t} \right]

where $D = \|x_1 - x^*\|$. The optimal base rate $\gamma^* = \sqrt{T_1/T_2}$ gives

\mathbb{E}[f(x_T) - f(x^*)] \leq 2\sqrt{T_1 T_2}

For the WSD schedule with stable fraction $\beta = T_s/T$ and linear decay, the bound refines to:

\mathbb{E}[f(x_T) - f(x^*)] \lesssim \frac{DG}{\sqrt{T}} \cdot \frac{1}{\sqrt{(4/(1+\beta)) \cdot (5/3 + \ln((1+\beta)/(1-\beta)) + O(1/T))}}

Notably, the logarithmic terms in $T$ that drive slow convergence of vanilla constant or cosine schedules are eliminated whenever any linear decay is present ($\beta < 1$), underscoring the efficiency of WSD.
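Under these convex assumptions, $T_1$, $T_2$, and $\gamma^* = \sqrt{T_1/T_2}$ can be evaluated numerically for any concrete schedule. The sketch below uses illustrative values $D = G = 1$ and a small WSD shape with unit peak rate; it is a didactic check of the bound's structure, not a reproduction of the cited analysis:

```python
import math

def bound_terms(etas, D=1.0, G=1.0):
    """T1 and T2 from the last-iterate suboptimality bound for a schedule {eta_t}."""
    S = sum(etas)
    T1 = D**2 / (2 * S)
    T2 = 0.5 * sum(e**2 for e in etas) * G**2 / S
    for k in range(len(etas) - 1):           # 0-indexed k corresponds to k = 1..T-1
        tail = etas[k:]                      # sums over t = k..T
        T2 += 0.5 * (etas[k] / sum(etas[k + 1:])) * sum(e**2 for e in tail) * G**2 / sum(tail)
    return T1, T2

# WSD shape: 2% warmup, 78% plateau, 20% linear decay (ending just above zero)
Tw, Ts, Td = 20, 780, 200
etas = [(t + 1) / Tw for t in range(Tw)] + [1.0] * Ts + [(Td - i) / Td for i in range(Td)]

T1, T2 = bound_terms(etas)
gamma_star = math.sqrt(T1 / T2)          # optimal base step size
final_bound = 2 * math.sqrt(T1 * T2)     # E[f(x_T) - f(x*)] <= 2*sqrt(T1*T2)
```

Scaling every $\eta_t$ by $\gamma^*$ then realizes the optimal bound $2\sqrt{T_1 T_2}$ for that schedule shape.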

3. Empirical Loss Landscape and Training Dynamics

WSD schedules induce distinctive training curves characterized by flat loss during the plateau and a sharp drop during decay. The “river-valley” loss landscape framework (Wen et al., 2024, Belloni et al., 13 Jan 2026, Dremov et al., 2 Aug 2025) describes two orthogonal parameter directions:

  • River direction: flat manifold where high constant learning rate drives noisy but rapid progress.
  • Mountain direction: sharp curvature, where oscillations dominate during plateau, suppressed only as learning rate decays.

During warmup, SGD equilibrates the sharp directions, avoiding gradient instability. The plateau maintains exploration along the flat manifold, but stochastic oscillations mask visible loss improvement. The decay phase suppresses these oscillations, revealing the accumulated progress: the loss curve drops steeply. This phenomenon is universal across transformers and CNNs and is explained by quasi-convexity of the iterates and progressive sharpening of local minima during decay (Belloni et al., 13 Jan 2026, Wen et al., 2024).

4. Decay Shape, Bias-Variance Trade-off, and Hyperparameter Effects

Numerous decay kernels have been empirically compared (linear, cosine, "sqrt," lowered-linear, power-law) (Dremov et al., 2 Aug 2025, Luo et al., 17 Mar 2025):

\begin{array}{ll} \text{Linear:} & \eta(x) = \eta_{\max}(1-x) \\ \text{Cosine:} & \eta(x) = \frac{1}{2}\eta_{\max}(1+\cos(\pi x)) \\ \text{Mirror-cosine:} & \eta(x) = \eta_{\max}\left[2(1-x)-(1+\cos(\pi x))/2\right] \\ \text{Square:} & \eta(x) = \eta_{\max}(1-x^2) \\ \text{Sqrt:} & \eta(x) = \eta_{\max}(1-\sqrt{x}) \\ \text{Lowered-linear:} & \eta(x) = \eta_{\max}(1-\alpha x), \;\alpha\in[0,1] \end{array}

where $x \in [0, 1]$ is the normalized decay fraction.

Balanced decay shapes, notably "sqrt" and lowered-linear with $\alpha \approx 0.7$, yield minimizers on the bias–variance frontier, typically improving validation perplexity by $0.2$–$0.3$ points and downstream metrics by $0.2$–$0.5\%$ relative to standard linear or mirror-cosine decays (Dremov et al., 2 Aug 2025). Tuning optimizer hyperparameters, especially raising AdamW's $\beta_2$ towards $0.98$–$0.995$ during decay, matches the effect size of optimal decay-shape selection. Weight decay and batch size effects are less pronounced in the decay regime.
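For reference, the decay kernels above can be written as multipliers of $\eta_{\max}$ over normalized decay progress $x$ (a direct transcription of the formulas, with $\eta_{\max} = 1$ so the endpoint values are easy to spot-check):

```python
import math

# Decay shapes as multipliers of eta_max over normalized decay progress x in [0, 1]
DECAY_SHAPES = {
    "linear":             lambda x: 1 - x,
    "cosine":             lambda x: 0.5 * (1 + math.cos(math.pi * x)),
    "mirror_cosine":      lambda x: 2 * (1 - x) - 0.5 * (1 + math.cos(math.pi * x)),
    "square":             lambda x: 1 - x**2,
    "sqrt":               lambda x: 1 - math.sqrt(x),
    "lowered_linear_0.7": lambda x: 1 - 0.7 * x,   # alpha = 0.7, nonzero endpoint
}

for name, f in DECAY_SHAPES.items():
    print(f"{name:>20}: start={f(0.0):.2f}, mid={f(0.5):.3f}, end={f(1.0):.2f}")
```

All kernels start at the full plateau rate ($x = 0$); lowered-linear is the one with an adjustable nonzero floor, ending at $1 - \alpha$.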

5. Scaling Laws, Data Efficiency, and Batch Size Optimization

Recent advances incorporate scaling-law analysis of WSD schedules using functional scaling laws (FSL) and $E(S)$ data-efficiency relationships (Zhou et al., 8 Jan 2026, Li et al., 23 Sep 2025). In these frameworks, the WSD schedule yields a three-regime $E(S)$ curve for pretraining efficiency:

  • Inverse-linear regime for small step counts: $E(S) \sim 1/(S - S_{\min})$
  • Quadratic transition regime near the global minimum
  • Linear regime at large batch sizes: $E(S) \sim B_{\min} S$, with explicit $B_{\min}$, $B_{opt}$ values

By structuring training with long stable plateaus and short decays ($\sim$10–25%), WSD eliminates logarithmic slowdowns, maximizes the "full-batch" decay term, and empirically achieves lower final validation risk per token than comparable cosine or direct exponential schedules (Li et al., 23 Sep 2025, Zhou et al., 8 Jan 2026). Dynamic batch size scheduling based on fitted $B_{opt}(D)$ further improves both convergence speed and downstream accuracy on LLM benchmarks.
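A dynamic batch-size schedule of this kind can be sketched as a fitted $B_{opt}(D)$ power law evaluated on the tokens seen so far. The functional form, coefficients, and rounding rule below are illustrative placeholders, not fitted values from the cited papers:

```python
def batch_size_schedule(tokens_seen, B0=256, D0=1e9, alpha=0.4, multiple=64):
    """Hypothetical fitted B_opt(D) power law; B0, D0, alpha are placeholders."""
    B = B0 * (max(tokens_seen, 1.0) / D0) ** alpha       # grow batch as data accumulates
    return max(multiple, int(B // multiple) * multiple)  # round to a hardware-friendly multiple
```

The key property is monotone growth: the optimal batch size rises as training progresses, so the schedule ramps the batch up rather than holding it fixed.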

6. Advanced Variants and Checkpoint-Averaged Alternatives

Variants of WSD have emerged to further optimize or simplify training. The WSD-S recipe (Wen et al., 2024) recycles decay progress for subsequent compute budgets, enabling one-shot generation of checkpoints for arbitrary corpus volumes. The WSM framework (Tian et al., 23 Jul 2025) dispenses with online learning rate decay, instead merging several recent checkpoints with theoretically derived weights to offline-synthesize effectively decayed solutions. WSM has demonstrated consistent, robust improvements (+3.5% MATH, +2.9% HumanEval, +5.5% MMLU-Pro) over classical WSD across multiple pretraining and fine-tuning tasks.
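The checkpoint-merging idea behind WSM can be sketched as a weighted average of recent plateau checkpoints. This is a simplified illustration over plain nested lists with a uniform-weight fallback; the actual merging weights are derived theoretically in Tian et al., 23 Jul 2025:

```python
def merge_checkpoints(checkpoints, weights=None):
    """Offline-synthesize an effectively 'decayed' model by weighted-averaging
    recent checkpoints. Each checkpoint maps parameter name -> list of floats."""
    n = len(checkpoints)
    weights = weights or [1.0 / n] * n   # uniform fallback for illustration
    merged = {}
    for name in checkpoints[0]:
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(checkpoints[0][name]))
        ]
    return merged

ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(merge_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

Because the merge is offline, training can continue at the plateau rate while "decayed" models are produced on demand for any checkpoint window.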

7. Practical Guidelines and Open Research Directions

Practical recommendations for WSD scheduling include:

  • Warmup: typically 1–2% of total steps; linear ramp to stabilize gradients
  • Stable plateau: 60–80% of training; maximal exploration and noise injection
  • Decay: 10–25% of training; use sqrt or lowered-linear shapes for the optimal bias–variance trade-off, and tune $\beta_2$ to at least $0.98$ during decay
  • Batch size: optimize using $B_{opt}$ for global data efficiency, scheduled dynamically as training progresses
  • LR transfer: empirically, $\gamma^*(\beta)$ scale relations enable zero-shot transfer of optimal rates across decay-schedule variants (Schaipp et al., 31 Jan 2025, Shen et al., 2024)
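Putting the guidelines together, a schedule generator with mid-range fractions and the sqrt decay shape might look as follows (a sketch; the default fractions are picks from the recommended intervals above, not tuned values):

```python
import math

def wsd_schedule(T, eta_max, warmup_frac=0.015, decay_frac=0.2):
    """Per-step learning rates: linear warmup, constant plateau, sqrt decay to ~0."""
    Tw = max(1, int(warmup_frac * T))     # ~1.5% warmup
    Td = max(1, int(decay_frac * T))      # ~20% decay
    Ts = T - Tw - Td                      # plateau takes the remaining ~78%
    lrs = []
    for t in range(T):
        if t < Tw:
            lrs.append(eta_max * (t + 1) / Tw)        # warmup ramp
        elif t < Tw + Ts:
            lrs.append(eta_max)                        # stable plateau
        else:
            x = (t - Tw - Ts + 1) / Td                 # normalized decay fraction
            lrs.append(eta_max * (1 - math.sqrt(x)))   # sqrt decay shape
    return lrs

lrs = wsd_schedule(1_000, 3e-4)
```

The returned list can be consumed by any per-step LR setter (e.g., wrapped in a lambda-based scheduler); remember to also raise AdamW's $\beta_2$ once the decay branch is entered, per the guideline above.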

Open questions concern the universality of WSD geometry beyond AdamW and SGD, the optimality of multi-stage or nonzero-floor decays, theoretical interpolation between convex bounds and non-convex trajectories, and explicit dependence on overparameterization and data heterogeneity (Belloni et al., 13 Jan 2026, Wen et al., 2024).

Summary Table: Canonical WSD Settings and Effects

Phase | Fraction of total steps | Typical setting
Warmup | 1–2% | Linear ramp $0 \rightarrow \eta_{\max}$
Stable plateau | 60–80% | $\eta = \eta_{\max}$
Decay | 10–25% | Sqrt/lowered-linear shape to $\sim 0$

WSD learning rate scheduling synthesizes rigorous optimization theory, nuanced loss landscape geometry, and sample-efficient scaling law insights, and now anchors large-model training at scale. The approach maximizes progress along flat directions while suppressing sharp-mode oscillations at the critical final phase, yielding state-of-the-art convergence, reduced validation risk, and high transferability across architectures (Schaipp et al., 31 Jan 2025, Dremov et al., 2 Aug 2025, Belloni et al., 13 Jan 2026, Wen et al., 2024, Li et al., 23 Sep 2025, Tian et al., 23 Jul 2025, Luo et al., 17 Mar 2025, Shen et al., 2024, Zhou et al., 8 Jan 2026).
