FG-WSD: Fine-Grained Warmup–Stable–Decay
- FG-WSD is a pretraining scheduler that introduces a phased increase in high-quality data during the stable plateau to enhance model reasoning.
- It decouples abrupt learning rate changes from data mixture shifts, ensuring smoother optimization and mitigating training instabilities.
- Ablations on a 1 B-parameter model, reported alongside Nanbeige4-3B, show gains of up to +7.2 points over vanilla WSD (on GSM8K), underscoring its effectiveness.
Fine-Grained Warmup–Stable–Decay (FG-WSD) is a pretraining scheduler for LLMs that jointly modulates learning rate and training data mixture. FG-WSD introduces a fine-grained design in the stable (plateau) regime, progressively increasing the proportion of high-quality (HQ) data over multiple sub-phases, and maintains a clear decoupling between data-mixture transitions and learning-rate changes. This approach was introduced in the pretraining procedure of the Nanbeige4-3B family of LLMs, where it demonstrated significant gains over conventional schedulers, particularly on reasoning-heavy tasks (Yang et al., 6 Dec 2025).
1. Motivation and Theoretical Rationale
FG-WSD addresses two limitations of existing learning-rate and data-mixing schedules. First, although a standard Warmup–Stable–Decay (WSD) learning-rate profile empirically outperforms Warmup–Cosine–Decay when using very high-quality datasets, it keeps the data mixture uniform during the stable plateau, which is suboptimal: the model is exposed to the same data-quality distribution (often 1:1 HQ vs. medium-quality) throughout this period. Second, changing the learning rate and the data distribution drastically at the same time can destabilize optimization. FG-WSD targets the first limitation by splitting the stable plateau into multiple sub-phases, each using a higher proportion of HQ data, and the second by decoupling learning-rate shifts from abrupt changes in data mixture.
This design enables an early-stage "exploration" period, when the model benefits from broader diversity, followed by a later "exploitation" phase that prioritizes only the best data.
2. Formal Description and Mathematical Framework
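Let $t$ denote the cumulative number of training tokens. FG-WSD divides training into four principal stages, parameterized by token horizons:
- $T_w$: warm-up horizon;
- $T_{s1}$: diversity-enriched stable phase;
- $T_{s2}$: HQ-only stable phase;
- $T_d$: decay horizon.
The maximum plateau learning rate is $\eta_{\max}$ and the minimum end-of-decay learning rate is $\eta_{\min}$. Schedules are as follows:
Learning-rate schedule:
- Warm-up, $0 \le t \le T_w$: $\eta(t) = \eta_{\max} \cdot t / T_w$ (linear warmup).
- Stable, $T_w < t \le T_w + T_{s1} + T_{s2}$: $\eta(t) = \eta_{\max}$ (constant).
- Decay, $T_w + T_{s1} + T_{s2} < t \le T_w + T_{s1} + T_{s2} + T_d$: $\eta(t) = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\bigl(1 + \cos(\pi \tau)\bigr)$, with $\tau = (t - T_w - T_{s1} - T_{s2}) / T_d$.
- Training may terminate beyond this or continue at $\eta_{\min}$.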
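The learning-rate schedule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `FGWSDConfig` and `learning_rate` are hypothetical, token counts are expressed in trillions, and the default horizons and learning rates are the values quoted in Sections 3 and 5 of this article.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FGWSDConfig:
    # Token horizons in trillions of tokens (values quoted in Section 3).
    t_w: float = 0.1     # warm-up
    t_s1: float = 12.4   # diversity-enriched stable phase
    t_s2: float = 6.5    # HQ-only stable phase
    t_d: float = 4.0     # decay
    # Learning rates as given in the article's pseudocode (Section 5).
    eta_max: float = 4.5e-4
    eta_min: float = 1.5e-6


def learning_rate(t: float, cfg: FGWSDConfig = FGWSDConfig()) -> float:
    """Learning rate at cumulative token count t (in trillions)."""
    stable_end = cfg.t_w + cfg.t_s1 + cfg.t_s2
    if t <= cfg.t_w:
        # Linear warm-up from 0 to eta_max.
        return cfg.eta_max * t / cfg.t_w
    if t <= stable_end:
        # Constant plateau across both stable sub-phases.
        return cfg.eta_max
    if t <= stable_end + cfg.t_d:
        # Cosine decay from eta_max down to eta_min.
        tau = (t - stable_end) / cfg.t_d
        return cfg.eta_min + 0.5 * (cfg.eta_max - cfg.eta_min) * (1 + math.cos(math.pi * tau))
    # Past the decay horizon: hold at eta_min.
    return cfg.eta_min
```

Note that the plateau branch does not distinguish the two stable sub-phases: the learning rate is identical in both, and only the data mixture differs.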
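Data-mixture schedule: Let $p_{\mathrm{HQ}}(t)$ denote the probability of drawing a sample from HQ data at time $t$:
- Warm-up, $0 \le t \le T_w$: $p_{\mathrm{HQ}}(t) = \alpha_{\mathrm{warmup}}$ (often close to the first stable phase's proportion).
- Diversity-enriched stable phase, $T_w < t \le T_w + T_{s1}$: $p_{\mathrm{HQ}}(t) = \alpha_1$ (e.g., 50% HQ, 50% medium-quality).
- HQ-only stable phase, $T_w + T_{s1} < t \le T_w + T_{s1} + T_{s2}$: $p_{\mathrm{HQ}}(t) = 1$ (HQ only).
- During decay, maintain $p_{\mathrm{HQ}}(t) = 1$.
The data mixture is piecewise constant, and its one major switch (to HQ-only data) occurs while the learning rate is held at its plateau value. The learning-rate decay then begins with the mixture unchanged, so the two do not shift at the same time.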
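The data-mixture schedule can be written as a small step function. The sketch below is illustrative only: `hq_probability` is a hypothetical name, token counts are in trillions, the horizons default to the values in Section 3, and the 0.5 proportions for warm-up and the diversity-enriched phase are assumptions based on the "e.g., 50% HQ" figure above.

```python
def hq_probability(
    t: float,
    t_w: float = 0.1,       # warm-up horizon (trillions of tokens)
    t_s1: float = 12.4,     # diversity-enriched stable phase
    p_warmup: float = 0.5,  # assumed: same as the diversity-enriched proportion
    p_mixed: float = 0.5,   # assumed: roughly 1:1 HQ vs. other data
) -> float:
    """Probability of drawing a sample from the HQ pool at token count t."""
    if t <= t_w:
        return p_warmup
    if t <= t_w + t_s1:
        return p_mixed
    # HQ-only stable phase and decay both sample exclusively from HQ data.
    return 1.0
```

Because the function returns 1.0 for every `t` past the diversity-enriched phase, the start of learning-rate decay (at 19.0 T with these defaults) leaves the mixture untouched.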
3. Hyperparameter Settings and Practical Instantiation
For Nanbeige4-3B-Base, FG-WSD was instantiated with the following parameters (token counts in trillions, abbreviated T):
| Stage | Tokens | Learning Rate Behavior |
|---|---|---|
| Warmup | 0.1 T | linear |
| Diversity-Enriched | 12.4 T | constant |
| HQ Stable | 6.5 T | constant |
| Decay | 4.0 T | decay from $\eta_{\max}$ to $\eta_{\min}$ |
- $T_w = 0.1$ T (warm-up)
- $T_{s1} = 12.4$ T (diversity-enriched)
- $T_{s2} = 6.5$ T (HQ stable)
- $T_d = 4.0$ T (decay)
- Data mixture: diversity-enriched phase uses roughly 50% HQ and 50% other filtered data; HQ-stable and decay phases use only HQ.
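As a quick arithmetic check on the stage lengths above, the following sketch computes the cumulative stage boundaries and each stage's share of the token budget. The names `STAGES`, `stage_boundaries`, and `stage_of` are hypothetical; lengths are stored in billions of tokens so that the sums are exact integers.

```python
from itertools import accumulate

# Stage lengths in billions of tokens (0.1 T, 12.4 T, 6.5 T, 4.0 T).
STAGES = {
    "warmup": 100,
    "diversity_enriched": 12_400,
    "hq_stable": 6_500,
    "decay": 4_000,
}


def stage_boundaries(stages):
    """Map each stage name to the cumulative token count at which it ends."""
    return dict(zip(stages, accumulate(stages.values())))


def stage_of(t, stages=STAGES):
    """Name of the stage that contains cumulative token count t (billions)."""
    for name, end in stage_boundaries(stages).items():
        if t <= end:
            return name
    return "finished"


boundaries = stage_boundaries(STAGES)
total = sum(STAGES.values())
for name, length in STAGES.items():
    share = 100 * length / total
    print(f"{name:>20}: ends at {boundaries[name] / 1000:.1f} T ({share:.1f}% of budget)")
```

The four stages end at 0.1 T, 12.5 T, 19.0 T, and 23.0 T respectively, so the horizons sum to a 23 T-token budget.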
Hyperparameters were derived from ablations on a 1 B-parameter model with a toy 1 T-token corpus: splitting the stable phase ($0.75$ T mixed medium/HQ, $0.25$ T HQ only) yielded gains of up to +7.2 points on GSM8K over vanilla WSD (see Section 6), motivating proportional scaling to larger runs.
4. Data-Mixture Schedule and Evolution
FG-WSD specifies the evolution of training data mixture alongside learning rate. The four main phases are:
- Warmup ($0.1$ T): Data mixture approximates the ensuing diversity-enriched phase.
- Diversity-Enriched Stable ($12.4$ T): Mini-batches sampled from the full corpus with approximately 50% HQ and 50% medium/filtered data.
- HQ Stable ($6.5$ T): Sampling exclusively from the HQ subset.
- Decay ($4.0$ T): Continued exclusive HQ sampling, with a decaying learning rate.
Earlier ablations compared vanilla WSD (uniform 1:1 shuffle of medium and high-quality data) to FG-WSD, confirming benefits from prolonging the mixed phase and carefully transitioning to HQ only.
5. Pseudocode and Training Workflow
The FG-WSD scheduler is implemented as follows:
```
initialize model parameters θ
set ηmax = 4.5e-4, ηmin = 1.5e-6
# Token horizons in tokens (1 T = 1e12 tokens)
set Tw = 0.1e12, Ts1 = 12.4e12, Ts2 = 6.5e12, Td = 4.0e12
set T_total = Tw + Ts1 + Ts2 + Td
define data pools: Pool_HQ, Pool_MQ (and lower)

t = 0                                   # cumulative tokens seen so far
while t < T_total:
    # 1. Determine LR
    if t <= Tw:
        η = ηmax * (t / Tw)
    else if t <= Tw + Ts1 + Ts2:
        η = ηmax
    else if t <= Tw + Ts1 + Ts2 + Td:
        τ = (t - (Tw + Ts1 + Ts2)) / Td
        η = ηmin + 0.5*(ηmax-ηmin)*(1 + cos(pi * τ))
    else:
        η = ηmin

    # 2. Determine data-mixture weight wHQ
    if t <= Tw:
        wHQ = αwarmup                   # preset, close to the diversity-enriched proportion
    else if t <= Tw + Ts1:
        wHQ = Pool_HQ_size / (Pool_HQ_size + Pool_MQ_size)  # ~0.5
    else:
        wHQ = 1.0

    # 3. Sample a mini-batch
    batch ← []
    for each slot in batch_size:
        if random() < wHQ:
            append sample_from(Pool_HQ) to batch
        else:
            append sample_from(Pool_MQ ∪ others) to batch

    # 4. Take optimizer step at learning rate η
    loss = compute_loss(θ, batch)
    θ ← θ - η * ∇_θ loss
    t ← t + number_of_tokens(batch)     # advance the token counter
```
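Step 3 of the pseudocode above (per-sample pool selection) can be made concrete as follows. This is a minimal sketch under stated assumptions: `sample_batch` is a hypothetical helper, the pools are plain Python lists of placeholder document IDs, and a real pipeline would stream tokenized shards rather than sample from in-memory lists.

```python
import random


def sample_batch(pool_hq, pool_other, w_hq, batch_size, rng):
    """Draw batch_size samples, taking each from the HQ pool with probability w_hq."""
    batch = []
    for _ in range(batch_size):
        # rng.random() lies in [0, 1), so w_hq = 1.0 always selects the HQ pool
        # and w_hq = 0.0 never does.
        pool = pool_hq if rng.random() < w_hq else pool_other
        batch.append(rng.choice(pool))
    return batch


rng = random.Random(0)  # seeded for reproducibility
pool_hq = ["hq_doc_0", "hq_doc_1", "hq_doc_2"]        # placeholder HQ documents
pool_other = ["mq_doc_0", "mq_doc_1", "mq_doc_2"]     # placeholder medium-quality documents

mixed = sample_batch(pool_hq, pool_other, w_hq=0.5, batch_size=8, rng=rng)
hq_only = sample_batch(pool_hq, pool_other, w_hq=1.0, batch_size=8, rng=rng)
```

Selecting the pool per sample, rather than per batch, keeps every mini-batch close to the target mixture during the diversity-enriched phase.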
6. Experimental Results and Empirical Impact
On a 1 B-parameter model with 1 T tokens of pretraining, FG-WSD achieved substantial improvements over vanilla WSD, most notably on mathematical reasoning tasks:
| Metric | Vanilla WSD | FG-WSD | Absolute Gain |
|---|---|---|---|
| GSM8K | 27.1 | 34.3 | +7.2 |
| CMath | 34.5 | 39.5 | +5.0 |
| BBH | 29.3 | 31.6 | +2.3 |
| MMLU | 49.2 | 50.6 | +1.4 |
| CMMLU | 50.3 | 51.9 | +1.6 |
| MMLU-Pro | 16.87 | 18.64 | +1.77 |
Gains were especially pronounced on reasoning benchmarks, reflecting the positive effect of concentrating on HQ, reasoning-dense data in the latter stages. These improvements motivated the scaling and full deployment of FG-WSD in the 23 T-token pretraining of Nanbeige4-3B, contributing to state-of-the-art few-/zero-shot performance among $3$ B-parameter models (Yang et al., 6 Dec 2025).
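The absolute gains in the table above can be recomputed from the two score columns, and relative gains make the contrast between reasoning and knowledge benchmarks easier to see. The snippet below is a small illustrative check; `SCORES` and `gains` are hypothetical names, and the numbers are copied from the table.

```python
# (vanilla WSD, FG-WSD) scores copied from the results table.
SCORES = {
    "GSM8K": (27.1, 34.3),
    "CMath": (34.5, 39.5),
    "BBH": (29.3, 31.6),
    "MMLU": (49.2, 50.6),
    "CMMLU": (50.3, 51.9),
    "MMLU-Pro": (16.87, 18.64),
}


def gains(scores):
    """Return {benchmark: (absolute gain, relative gain in percent)}."""
    result = {}
    for name, (baseline, fg_wsd) in scores.items():
        absolute = round(fg_wsd - baseline, 2)
        relative = round(100 * (fg_wsd - baseline) / baseline, 1)
        result[name] = (absolute, relative)
    return result


table = gains(SCORES)
largest = max(table, key=lambda name: table[name][0])
for name, (absolute, relative) in table.items():
    print(f"{name:>9}: +{absolute} points ({relative}% relative)")
```

Here `largest` evaluates to `"GSM8K"`, matching the +7.2 headline figure.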
7. Context and Implications
FG-WSD represents a shift in scheduler design for large-scale LM pretraining, emphasizing decoupled and staged optimization of both learning-rate and data-mixture profiles. The results in Nanbeige4-3B suggest that the joint fine-graining of plateau phase data curation and smooth learning-rate transitions is particularly effective for maximizing generalization and reasoning ability at sub-10B parameter scales. A plausible implication is that future schedulers may increasingly incorporate granular, data-aware control at the epoch or token level to further leverage high-quality data advantages. The use of such schedulers is predicated on rigorous data curation and quality filtering capabilities.