
FG-WSD: Fine-Grained Warmup–Stable–Decay

Updated 20 February 2026
  • FG-WSD is a pretraining scheduler that introduces a phased increase in high-quality data during the stable plateau to enhance model reasoning.
  • It decouples abrupt learning rate changes from data mixture shifts, ensuring smoother optimization and mitigating training instabilities.
  • Ablations on a 1 B-parameter model show improvements of up to +7.2 points on benchmarks like GSM8K, and the schedule was subsequently adopted for Nanbeige4-3B pretraining.

Fine-Grained Warmup–Stable–Decay (FG-WSD) is a pretraining scheduler for LLMs that jointly modulates learning rate and training data mixture. FG-WSD introduces a fine-grained design in the stable (plateau) regime, progressively increasing the proportion of high-quality (HQ) data over multiple sub-phases, and maintains a clear decoupling between data-mixture transitions and learning-rate changes. This approach was introduced in the pretraining procedure of the Nanbeige4-3B family of LLMs, where it demonstrated significant gains over conventional schedulers, particularly on reasoning-heavy tasks (Yang et al., 6 Dec 2025).

1. Motivation and Theoretical Rationale

FG-WSD addresses two limitations of existing learning-rate and data-mixing schedules. First, although a standard Warmup–Stable–Decay (WSD) learning-rate profile empirically outperforms Warmup–Cosine–Decay when training on very high-quality datasets, it keeps the data mixture uniform throughout the stable plateau, which is suboptimal: the model sees the same data-quality distribution (often 1:1 HQ vs. medium-quality) for the entire period. Second, abrupt data-mixture shifts that coincide with learning-rate changes can destabilize training. FG-WSD targets the first limitation by dividing the stable plateau into multiple sub-phases, each using a higher proportion of HQ data than the last, and the second by decoupling data-mixture shifts from learning-rate changes.

This design enables an early-stage "exploration" period, when the model benefits from broader diversity, followed by a later "exploitation" phase that prioritizes only the best data. Keeping the two kinds of change apart avoids the optimization instabilities that simultaneous drastic changes in learning rate and data distribution can cause.

2. Formal Description and Mathematical Framework

Let $t$ denote the cumulative number of training tokens. FG-WSD divides training into four principal stages, parameterized by token horizons:

  • $T_{\mathrm{w}}$: warm-up horizon;
  • $T_{\mathrm{s}1}$: diversity-enriched stable phase;
  • $T_{\mathrm{s}2}$: HQ-only stable phase;
  • $T_{\mathrm{d}}$: decay horizon.

The maximum plateau learning rate is $\eta_{\max}$ and the minimum end-of-decay learning rate is $\eta_{\min}$. Schedules are as follows:

Learning-rate schedule:

  • $0 \le t \le T_{\mathrm{w}}$: $\eta(t) = \eta_{\max} \cdot (t/T_{\mathrm{w}})$ (linear warmup).
  • $T_{\mathrm{w}} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2}$: $\eta(t) = \eta_{\max}$ (constant).
  • $T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2} + T_{\mathrm{d}}$:

$$\eta(t) = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left[1 + \cos\left(\pi \, \frac{t - (T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2})}{T_{\mathrm{d}}}\right)\right]$$

  • Training may terminate beyond this or continue at $\eta_{\min}$.
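The piecewise learning-rate schedule above can be sketched as a single Python function. This is a minimal illustration, not the authors' code: the function name `fg_wsd_lr` and its argument names are hypothetical, and the example values are the Nanbeige4-3B-Base settings reported in this article.

```python
import math


def fg_wsd_lr(t, t_w, t_s1, t_s2, t_d, eta_max, eta_min):
    """Learning rate at cumulative token count t (all horizons in tokens)."""
    plateau_end = t_w + t_s1 + t_s2
    if t <= t_w:
        return eta_max * t / t_w              # linear warmup
    if t <= plateau_end:
        return eta_max                        # constant plateau (both stable sub-phases)
    if t <= plateau_end + t_d:
        tau = (t - plateau_end) / t_d         # decay progress in (0, 1]
        return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * tau))
    return eta_min                            # optional tail if training continues


T = 1e12  # one trillion tokens
args = dict(t_w=0.1 * T, t_s1=12.4 * T, t_s2=6.5 * T, t_d=4.0 * T,
            eta_max=4.5e-4, eta_min=1.5e-6)

for tokens in (0.05 * T, 10 * T, 21 * T, 23 * T):
    print(f"t = {tokens / T:5.2f} T  ->  lr = {fg_wsd_lr(tokens, **args):.4e}")
```

Note that the two stable sub-phases are indistinguishable from the learning rate's point of view; only the data mixture tells them apart.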

Data-mixture schedule: Let $w_\mathrm{HQ}(t)$ denote the probability of drawing a sample from HQ data at time $t$:

  • $0 \le t \le T_{\mathrm{w}}$: $w_\mathrm{HQ}(t) = \alpha_\mathrm{warmup}$ (often close to the first stable phase's proportion).
  • $T_{\mathrm{w}} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1}$: $w_\mathrm{HQ}(t) = \alpha_\mathrm{S1}$ (e.g., 50% HQ, 50% medium-quality).
  • $T_{\mathrm{w}} + T_{\mathrm{s}1} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2}$: $w_\mathrm{HQ}(t) = 1.0$ (HQ only).
  • During decay, maintain $w_\mathrm{HQ} = 1$.

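The data mixture is piecewise constant, and its one substantial change, the switch to HQ-only sampling at $t = T_{\mathrm{w}} + T_{\mathrm{s}1}$, falls inside the constant-learning-rate plateau. The mixture shift therefore never coincides with a change in learning rate.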
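The data-mixture schedule can be sketched in the same way. The function name `fg_wsd_w_hq` is hypothetical, and the default values of 0.5 for `alpha_warmup` and `alpha_s1` are an assumption based on the "e.g., 50% HQ" figure above rather than exact reported settings.

```python
def fg_wsd_w_hq(t, t_w, t_s1, alpha_warmup=0.5, alpha_s1=0.5):
    """Probability of drawing an HQ sample at cumulative token count t."""
    if t <= t_w:
        return alpha_warmup       # warmup: close to the first stable phase's mix
    if t <= t_w + t_s1:
        return alpha_s1           # diversity-enriched stable phase
    return 1.0                    # HQ-only stable phase and decay


T = 1e12  # one trillion tokens
for tokens in (0.05 * T, 5 * T, 13 * T, 22 * T):
    print(f"t = {tokens / T:5.2f} T  ->  w_HQ = {fg_wsd_w_hq(tokens, 0.1 * T, 12.4 * T)}")
```

The function needs only the first two horizons, because nothing about the mixture changes once the HQ-only phase has begun.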

3. Hyperparameter Settings and Practical Instantiation

For Nanbeige4-3B-Base, FG-WSD was instantiated with the following parameters (token units: trillions, $\mathrm{T} = 10^{12}$):

| Stage | Tokens | Learning Rate Behavior |
|---|---|---|
| Warmup | 0.1 T | linear $0 \rightarrow 4.5 \times 10^{-4}$ |
| Diversity-Enriched | 12.4 T | constant $4.5 \times 10^{-4}$ |
| HQ Stable | 6.5 T | constant $4.5 \times 10^{-4}$ |
| Decay | 4.0 T | $4.5 \times 10^{-4} \rightarrow 1.5 \times 10^{-6}$ |

  • $T_\mathrm{w} = 0.1$ T (warm-up)
  • $T_{\mathrm{s}1} = 12.4$ T (diversity-enriched)
  • $T_{\mathrm{s}2} = 6.5$ T (HQ stable)
  • $T_\mathrm{d} = 4.0$ T (decay)
  • $\eta_{\max} = 4.5 \times 10^{-4}$, $\eta_{\min} = 1.5 \times 10^{-6}$
  • Data mixture: diversity-enriched phase uses roughly 50% HQ, 50% other filtered data; HQ-stable and decay phases use only HQ.
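As a quick consistency check, the four horizons can be collected in a small config object and accumulated into stage boundaries. The class name `FGWSDConfig` and its field names are hypothetical; only the numbers come from the settings above. The horizons sum to 23 T tokens, which matches the total pretraining budget cited later in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FGWSDConfig:
    # Token horizons in trillions of tokens, as reported for Nanbeige4-3B-Base.
    t_w: float = 0.1
    t_s1: float = 12.4
    t_s2: float = 6.5
    t_d: float = 4.0
    eta_max: float = 4.5e-4
    eta_min: float = 1.5e-6

    def boundaries(self):
        """Return (stage, cumulative tokens in T at the end of the stage)."""
        stages = [
            ("warmup", self.t_w),
            ("diversity-enriched", self.t_s1),
            ("hq-stable", self.t_s2),
            ("decay", self.t_d),
        ]
        ends, total = [], 0.0
        for name, length in stages:
            total += length
            ends.append((name, round(total, 6)))  # round away float noise
        return ends


cfg = FGWSDConfig()
bounds = cfg.boundaries()
total_tokens = bounds[-1][1]
for name, end in bounds:
    print(f"{name:>18}: ends at {end:5.1f} T ({end / total_tokens:6.1%} of the run)")
```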

Hyperparameters were derived from ablations on a 1 B-parameter model with a toy 1 T-token corpus: splitting the stable phase ($0.75$ T mixed medium/HQ, $0.25$ T HQ only) yielded noticeable gains, motivating proportional scaling to larger runs.

4. Data-Mixture Schedule and Evolution

FG-WSD specifies the evolution of training data mixture alongside learning rate. The four main phases are:

  • Warmup ($0.1$ T): Data mixture approximates the ensuing diversity-enriched phase.
  • Diversity-Enriched Stable ($12.4$ T): Mini-batches sampled from the full corpus with approximately 50% HQ and 50% medium/filtered data.
  • HQ Stable ($6.5$ T): Sampling exclusively from the HQ subset.
  • Decay ($4.0$ T): Continued exclusive HQ sampling, with a decaying learning rate.

Earlier ablations compared vanilla WSD (uniform 1:1 shuffle of medium and high-quality data) to FG-WSD, confirming benefits from prolonging the mixed phase and carefully transitioning to HQ only.
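To get a feel for how strongly this schedule tilts training toward HQ data, the expected share of HQ draws over the whole run can be estimated from the phase lengths. The sketch below treats "approximately 50%" as exactly 0.5 for the warmup and diversity-enriched phases, which is an assumption made for illustration. It counts sampling draws (token exposure), not unique HQ tokens, since the source does not say how often HQ data is repeated.

```python
# (stage, tokens in T, HQ sampling probability)
stages = [
    ("warmup", 0.1, 0.5),               # assumed equal to the first stable phase
    ("diversity-enriched", 12.4, 0.5),  # "approximately 50%" taken as exact
    ("hq-stable", 6.5, 1.0),
    ("decay", 4.0, 1.0),
]

total = sum(tokens for _, tokens, _ in stages)
hq = sum(tokens * w_hq for _, tokens, w_hq in stages)

print(f"HQ draws: {hq:.2f} T of {total:.1f} T ({hq / total:.1%})")
# prints: HQ draws: 16.75 T of 23.0 T (72.8%)
```

Under these assumptions, roughly 73% of all training tokens are drawn from the HQ pool, compared with 50% for a uniform 1:1 mixture over the same budget.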

5. Pseudocode and Training Workflow

The FG-WSD scheduler can be summarized in the following pseudocode, which uses the Nanbeige4-3B-Base settings from Section 3:

initialize model parameters θ
set ηmax = 4.5e-4, ηmin = 1.5e-6
set T = 1e12   # one trillion tokens
set Tw = 0.1*T,  Ts1 = 12.4*T,  Ts2 = 6.5*T,  Td = 4.0*T
set T_total = Tw + Ts1 + Ts2 + Td   # 23 T tokens
set αwarmup = αS1 = 0.5   # HQ sampling probability before the HQ-only phase (~50%, assumed exact here)
define data pools: Pool_HQ, Pool_MQ (and lower)
t ← 0   # cumulative training tokens
while t < T_total:
  # 1. Determine LR from the tokens consumed so far
  if t <= Tw:
    η = ηmax * (t / Tw)
  else if t <= Tw + Ts1 + Ts2:
    η = ηmax
  else if t <= Tw + Ts1 + Ts2 + Td:
    τ = (t - (Tw + Ts1 + Ts2)) / Td
    η = ηmin + 0.5*(ηmax-ηmin)*(1 + cos(pi * τ))
  else:
    η = ηmin   # reached only if training is extended past T_total
  # 2. Determine data-mixture weight wHQ
  if t <= Tw:
    wHQ = αwarmup
  else if t <= Tw + Ts1:
    wHQ = αS1
  else:
    wHQ = 1.0
  # 3. Sample a mini-batch
  batch ← empty list
  for each slot in the mini-batch:
    if random() < wHQ:
      x ← sample_from(Pool_HQ)
    else:
      x ← sample_from(Pool_MQ ∪ others)
    append x to batch
  # 4. Take optimizer step at learning rate η
  loss = compute_loss(θ, batch)
  θ ← θ - η * ∇_θ loss
  t ← t + number of tokens in batch
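Step 3 of the pseudocode, the per-sample choice between data pools, can be made concrete with a short runnable sketch. The function `sample_batch` and the toy string pools are hypothetical stand-ins: a real pipeline would draw tokenized documents from sharded datasets rather than from in-memory lists.

```python
import random


def sample_batch(pool_hq, pool_other, w_hq, batch_size, rng):
    """Draw batch_size examples; each comes from pool_hq with probability w_hq."""
    batch = []
    for _ in range(batch_size):
        pool = pool_hq if rng.random() < w_hq else pool_other
        batch.append(rng.choice(pool))
    return batch


rng = random.Random(0)                          # seeded for reproducibility
pool_hq = [f"hq_{i}" for i in range(100)]       # stand-ins for HQ documents
pool_other = [f"mq_{i}" for i in range(100)]    # stand-ins for medium-quality documents

mixed = sample_batch(pool_hq, pool_other, w_hq=0.5, batch_size=1000, rng=rng)
hq_only = sample_batch(pool_hq, pool_other, w_hq=1.0, batch_size=1000, rng=rng)

frac_mixed = sum(x.startswith("hq_") for x in mixed) / len(mixed)
frac_hq_only = sum(x.startswith("hq_") for x in hq_only) / len(hq_only)
print(frac_mixed)     # close to 0.5 (diversity-enriched phase)
print(frac_hq_only)   # exactly 1.0 (HQ-stable and decay phases)
```

Because `rng.random()` returns values in [0, 1), setting `w_hq=1.0` always selects the HQ pool, so the HQ-only phases need no special case.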

6. Experimental Results and Empirical Impact

On a 1 B-parameter model with 1 T tokens of pretraining, FG-WSD achieved substantial improvements over vanilla WSD, most notably on mathematical reasoning tasks:

| Metric | Vanilla WSD | FG-WSD | Absolute Gain |
|---|---|---|---|
| GSM8K | 27.1 | 34.3 | +7.2 |
| CMath | 34.5 | 39.5 | +5.0 |
| BBH | 29.3 | 31.6 | +2.3 |
| MMLU | 49.2 | 50.6 | +1.4 |
| CMMLU | 50.3 | 51.9 | +1.6 |
| MMLU-Pro | 16.87 | 18.64 | +1.77 |

Gains were especially pronounced on reasoning benchmarks, reflecting the positive effect of concentrating on HQ, reasoning-dense data in the latter stages. These improvements motivated the scaling and full deployment of FG-WSD in the 23 T-token pretraining of Nanbeige4-3B, contributing to state-of-the-art few-/zero-shot performance among $3$ B-parameter models (Yang et al., 6 Dec 2025).
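The absolute gains in the table can be re-derived and ranked with a few lines of Python. The scores are copied from the table above; the relative gains are computed here for illustration and are not reported in the source.

```python
# benchmark: (vanilla WSD score, FG-WSD score), from the 1 B-parameter ablation
results = {
    "GSM8K": (27.1, 34.3),
    "CMath": (34.5, 39.5),
    "BBH": (29.3, 31.6),
    "MMLU": (49.2, 50.6),
    "CMMLU": (50.3, 51.9),
    "MMLU-Pro": (16.87, 18.64),
}

gains = {name: round(fg - base, 2) for name, (base, fg) in results.items()}
relative = {name: gains[name] / base for name, (base, _) in results.items()}

# Rank benchmarks by absolute gain, largest first.
for name in sorted(gains, key=gains.get, reverse=True):
    print(f"{name:<9} {gains[name]:+.2f} absolute  {relative[name]:+.1%} relative")
```

The two mathematical-reasoning benchmarks, GSM8K and CMath, top the ranking, in line with the observation that gains concentrate on reasoning tasks.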

7. Context and Implications

FG-WSD represents a shift in scheduler design for large-scale LM pretraining, emphasizing decoupled and staged optimization of both learning-rate and data-mixture profiles. The results in Nanbeige4-3B suggest that the joint fine-graining of plateau phase data curation and smooth learning-rate transitions is particularly effective for maximizing generalization and reasoning ability at sub-10B parameter scales. A plausible implication is that future schedulers may increasingly incorporate granular, data-aware control at the epoch or token level to further leverage high-quality data advantages. The use of such schedulers is predicated on rigorous data curation and quality filtering capabilities.

