FG-WSD: Fine-Grained Warmup–Stable–Decay

Updated 20 February 2026

FG-WSD is a pretraining scheduler that raises the share of high-quality data in phases during the stable learning-rate plateau, with the aim of improving model reasoning.
It keeps data-mixture shifts separate from learning-rate changes, which is intended to smooth optimization and reduce training instabilities.
In a 1 B-parameter ablation reported for Nanbeige4-3B, it improved over vanilla WSD by up to +7.2 points (GSM8K).

Fine-Grained Warmup–Stable–Decay (FG-WSD) is a pretraining scheduler for LLMs that jointly modulates learning rate and training data mixture. FG-WSD introduces a fine-grained design in the stable (plateau) regime, progressively increasing the proportion of high-quality (HQ) data over multiple sub-phases, and keeps data-mixture transitions decoupled from learning-rate changes. This approach was introduced in the pretraining procedure of the Nanbeige4-3B family of LLMs, where ablations showed gains over a vanilla WSD schedule, particularly on reasoning-heavy tasks (Yang et al., 6 Dec 2025).

1. Motivation and Theoretical Rationale

FG-WSD arises from two key limitations in existing learning-rate and data-mixing schedules. First, while a standard Warmup–Stable–Decay (WSD) learning-rate profile empirically outperforms Warmup–Cosine–Decay when using very high-quality datasets, the uniform data mixture during the stable plateau is suboptimal: the model is exposed to the same data-quality distribution (often 1:1 HQ vs. medium-quality) throughout this period. Second, when a shift in data mixture coincides with a change in learning rate, the two abrupt changes can compound and destabilize optimization. FG-WSD addresses the first limitation by splitting the stable plateau into multiple sub-phases, each using a higher proportion of HQ data, and the second by decoupling learning-rate shifts from changes in data mixture.

This design allows an early "exploration" period, in which the model benefits from broader data diversity, followed by a later "exploitation" phase that trains only on the best data. Because the mixture shift happens while the learning rate is constant, it avoids the optimization instabilities that can arise when learning rate and data distribution change drastically at the same time.

2. Formal Description and Mathematical Framework

Let $t$ denote the cumulative number of training tokens. FG-WSD divides training into four principal stages, each parameterized by its length in tokens:

$T_{\mathrm{w}}$ : length of the warm-up stage;
$T_{\mathrm{s}1}$ : length of the diversity-enriched stable phase;
$T_{\mathrm{s}2}$ : length of the HQ-only stable phase;
$T_{\mathrm{d}}$ : length of the decay stage.

The maximum plateau learning rate is $\eta_{\max}$ and the minimum end-of-decay learning rate is $\eta_{\min}$ . Schedules are as follows:

Learning-rate schedule:

$0 \le t \le T_{\mathrm{w}}$ : $\eta(t) = \eta_{\max} \cdot (t/T_{\mathrm{w}})$ (linear warmup).
$T_{\mathrm{w}} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2}$ : $\eta(t) = \eta_{\max}$ (constant).
$T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2} + T_{\mathrm{d}}$ :

$\eta(t) = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \big[1 + \cos\big(\pi \frac{t - (T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2})}{T_{\mathrm{d}}}\big)\big]$

Training may terminate at the end of the decay stage or continue at $\eta_{\min}$ .
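The piecewise learning-rate schedule above can be written as a small function. The following is a minimal sketch, not the authors' implementation: the function name `fg_wsd_lr` and its argument names are hypothetical, and the example values are the ones this article reports for Nanbeige4-3B-Base.

```python
import math


def fg_wsd_lr(t, t_w, t_s1, t_s2, t_d, eta_max, eta_min):
    """Learning rate at cumulative token count t (all lengths in tokens)."""
    plateau_end = t_w + t_s1 + t_s2
    if t <= t_w:
        return eta_max * (t / t_w)  # linear warmup
    if t <= plateau_end:
        return eta_max  # constant plateau (both stable phases)
    if t <= plateau_end + t_d:
        tau = (t - plateau_end) / t_d  # decay progress in (0, 1]
        return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * tau))
    return eta_min  # optional continuation after decay


T = 1e12  # one trillion tokens


def lr_at(t):
    # Stage lengths and learning rates as reported for Nanbeige4-3B-Base.
    return fg_wsd_lr(t, 0.1 * T, 12.4 * T, 6.5 * T, 4.0 * T, 4.5e-4, 1.5e-6)


for tokens in (0.05 * T, 10.0 * T, 21.0 * T, 23.0 * T):
    print(f"{tokens / T:5.2f} T tokens -> lr = {lr_at(tokens):.3e}")
```

The four probe points fall in the warmup, the plateau, the middle of the decay, and the end of the decay, so the printed values trace the whole profile.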

Data-mixture schedule: Let $w_\mathrm{HQ}(t)$ denote the probability of drawing a sample from HQ data at time $t$ :

$0 \le t \le T_{\mathrm{w}}$ : $w_\mathrm{HQ}(t) = \alpha_\mathrm{warmup}$ (often close to the first stable phase's proportion).
$T_{\mathrm{w}} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1}$ : $w_\mathrm{HQ}(t) = \alpha_\mathrm{S1}$ (e.g., 50% HQ, 50% medium-quality).
$T_{\mathrm{w}} + T_{\mathrm{s}1} < t \le T_{\mathrm{w}} + T_{\mathrm{s}1} + T_{\mathrm{s}2}$ : $w_\mathrm{HQ}(t) = 1.0$ (HQ only).
During decay, maintain $w_\mathrm{HQ} = 1$ .

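The learning rate is continuous across stage boundaries, while the data mixture is piecewise constant. Its single step change (at $t = T_{\mathrm{w}} + T_{\mathrm{s}1}$ , assuming $\alpha_\mathrm{warmup} = \alpha_\mathrm{S1}$ ) falls inside the constant-learning-rate plateau, so the mixture shift and the start of learning-rate decay never coincide.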
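The data-mixture schedule can be sketched the same way. In this illustration the function name `hq_weight` is hypothetical, and the defaults $\alpha_\mathrm{warmup} = \alpha_\mathrm{S1} = 0.5$ are assumptions based on the "roughly 50% HQ" figure given in this article. The snippet also computes how many tokens separate the mixture switch from the start of learning-rate decay, which is the decoupling the schedule is built around.

```python
def hq_weight(t, t_w, t_s1, alpha_warmup=0.5, alpha_s1=0.5):
    """Probability of drawing an HQ sample at cumulative token count t."""
    if t <= t_w:
        return alpha_warmup  # warmup: close to the first stable phase
    if t <= t_w + t_s1:
        return alpha_s1  # diversity-enriched stable phase
    return 1.0  # HQ-only stable phase and decay


T = 1e12  # one trillion tokens
t_w, t_s1, t_s2 = 0.1 * T, 12.4 * T, 6.5 * T

mixture_switch = t_w + t_s1  # token count where w_HQ steps up to 1.0
decay_start = t_w + t_s1 + t_s2  # token count where the learning rate starts to fall

print("w_HQ just before / after the switch:",
      hq_weight(mixture_switch, t_w, t_s1),
      hq_weight(mixture_switch + 1, t_w, t_s1))
print("tokens between mixture switch and LR decay (in T):",
      (decay_start - mixture_switch) / T)
```

The gap between the two events equals $T_{\mathrm{s}2}$ , so the model trains on HQ-only data at the full plateau learning rate for that whole phase before any decay begins.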

3. Hyperparameter Settings and Practical Instantiation

For Nanbeige4-3B-Base, FG-WSD was instantiated with the following parameters (token units: trillions, $T=10^{12}$ ):

Stage	Tokens	Learning Rate Behavior
Warmup	0.1 T	linear $0 \rightarrow 4.5 \times 10^{-4}$
Diversity-Enriched	12.4 T	constant $4.5 \times 10^{-4}$
HQ Stable	6.5 T	constant $4.5 \times 10^{-4}$
Decay	4.0 T	$4.5 \times 10^{-4} \rightarrow 1.5 \times 10^{-6}$

$T_\mathrm{w} = 0.1$ T (warm-up)
$T_{\mathrm{s}1} = 12.4$ T (diversity-enriched)
$T_{\mathrm{s}2} = 6.5$ T (HQ stable)
$T_\mathrm{d} = 4.0$ T (decay)
$\eta_{\max} = 4.5 \times 10^{-4}, \eta_{\min} = 1.5 \times 10^{-6}$
Data mixture: diversity-enriched phase uses roughly $50\%$ HQ, $50\%$ other filtered data; HQ-stable and decay phases use only HQ.

Hyperparameters were derived from ablations on a 1 B-parameter model with a 1 T-token corpus: splitting the stable phase into $0.75$ T of mixed medium/HQ data followed by $0.25$ T of HQ-only data yielded gains of up to +7.2 points on GSM8K (Section 6), motivating a scaled-up version of the same two-stage split for the larger run.
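As a quick sanity check on the numbers above, the stage lengths can be summed to obtain the cumulative stage boundaries and the total token budget. The snippet below is only illustrative arithmetic on the figures quoted in this section; the variable names are hypothetical. It also compares the HQ-only share of the plateau in the full run with the share used in the 1 T-token ablation.

```python
from itertools import accumulate

# Stage lengths in trillions of tokens, as listed in this section.
names = ["warmup", "diversity-enriched", "hq-stable", "decay"]
lengths = [0.1, 12.4, 6.5, 4.0]

boundaries = list(accumulate(lengths))  # cumulative end point of each stage
total = boundaries[-1]

for name, end in zip(names, boundaries):
    print(f"{name:20s} ends at {end:5.1f} T tokens")
print(f"total budget: {total:.1f} T tokens")

# Share of the stable plateau spent on HQ-only data.
full_run_share = 6.5 / (12.4 + 6.5)
ablation_share = 0.25 / (0.75 + 0.25)
print(f"HQ-only share of plateau: full run {full_run_share:.3f}, ablation {ablation_share:.3f}")
```

The total matches the 23 T-token budget cited for Nanbeige4-3B. The HQ-only share of the plateau is somewhat larger in the full run than in the ablation, so the scaling is close to, but not exactly, proportional.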

4. Data-Mixture Schedule and Evolution

FG-WSD specifies the evolution of training data mixture alongside learning rate. The four main phases are:

Warmup ($0.1$ T): Data mixture approximates the ensuing diversity-enriched phase.
Diversity-Enriched Stable ($12.4$ T): Mini-batches sampled from the full corpus with approximately $50\%$ HQ and $50\%$ medium/filtered data.
HQ Stable ($6.5$ T): Sampling exclusively from the HQ subset.
Decay ($4.0$ T): Continued exclusive HQ sampling, with a decaying learning rate.

Earlier ablations compared vanilla WSD (a uniform 1:1 shuffle of medium- and high-quality data throughout the plateau) with FG-WSD. The results favored a long mixed phase followed by a deliberate transition to HQ-only data.

5. Pseudocode and Training Workflow

The FG-WSD training loop can be summarized in the following pseudocode, using the Nanbeige4-3B-Base settings from Section 3:

initialize model parameters θ
set ηmax = 4.5e-4, ηmin = 1.5e-6
set T = 1e12                                  # one trillion tokens
set Tw = 0.1*T,  Ts1 = 12.4*T,  Ts2 = 6.5*T,  Td = 4.0*T
set T_total = Tw + Ts1 + Ts2 + Td             # 23 T tokens
set αS1 = 0.5, αwarmup = αS1                  # HQ share in warmup and first stable phase
define data pools: Pool_HQ, Pool_MQ (and lower)
t = 0                                         # tokens consumed so far
while t < T_total:
  # 1. Determine LR
  if t <= Tw:
    η = ηmax * (t / Tw)
  else if t <= Tw + Ts1 + Ts2:
    η = ηmax
  else if t <= Tw + Ts1 + Ts2 + Td:
    τ = (t - (Tw + Ts1 + Ts2)) / Td
    η = ηmin + 0.5*(ηmax-ηmin)*(1 + cos(pi * τ))
  else:
    η = ηmin                                  # only reached if training continues past decay
  # 2. Determine data-mixture weight wHQ
  if t <= Tw:
    wHQ = αwarmup
  else if t <= Tw + Ts1:
    wHQ = αS1
  else:
    wHQ = 1.0
  # 3. Sample a mini-batch, choosing the pool independently for each sample
  batch = []
  for i from 1 to batch_size:
    if random() < wHQ:
      append sample_from(Pool_HQ) to batch
    else:
      append sample_from(Pool_MQ ∪ others) to batch
  # 4. Take optimizer step at learning rate η
  #    (written as plain SGD for brevity; the optimizer is not specified here)
  loss = compute_loss(θ, batch)
  θ ← θ - η * ∇_θ loss
  # 5. Advance the token counter by the tokens just consumed
  t ← t + tokens_in(batch)
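Step 3 of the pseudocode, the per-sample choice between data pools, can be made concrete with a short runnable sketch. Everything here is illustrative: `sample_batch` is a hypothetical helper, the pools are toy lists of strings, and a real pipeline would stream tokenized documents rather than sample from in-memory lists.

```python
import random


def sample_batch(w_hq, pool_hq, pool_other, batch_size, rng):
    """Draw batch_size samples, each from pool_hq with probability w_hq."""
    batch = []
    for _ in range(batch_size):
        # rng.random() is in [0, 1), so w_hq = 1.0 always selects the HQ pool.
        pool = pool_hq if rng.random() < w_hq else pool_other
        batch.append(rng.choice(pool))
    return batch


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pool_hq = [f"hq_{i}" for i in range(100)]  # toy stand-in for HQ documents
pool_other = [f"mq_{i}" for i in range(100)]  # toy stand-in for medium-quality documents

batch = sample_batch(0.5, pool_hq, pool_other, batch_size=10_000, rng=rng)
hq_fraction = sum(x.startswith("hq_") for x in batch) / len(batch)
print(f"empirical HQ fraction at w_hq = 0.5: {hq_fraction:.3f}")
```

Because each sample is drawn independently, the HQ fraction of any single batch fluctuates around `w_hq`; only the expectation matches the schedule exactly.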

6. Experimental Results and Empirical Impact

On a 1 B-parameter model with 1 T tokens of pretraining, FG-WSD achieved substantial improvements over vanilla WSD, most notably on mathematical reasoning tasks:

Metric	Vanilla WSD	FG-WSD	Absolute Gain
GSM8K	27.1	34.3	+7.2
CMath	34.5	39.5	+5.0
BBH	29.3	31.6	+2.3
MMLU	49.2	50.6	+1.4
CMMLU	50.3	51.9	+1.6
MMLU-Pro	16.87	18.64	+1.77

Gains were especially pronounced on reasoning benchmarks, reflecting the positive effect of concentrating on HQ, reasoning-dense data in the latter stages. These improvements motivated the scaling and full deployment of FG-WSD in the 23 T-token pretraining of Nanbeige4-3B, contributing to state-of-the-art few-/zero-shot performance among $3$ B-parameter models (Yang et al., 6 Dec 2025).
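The absolute gains in the table can be recomputed from the two score columns, and relative gains make the contrast between reasoning and knowledge benchmarks easier to see. The snippet below is a simple check on the reported numbers; the dictionary layout and variable names are illustrative only.

```python
# Scores from the table above: benchmark -> (vanilla WSD, FG-WSD).
scores = {
    "GSM8K": (27.1, 34.3),
    "CMath": (34.5, 39.5),
    "BBH": (29.3, 31.6),
    "MMLU": (49.2, 50.6),
    "CMMLU": (50.3, 51.9),
    "MMLU-Pro": (16.87, 18.64),
}

# Absolute gain in points, rounded to the precision used in the table.
gains = {name: round(fg - base, 2) for name, (base, fg) in scores.items()}

for name, (base, fg) in scores.items():
    relative = 100.0 * (fg - base) / base  # gain as a percentage of the baseline score
    print(f"{name:9s} +{gains[name]:.2f} points  (+{relative:.1f}% relative)")

largest = max(gains, key=gains.get)
print("largest absolute gain:", largest)
```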

7. Context and Implications

FG-WSD represents a shift in scheduler design for large-scale LM pretraining, emphasizing decoupled and staged optimization of both learning-rate and data-mixture profiles. The results in Nanbeige4-3B suggest that the joint fine-graining of plateau phase data curation and smooth learning-rate transitions is particularly effective for maximizing generalization and reasoning ability at sub-10B parameter scales. A plausible implication is that future schedulers may increasingly incorporate granular, data-aware control at the epoch or token level to further leverage high-quality data advantages. The use of such schedulers is predicated on rigorous data curation and quality filtering capabilities.

References

Yang et al. (6 Dec 2025). Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models.
