
Dynamic Whole-Word Masking Curriculum

Updated 5 February 2026
  • The paper introduces a dynamic masking curriculum that adjusts the whole-word mask ratio from 15% to 30% during pre-training to balance global recovery and local semantic refinement.
  • It employs a mathematically defined, piecewise linear schedule integrated with specialized tokenization and attention mechanisms to optimize semantic coherence.
  • Empirical evaluations show that this adaptive approach improves performance on Chinese language benchmarks, yielding gains on CLUE and enhanced correlations on SimCLUE tasks.

Whole-word masking with a dynamic curriculum is a pre-training strategy for encoder-only Transformer models, exemplified by Chinese ModernBERT, that synchronizes the masking ratio with training progress: the word-level corruption rate is ramped up during warmup and then annealed back down. This approach aims to balance global and local reasoning by letting the masking schedule evolve with the optimization horizon. The method is operationalized through a mathematically precise schedule and specialized tokenization, is integrated with system-level optimizations, and is empirically validated on Chinese language understanding and retrieval benchmarks (Zhao et al., 14 Oct 2025).

1. Mathematical Basis for the Dynamic Masking Curriculum

The dynamic curriculum is defined by a piecewise linear schedule that adjusts the whole-word mask ratio $\varepsilon(s)$ as pre-training advances. Let $S$ denote the total number of steps, $S_w$ the warmup length, and $s$ the current step. The mask ratio is bounded by $\varepsilon_{\min} = 0.15$ and $\varepsilon_{\max} = 0.30$. Formally,

$\varepsilon(s) = \begin{cases} \varepsilon_{\min} + (\varepsilon_{\max} - \varepsilon_{\min})\,\dfrac{s}{S_w}, & 0 \le s \le S_w \\[1em] \varepsilon_{\max} - (\varepsilon_{\max} - \varepsilon_{\min})\,\dfrac{s - S_w}{S - S_w}, & S_w < s \le S \end{cases}$

During warmup ($s \le S_w$), the mask ratio increases linearly from 15% to 30%, raising the corruption rate early in training. Subsequently ($s > S_w$), the ratio decays linearly from 30% back to 15%, gradually lowering the corruption rate to enable local semantic refinement.
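The schedule is simple to implement directly from the formula; the sketch below uses only the quantities defined above (total steps, warmup length, and the 15%/30% bounds):

```python
def mask_ratio(s: int, S: int, S_w: int,
               eps_min: float = 0.15, eps_max: float = 0.30) -> float:
    """Piecewise-linear whole-word mask ratio at training step s."""
    if s <= S_w:
        # warmup: ramp linearly from eps_min up to eps_max
        return eps_min + (eps_max - eps_min) * s / S_w
    # decay: anneal linearly from eps_max back down to eps_min
    return eps_max - (eps_max - eps_min) * (s - S_w) / (S - S_w)
```

At $s = 0$ and $s = S$ the ratio is 15%, and it peaks at 30% exactly at the end of warmup ($s = S_w$), matching both branches of the piecewise definition at the boundary.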

2. Operational Workflow: Whole-Word Masking with Scheduling

The masking procedure is synchronized with the dynamic ratio $\varepsilon(s)$. The input is tokenized, and contiguous subwords are grouped into full-word spans. Words are sampled at random and their constituent tokens masked until the expected proportion of masked tokens reaches $\varepsilon(s)$. The standard corruption mix is then applied: 80% of selected tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged. Masking entire words rather than isolated subwords preserves semantic coherence, and the running mask ratio tracks the evolving curriculum.
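A minimal sketch of this procedure, assuming subword token ids arrive with a parallel list of word indices; the tokenizer interface and the `MASK_ID` value are illustrative placeholders, not the paper's implementation:

```python
import random

MASK_ID = 103  # hypothetical [MASK] token id

def whole_word_mask(token_ids, word_ids, ratio, vocab_size, rng=random):
    """Mask whole-word spans until roughly `ratio` of tokens are corrupted.

    token_ids: subword ids; word_ids: word index of each subword position.
    Returns (corrupted_ids, mlm_labels) with -100 at unmasked positions.
    """
    # group token positions into whole-word spans
    spans = {}
    for pos, w in enumerate(word_ids):
        spans.setdefault(w, []).append(pos)
    words = list(spans.values())
    rng.shuffle(words)  # sample words in random order

    budget = int(round(ratio * len(token_ids)))
    out, labels = list(token_ids), [-100] * len(token_ids)
    masked = 0
    for span in words:
        if masked >= budget:
            break
        for pos in span:  # corrupt every subword of the selected word
            labels[pos] = token_ids[pos]
            roll = rng.random()
            if roll < 0.8:                       # 80%: replace with [MASK]
                out[pos] = MASK_ID
            elif roll < 0.9:                     # 10%: random token
                out[pos] = rng.randrange(vocab_size)
            # else 10%: keep the original token (still predicted)
        masked += len(span)
    return out, labels
```

In training, `ratio` would be supplied per step from the curriculum schedule, so the running corruption rate follows $\varepsilon(s)$.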

3. Motivations, Ablations, and Empirical Validation

Early high corruption rates (approx. 30%) induce the model to recover masked content using global context—promoting global reasoning. As training advances and the corruption rate drops (to 15%), the model is exposed to more intact context, enhancing its ability for token-level, local semantic reconstruction.

Ablation studies contrasted four configurations:

  • Token-level masking at 15% (fixed)
  • Whole-word masking at 15% (fixed)
  • Whole-word masking at 30% (fixed)
  • Dynamic curriculum: 15% → 30% → 15%

Empirical findings indicate that whole-word masking (WWM) outperforms token-level masking on word-level tasks (e.g., CMNLI and CSL in CLUE). Fixed high masking (30%) accelerates early convergence but underperforms in the long run. The dynamic curriculum yields both rapid initial optimization and lower final validation losses, producing gains of roughly 1–2 points on CLUE and ~0.02 in Pearson $r$ on SimCLUE compared to any fixed rate (Zhao et al., 14 Oct 2025).

4. Integration with Tokenization, Attention, and Optimization

The curriculum's effectiveness depends on the underlying tokenization scheme. A 32k BPE vocabulary is used, achieving 1.41 characters per token at 8k context and enabling reliable subword-to-word grouping for WWM. The schedule is employed during both Stage I (1024 tokens) and Stage II (8192 tokens) of pre-training. Lower mask rates in long-context phases facilitate dense local pattern acquisition, while the earlier high masking instills robust global context learning.

The masking curriculum is coordinated with a damped-cosine learning-rate schedule; both schedules peak near the end of warmup and decay afterward. Alternating local/global attention mechanisms and the rotary positional encoding (RoPE) stabilize optimization across the spectrum of mask ratios.
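The source specifies a damped-cosine learning-rate schedule but not its exact form; the sketch below pairs a linear warmup with a plain cosine decay as one plausible instantiation (the peak and floor values are illustrative, not the paper's hyperparameters):

```python
import math

def damped_cosine_lr(s: int, S: int, S_w: int,
                     lr_peak: float = 3e-4, lr_min: float = 1e-5) -> float:
    """Illustrative LR schedule: linear warmup to lr_peak at step S_w,
    then cosine decay down to lr_min at step S."""
    if s <= S_w:
        return lr_peak * s / S_w
    progress = (s - S_w) / (S - S_w)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1 + math.cos(math.pi * progress))
```

Note how the peak at $s = S_w$ coincides with the mask-ratio schedule's maximum, so both the corruption rate and the learning rate crest together at the end of warmup and decay afterward.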

5. Quantitative Impact on Downstream Tasks

The downstream effects are assessed on the CLUE benchmark and SimCLUE STS evaluation. Under unified fine-tuning (3 epochs, peak LR $3 \times 10^{-5}$), Chinese ModernBERT with dynamic WWM yields competitive or superior results across eight CLUE tasks. Selected results are summarized:

| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | WSC | CSL | OCNLI | C3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chinese ModernBERT | 73.87 | 56.90 | 60.15 | 83.96 | 52.10 | 86.20 | 79.10 | 82.65 |
| RoBERTa-wwm-large | 76.55 | 58.61 | 62.98 | 82.12 | 74.60 | 82.13 | 78.20 | 73.82 |

For semantic textual similarity (STS) on SimCLUE, fine-tuning with both SimCLUE and T2Ranking datasets (~5M pairs) yields Pearson $r = 0.5050$ and Spearman $\rho = 0.5367$, surpassing Qwen-0.6B-embedding under the same evaluation.

| Model | Contrastive Data | Pearson $r$ | Spearman $\rho$ |
| --- | --- | --- | --- |
| ModernBERT + 5M pairs | 5M | 0.5050 | 0.5367 |
| Qwen-0.6B-embed | 169M | 0.4965 | 0.5211 |
| jina-v2-base-zh | >800M | 0.5188 | 0.5501 |
| gte-multilingual | 2.94B | 0.5384 | 0.5730 |

Pearson's $r$ and Spearman's $\rho$ are the principal correlation metrics used.
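For reference, both metrics can be computed without external dependencies. The sketch below ignores tied ranks, which standard implementations (e.g., scipy.stats) handle with average ranks:

```python
def pearson_r(x, y):
    """Pearson correlation: normalized covariance of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman correlation: Pearson r computed on ranks (ties not averaged)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))
```

Spearman's $\rho$ rewards any monotone relationship between predicted and gold similarity scores, while Pearson's $r$ additionally requires the relationship to be linear.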

6. Significance for Large-Scale Pre-Training

Whole-word masking with a dynamic curriculum, spanning a 15% → 30% → 15% masking schedule, effectively balances the dual objectives of rapid optimization and robust convergence. Its explicit integration with tokenizer design, context-length scaling, and adaptive attention yields consistent improvements in both Chinese language understanding and retrieval tasks. These results indicate that curriculum-aware masking strategies are a potent lever for advancing encoder-based pre-training, particularly in languages with complex morphology and tokenization phenomena (Zhao et al., 14 Oct 2025).

