Dynamic Whole-Word Masking Curriculum
- The paper introduces a dynamic masking curriculum that adjusts the whole-word mask ratio from 15% to 30% during pre-training to balance global recovery and local semantic refinement.
- It employs a mathematically defined, piecewise linear schedule integrated with specialized tokenization and attention mechanisms to optimize semantic coherence.
- Empirical evaluations show that this adaptive approach improves performance on Chinese language benchmarks, yielding gains on CLUE and enhanced correlations on SimCLUE tasks.
Whole-word masking with a dynamic curriculum is a pre-training strategy for encoder-only Transformer models—exemplified by Chinese ModernBERT—that synchronizes the masking ratio with training progress, ramping the word-level corruption rate up during warmup and back down over the remainder of training. This approach aims to balance global and local reasoning by letting the masking schedule evolve with the optimization horizon. The method is operationalized through a mathematically precise schedule and specialized tokenization, is integrated with system-level optimizations, and is empirically validated on challenging Chinese language understanding and retrieval benchmarks (Zhao et al., 14 Oct 2025).
1. Mathematical Basis for the Dynamic Masking Curriculum
The dynamic curriculum is defined by a piecewise linear schedule that adjusts the whole-word mask ratio as pre-training advances. Let $S$ denote the total number of steps, $S_w$ the warmup length, and $s$ the current step. The mask ratio is bounded by $\varepsilon_{\min} = 0.15$ and $\varepsilon_{\max} = 0.30$. Formally,
$\varepsilon(s) = \begin{cases} \varepsilon_{\min} + (\varepsilon_{\max} - \varepsilon_{\min})\,\frac{s}{S_w}, & 0 \le s \le S_w \\[1em] \varepsilon_{\max} - (\varepsilon_{\max} - \varepsilon_{\min})\,\frac{s - S_w}{S - S_w}, & S_w < s \le S \end{cases}$
During the warmup ($0 \le s \le S_w$), the mask ratio increases linearly from 15% to 30%, incentivizing heavier masking early in training. Subsequently ($S_w < s \le S$), the ratio decays linearly from 30% back to 15%, gradually lowering the corruption rate to enable local semantic refinement.
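The schedule above can be sketched directly in code; the function name and step arguments are illustrative, not from the paper:

```python
def mask_ratio(s: int, S: int, S_w: int,
               eps_min: float = 0.15, eps_max: float = 0.30) -> float:
    """Piecewise linear whole-word mask-ratio schedule.

    Rises linearly from eps_min to eps_max over the warmup
    (steps 0..S_w), then decays linearly back to eps_min by step S.
    """
    if s <= S_w:
        return eps_min + (eps_max - eps_min) * s / S_w
    return eps_max - (eps_max - eps_min) * (s - S_w) / (S - S_w)
```

Evaluating the function at the boundary steps recovers the 15% → 30% → 15% trajectory described above.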
2. Operational Workflow: Whole-Word Masking with Scheduling
The masking procedure is synchronized with the dynamic ratio $\varepsilon(s)$. The input is tokenized, and contiguous subwords are grouped into full-word spans. Words are randomly sampled, and their constituent tokens are masked until the expected proportion of masked tokens meets $\varepsilon(s)$. Mask application follows the standard BERT recipe: 80% of the selected word tokens are replaced with [MASK], 10% with a random token, and 10% are left unchanged. Masking entire words rather than isolated subword tokens preserves semantic coherence and guarantees that the running mask ratio tracks the evolving curriculum.
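A minimal sketch of this workflow, assuming word spans have already been recovered from the tokenizer (function and argument names are hypothetical):

```python
import random

MASK = "[MASK]"

def whole_word_mask(tokens, word_spans, ratio, vocab, rng=random):
    """Mask whole-word spans until roughly `ratio` of tokens are corrupted.

    tokens: list of subword tokens.
    word_spans: list of (start, end) index pairs, each covering the
        contiguous subwords of one full word.
    ratio: target fraction of masked tokens, e.g. from the curriculum.
    vocab: pool of tokens used for the 10% random replacements.
    """
    out = list(tokens)
    budget = int(round(ratio * len(tokens)))
    masked = 0
    # Visit words in random order; always corrupt a word in full,
    # so the budget may be slightly overshot at a word boundary.
    for start, end in rng.sample(word_spans, len(word_spans)):
        if masked >= budget:
            break
        for i in range(start, end):
            roll = rng.random()
            if roll < 0.8:
                out[i] = MASK               # 80%: replace with [MASK]
            elif roll < 0.9:
                out[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% keep the original token
            masked += 1
    return out
```

Because selection happens at the word level, a multi-subword word is always corrupted as a unit, which is the property distinguishing WWM from token-level masking.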
3. Motivations, Ablations, and Empirical Validation
Early high corruption rates (approx. 30%) induce the model to recover masked content using global context—promoting global reasoning. As training advances and the corruption rate drops (to 15%), the model is exposed to more intact context, enhancing its ability for token-level, local semantic reconstruction.
Ablation studies contrasted four configurations:
- Token-level masking at 15% (fixed)
- Whole-word masking at 15% (fixed)
- Whole-word masking at 30% (fixed)
- Dynamic curriculum: 15% → 30% → 15%
Empirical findings indicate that whole-word masking (WWM) outperforms token-level masking on word-level tasks (e.g., CMNLI and CSL in CLUE). Fixed high masking (30%) accelerates early convergence but underperforms in the long run. The dynamic curriculum delivers both rapid initial optimization and lower final validation loss, yielding approximately 1–2 point gains on CLUE and Pearson gains of about 0.02 on SimCLUE compared with any fixed rate (Zhao et al., 14 Oct 2025).
4. Integration with Tokenization, Attention, and Optimization
The curriculum's effectiveness depends on the underlying tokenization scheme. A 32k BPE vocabulary is used, achieving a compression of 1.41 characters per token at 8k context and enabling reliable word-span grouping for WWM. The schedule is employed during both Stage I (1024 tokens) and Stage II (8192 tokens) of pre-training. Lower mask rates in long-context phases facilitate dense local pattern acquisition, while the earlier high masking instills robust global context learning.
The masking curriculum is coordinated with a damped-cosine learning-rate schedule; both schedules peak near the end of warmup and decay afterward. Alternating local/global attention mechanisms and the rotary positional encoding (RoPE) stabilize optimization across the spectrum of mask ratios.
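The exact functional form of the damped-cosine learning-rate schedule is not given in the source; the sketch below is one plausible interpretation (the function name, damping factor, and floor value are all assumptions) that, like the mask curriculum, peaks at the end of warmup and decays afterward:

```python
import math

def damped_cosine_lr(s, S, S_w, peak=1e-3, floor=1e-5, damping=2.0):
    """Hypothetical damped-cosine LR schedule: linear warmup to `peak`
    by step S_w, then a cosine half-wave decay whose amplitude is
    further damped exponentially, reaching `floor` at step S."""
    if s <= S_w:
        return peak * s / S_w
    t = (s - S_w) / (S - S_w)            # progress through decay phase
    cosine = 0.5 * (1.0 + math.cos(math.pi * t))
    return floor + (peak - floor) * cosine * math.exp(-damping * t)
```

Under this reading, the LR and mask-ratio curves share their maximum at step $S_w$, so the heaviest corruption coincides with the largest update steps.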
5. Quantitative Impact on Downstream Tasks
The downstream effects are assessed on the CLUE benchmark and the SimCLUE STS evaluation. Under a unified fine-tuning protocol (3 epochs with a fixed peak learning rate), Chinese ModernBERT with dynamic WWM yields competitive or superior results across eight CLUE tasks. Selected results are summarized:
| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | WSC | CSL | OCNLI | C |
|---|---|---|---|---|---|---|---|---|
| Chinese ModernBERT | 73.87 | 56.90 | 60.15 | 83.96 | 52.10 | 86.20 | 79.10 | 82.65 |
| RoBERTa-wwm-large | 76.55 | 58.61 | 62.98 | 82.12 | 74.60 | 82.13 | 78.20 | 73.82 |
For semantic textual similarity (STS) on SimCLUE, fine-tuning with both SimCLUE and T2Ranking datasets (∼5M pairs) yields Pearson $r = 0.5050$ and Spearman $\rho = 0.5367$, surpassing Qwen-0.6B-embedding under the same evaluation.
| Model | Contrastive Pairs | $r$ (Pearson) | $\rho$ (Spearman) |
|---|---|---|---|
| ModernBERT+5M pairs | 5M | 0.5050 | 0.5367 |
| Qwen-0.6B-embed | 169M | 0.4965 | 0.5211 |
| jina-v2-base-zh | >800M | 0.5188 | 0.5501 |
| gte-multilingual | 2.94B | 0.5384 | 0.5730 |
Pearson’s $r$ and Spearman’s $\rho$ are the principal correlation metrics used.
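For reference, both metrics can be computed with a short pure-Python sketch (this simplified Spearman implementation ranks naively and omits tie correction):

```python
def pearson(x, y):
    """Pearson's r: covariance normalized by the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

The difference matters for STS: Spearman rewards any monotone relationship between predicted and gold similarity scores, while Pearson additionally requires linearity.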
6. Significance for Large-Scale Pre-Training
Whole-word masking with a dynamic curriculum—spanning a 30% to 15% masking schedule—effectively balances the dual objectives of rapid optimization and robust convergence. Its explicit integration with tokenizer design, context-length scaling, and adaptive attention yields consistent improvements in both Chinese language understanding and retrieval tasks. These results indicate that curriculum-aware masking strategies are a potent lever for advancing encoder-based pre-training, particularly in languages with complex morphology and tokenization phenomena (Zhao et al., 14 Oct 2025).