
Multilingual Curriculum Learning

Updated 21 November 2025
  • Multilingual Curriculum Learning is a training paradigm that schedules diverse linguistic data to enhance model generalization and transfer.
  • It leverages adaptive sampling, task sequencing, and difficulty-based ordering to address resource imbalances and typological diversity.
  • Empirical studies show improved zero-shot performance, faster convergence, and enhanced outcomes in tasks like MT and ASR.

Multilingual curriculum learning refers to a broad set of training paradigms in which the presentation of training signals—spanning multiple languages and, at times, modalities or scripts—is dynamically scheduled to optimize the generalization and transfer capacity of neural models. Rather than treating all languages, data sources, or example types as equally informative, these approaches leverage inductive biases, linguistic or cognitive theories, or model-centric feedback to arrange training sequences, assign sample weights, or interleave subtasks, thereby mitigating imbalances inherent in resource distributions and typological diversity.

1. Core Principles and Formal Objectives

The central tenet of multilingual curriculum learning is task sequencing: controlling either the inter-language sampling policy, the within-language ordering of examples based on difficulty, or both. Formally, the training objective in a multilingual, multi-task setting can be written as:

$$\min_\theta \; \mathbb{E}_{i \sim P_{\text{task}}} \; \ell_i(\theta)$$

where $\ell_i(\theta)$ is the per-task (usually per-language) loss and $P_{\text{task}}$ is the dynamic curriculum-induced sampling distribution. A canonical example is worst-case aware curriculum learning, which interpolates between minimizing the average loss across languages and focusing on the hardest (maximum) loss by stochastically choosing at each training step whether to update using the worst-performing language or via a loss-proportional probabilistic schedule (Lhoneux et al., 2022):

$$\ell_{\mathrm{curr}}(\theta) = \begin{cases} \max_i \, \ell_i(\theta) & \text{if } p < \phi \\ \ell_{i^*}(\theta), \text{ where } i^* \sim P_{\text{loss}}(i) & \text{otherwise} \end{cases}$$

with $P_{\text{loss}}(i) = \ell_i(\theta) \big/ \sum_j \ell_j(\theta)$ and $\phi \in [0,1]$ controlling the degree of worst-case targeting.
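The per-step language-selection rule above can be sketched as follows. This is a minimal illustration of the worst-case aware schedule, assuming per-language losses are available as a dictionary; the function name and interface are hypothetical, not taken from the cited paper.

```python
import random

def select_language(losses, phi, rng=random):
    """Pick the language to update on for one training step.

    losses: dict mapping language code -> current loss estimate
    phi:    in [0, 1]; higher values target the worst case more often
    """
    if rng.random() < phi:
        # Worst-case branch: train on the language with the highest loss.
        return max(losses, key=losses.get)
    # Otherwise sample a language with probability proportional to its loss.
    langs = list(losses)
    weights = [losses[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=1)[0]
```

With $\phi = 1$ this reduces to pure worst-case (minimax) training; with $\phi = 0$ it is purely loss-proportional sampling.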

Alternative frameworks include competence-based curricula, which schedule the introduction of new languages or tasks based on measures of model readiness—typically as a function of per-language validation loss relative to a single-language upper bound or to dynamically learned performance graphs (Zhang et al., 2021).
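A minimal sketch of competence-based sampling, assuming competence is defined as the ratio of the single-language upper-bound loss to the current multilingual validation loss (one plausible instantiation; the cited work's exact definition may differ), with sampling weights $v_i \propto 1/c_i$:

```python
def competence_weights(val_loss, single_lang_loss):
    """Per-language sampling weights inversely proportional to competence.

    val_loss:         current multilingual validation loss per language (> 0)
    single_lang_loss: single-language upper-bound loss per language (> 0)

    A language still far above its single-language ceiling gets a low
    competence c_i and therefore a high normalized sampling weight.
    """
    competence = {l: single_lang_loss[l] / val_loss[l] for l in val_loss}
    raw = {l: 1.0 / c for l, c in competence.items()}
    total = sum(raw.values())
    return {l: w / total for l, w in raw.items()}
```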

Curricula may be defined over:

  • Languages (as in multi-task NMT)
  • Data subsets/shards defined by difficulty metrics (length, rarity, alignment, script similarity)
  • Subtasks (e.g., LangID → POS tagging → LM → sentiment analysis (Dahiya et al., 2019))
  • Script clusters or code-switch conditions (Chandu et al., 2023, Yoo et al., 2024)

2. Scheduling Policies and Difficulty Criteria

Curriculum learning in the multilingual context is realized through several classes of scheduling algorithms:

  • Fixed or Search-Based Schedules: Grid or tree search over possible inter-language sampling ratios (e.g., optimizing the mix of high-resource and low-resource languages for NMT) (Kumar et al., 2021).
  • Automated, Model-Based Schedules:
    • Multi-Armed Bandits & Reinforcement Learning: Online policies trained to maximize improvements in proxy rewards (e.g., dev loss decrease), by selecting the next language or bin to sample (Allemann et al., 2024, Kumar et al., 2021). Deep Q-networks can generalize scheduling policies over the joint loss space.
    • Competence- and Curriculum-Aware Sampling: Dynamic adaptation of sampling weights $v_i \propto 1/c_i$, where $c_i$ is a model-derived competence for language $i$ (Zhang et al., 2021).
  • Difficulty or Quality-Based Sorting: Ordering individual training examples or data shards by metrics such as alignment classifier confidence (faithfulness to source data), sequence length, or word rarity in noisy cross-lingual data-to-text generation (Hari et al., 2024).
  • Script and Code-Switching Curricula: Staging input according to script similarity (Roman → mixed → target script), or by progressing from token-level code-switching to sentence-level code-switching to full monolingual corpora (Chandu et al., 2023, Yoo et al., 2024).
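The bandit-style schedulers above can be illustrated with a simple epsilon-greedy policy over languages, where the reward for picking a language is the resulting decrease in dev loss. This is a generic sketch of the idea, not the exact algorithm of any cited paper; the class name and interface are assumptions.

```python
import random
from collections import defaultdict

class BanditScheduler:
    """Epsilon-greedy bandit over languages; reward = dev-loss decrease."""

    def __init__(self, languages, epsilon=0.1, rng=random):
        self.langs = list(languages)
        self.eps = epsilon
        self.rng = rng
        self.value = defaultdict(float)  # running mean reward per language
        self.count = defaultdict(int)

    def choose(self):
        if self.rng.random() < self.eps:
            return self.rng.choice(self.langs)               # explore
        return max(self.langs, key=lambda l: self.value[l])  # exploit

    def update(self, lang, prev_dev_loss, new_dev_loss):
        reward = prev_dev_loss - new_dev_loss  # positive if dev loss fell
        self.count[lang] += 1
        # Incremental running-mean update of the language's value estimate.
        self.value[lang] += (reward - self.value[lang]) / self.count[lang]
```

Deep Q-network variants replace the per-language value table with a learned function over the joint loss state.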

Difficulty metrics can be model-centric (training dynamics: confidence, variability (Christopoulou et al., 2022)) or hand-crafted (POS complexity, linguistic or cognitive acquisition theories (Salhan et al., 2024)).
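A hand-crafted difficulty ordering of the kind listed above might combine sequence length with word rarity. The scoring formula below is illustrative only, assuming whitespace-tokenized input; real systems combine such hand-crafted scores with model-centric signals.

```python
import math
from collections import Counter

def curriculum_order(sentences):
    """Order sentences easy-to-hard by length plus mean word rarity.

    Rarity of a word is its negative log relative frequency in the corpus,
    so sentences with common words and few tokens come first.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())

    def difficulty(s):
        words = s.split()
        rarity = sum(-math.log(counts[w] / total) for w in words) / len(words)
        return len(words) + rarity

    return sorted(sentences, key=difficulty)
```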

3. Model Architectures and Curriculum Integration

Curriculum schedules are typically layered on top of standard multilingual architectures rather than requiring bespoke models: the studies summarized in Section 4 span mBERT/XLM-R encoders, Transformer NMT models, mT5, RNN-T ASR systems, and LLMs such as Gemma and Qwen.

4. Empirical Evaluation and Findings

Empirical studies across tasks and domains have consistently shown that multilingual curricula yield improved zero-shot transfer, faster convergence, and stronger low-resource performance.

A summary of key quantitative improvements:

| Setting | Task/Model | Curriculum Gain | Reference |
| --- | --- | --- | --- |
| Zero-shot dependency parsing | mBERT / XLM-R | +0.8–1.4 / +0.4–0.9 LAS | Lhoneux et al., 2022 |
| Data-to-text generation (noisy alignment) | mT5 | +4 BLEU, +15 pp faithfulness | Hari et al., 2024 |
| NMT (many-to-one; low-resource) | Transformer | +1.8 BLEU (LRLs) | Zhang et al., 2021 |
| Multilingual ASR | RNN-T | 12.5% relative WER | Sun et al., 2023 |
| LLM fusion, low-resource math QA | Gemma-2 + NLLB w/ DoRA | +12.9 pp (AfriMGSM) | Uemura et al., 2025 |
| LLM zero/few-shot language transfer | Qwen 2, Gemma 2 | +4.3–8.8 %p (HRL–LRL) | Yoo et al., 2024 |

Gains are largest in settings of typologically skewed or uneven language coverage, high label noise, or very limited resource conditions.

5. Theoretical and Cognitive Motivations

Several lines of research draw explicit parallels between machine curricula and human strategies for language acquisition:

  • Linguistic-acquisition-theory-inspired phased masking (e.g., GROWING/INWARDS/MMM curricula based on POS/SEM unit unlocking) improves syntactic generalization for small-scale cross-lingual models, with particular efficacy for languages typologically distinct from English (Salhan et al., 2024).
  • Code-switching curriculum learning (CSCL) is motivated by patterns in bilingual human learners, with token- and sentence-level code-switched stages facilitating cross-lingual representation alignment and mitigating catastrophic forgetting during monolingual specialization (Yoo et al., 2024).
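The staged CSCL progression described above (token-level mixing, then sentence-level mixing, then monolingual data) can be sketched as a simple stage schedule. The stage boundaries here are assumed for illustration, not taken from the cited paper.

```python
def cscl_stage(step, total_steps, boundaries=(0.33, 0.66)):
    """Return the code-switching curriculum stage for a training step.

    boundaries: fractions of training at which to advance from token-level
    code-switching to sentence-level, and then to monolingual data.
    """
    frac = step / total_steps
    if frac < boundaries[0]:
        return "token-level"
    if frac < boundaries[1]:
        return "sentence-level"
    return "monolingual"
```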

This suggests that fine-grained, language-specific curricular design—mirroring cognitive acquisition or sociolinguistic transfer effects—can outperform static or one-size-fits-all heuristics.

6. Practical Recommendations and Open Challenges

Best practices distilled from the literature include:

  • Use dynamic, model-centric signals (loss, competence, or training dynamics) to drive curriculum updates across languages or data shards, avoiding static schedules whenever possible (Lhoneux et al., 2022, Zhang et al., 2021, Christopoulou et al., 2022).
  • Schedule the introduction of low-resource or structurally distant languages only when the model demonstrates adequate competency in related HRLs (Zhang et al., 2021, Allemann et al., 2024).
  • For noisy or weakly-aligned data, prioritize curriculum based on explicit quality or faithfulness scores, not surface-level metrics (e.g., length or rarity) (Hari et al., 2024).
  • Where translation or annotation cost prohibits gold data, stage transfer by script or code-switching progression, leveraging code-mixed corpora to bootstrap models in new writing systems (Chandu et al., 2023, Yoo et al., 2024).
  • In model fusion architectures, introduce intermediate alignment or mapping stages to bridge non-English encoder states to monolingual LLM decoders (Uemura et al., 2025).
  • Periodically validate gains on both high- and low-resource subsets, as aggressive curriculum focusing on outliers may slightly degrade already-strong languages (Lhoneux et al., 2022, Zhang et al., 2021).
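The second recommendation above, gating the introduction of a low-resource language on competence in related high-resource languages, can be sketched as a simple threshold check. The relatedness map, thresholds, and function name are all hypothetical illustrations.

```python
# Assumed typological relatedness map (illustrative only):
# each low-resource language lists its related high-resource languages.
RELATED_HRLS = {"gsw": ["de"], "ast": ["es", "pt"]}

def ready_to_add(lrl, hrl_val_losses, thresholds):
    """Admit `lrl` to the training mix only once every related
    high-resource language's validation loss is below its (tuned)
    threshold."""
    return all(
        hrl_val_losses[h] <= thresholds[h] for h in RELATED_HRLS[lrl]
    )
```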

Open challenges remain in automating threshold tuning for competence-based curricula, scaling to hundreds of languages, robustly estimating per-language or per-example curriculum scores in highly heterogeneous data, and unifying curricula across modalities or supervised/unsupervised signals.

7. Notable Applications and Extensions

Multilingual curriculum learning has demonstrated utility in neural machine translation, multilingual ASR, cross-lingual data-to-text generation, zero-shot dependency parsing, and low-resource LLM adaptation via script and code-switching curricula.

These approaches, while varying in their formalization of difficulty, core scheduling algorithms, and target task, all leverage the core idea that not all data is equally useful at all stages, and that cross-lingual or cross-script transfer can be engineered via curriculum-informed training schedules.
