
Multilingual Curriculum Learning

Updated 21 November 2025
  • Multilingual Curriculum Learning is a training paradigm that schedules diverse linguistic data to enhance model generalization and transfer.
  • It leverages adaptive sampling, task sequencing, and difficulty-based ordering to address resource imbalances and typological diversity.
  • Empirical studies show improved zero-shot performance, faster convergence, and enhanced outcomes in tasks like MT and ASR.

Multilingual curriculum learning refers to a broad set of training paradigms in which the presentation of training signals—spanning multiple languages and, at times, modalities or scripts—is dynamically scheduled to optimize the generalization and transfer capacity of neural models. Rather than treating all languages, data sources, or example types as equally informative, these approaches leverage inductive biases, linguistic or cognitive theories, or model-centric feedback to arrange training sequences, assign sample weights, or interleave subtasks, thereby mitigating imbalances inherent in resource distributions and typological diversity.

1. Core Principles and Formal Objectives

The central tenet of multilingual curriculum learning is task sequencing: controlling either the inter-language sampling policy, the within-language ordering of examples based on difficulty, or both. Formally, the training objective in a multilingual, multi-task setting can be written as:

$$\min_\theta \; \mathbb{E}_{i \sim P_{\text{task}}} \; \ell_i(\theta)$$

where $\ell_i(\theta)$ is the per-task (usually per-language) loss and $P_{\text{task}}$ is the dynamic curriculum-induced sampling distribution. A canonical example is worst-case aware curriculum learning, which interpolates between minimizing the average loss across languages and focusing on the hardest (maximum) loss by stochastically choosing at each training step whether to update using the worst-performing language or via a loss-proportional probabilistic schedule (Lhoneux et al., 2022):

$$\ell_{\mathrm{curr}}(\theta) = \begin{cases} \max_i \, \ell_i(\theta) & \text{if } p < \phi \\ \ell_{i^*}(\theta), \text{ where } i^* \sim P_{\text{loss}}(i) & \text{otherwise} \end{cases}$$

with $P_{\text{loss}}(i) = \ell_i(\theta) \big/ \sum_j \ell_j(\theta)$ and $\phi \in [0,1]$ controlling the degree of worst-case targeting.
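The per-step language-selection rule above can be sketched as follows. This is a minimal illustration of the worst-case aware schedule, assuming per-language losses are available as a dictionary; the function name and interface are hypothetical, not taken from the cited paper.

```python
import random

def select_language(losses, phi, rng=random):
    """Pick the language to update on for one training step.

    losses: dict mapping language code -> current loss estimate
    phi:    in [0, 1]; higher values target the worst case more often
    """
    if rng.random() < phi:
        # Worst-case branch: train on the language with the highest loss.
        return max(losses, key=losses.get)
    # Otherwise sample a language with probability proportional to its loss.
    langs = list(losses)
    weights = [losses[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=1)[0]
```

With $\phi = 1$ this reduces to pure worst-case (minimax) training; with $\phi = 0$ it is purely loss-proportional sampling.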

Alternative frameworks include competence-based curricula, which schedule the introduction of new languages or tasks based on measures of model readiness—typically as a function of per-language validation loss relative to a single-language upper bound or to dynamically learned performance graphs (Zhang et al., 2021).
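A minimal sketch of competence-based sampling, assuming competence is defined as the ratio of the single-language upper-bound loss to the current multilingual validation loss (one plausible instantiation; the cited work's exact definition may differ), with sampling weights $v_i \propto 1/c_i$:

```python
def competence_weights(val_loss, single_lang_loss):
    """Per-language sampling weights inversely proportional to competence.

    val_loss:         current multilingual validation loss per language (> 0)
    single_lang_loss: single-language upper-bound loss per language (> 0)

    A language still far above its single-language ceiling gets a low
    competence c_i and therefore a high normalized sampling weight.
    """
    competence = {l: single_lang_loss[l] / val_loss[l] for l in val_loss}
    raw = {l: 1.0 / c for l, c in competence.items()}
    total = sum(raw.values())
    return {l: w / total for l, w in raw.items()}
```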

Curricula may be defined over:

  • Languages (as in multi-task NMT)
  • Data subsets/shards defined by difficulty metrics (length, rarity, alignment, script similarity)
  • Subtasks (e.g., LangID → POS tagging → LM → sentiment analysis (Dahiya et al., 2019))
  • Script clusters or code-switch conditions (Chandu et al., 2023, Yoo et al., 2024)

2. Scheduling Policies and Difficulty Criteria

Curriculum learning in the multilingual context is realized through several classes of scheduling algorithms:

  • Fixed or Search-Based Schedules: Grid or tree search over possible inter-language sampling ratios (e.g., optimizing the mix of high-resource and low-resource languages for NMT) (Kumar et al., 2021).
  • Automated, Model-Based Schedules:
    • Multi-Armed Bandits & Reinforcement Learning: Online policies trained to maximize improvements in proxy rewards (e.g., dev loss decrease), by selecting the next language or bin to sample (Allemann et al., 2024, Kumar et al., 2021). Deep Q-networks can generalize scheduling policies over the joint loss space.
    • Competence- and Curriculum-Aware Sampling: Dynamic adaptation of sampling weights $v_i \propto 1/c_i$, where $c_i$ is a model-derived competence for language $i$ (Zhang et al., 2021).
  • Difficulty or Quality-Based Sorting: Ordering individual training examples or data shards by metrics such as alignment classifier confidence (faithfulness to source data), sequence length, or word rarity in noisy cross-lingual data-to-text generation (Hari et al., 2024).
  • Script and Code-Switching Curricula: Staging input according to script similarity (Roman → mixed → target script), or by progressing from token-level code-switching to sentence-level code-switching to full monolingual corpora (Chandu et al., 2023, Yoo et al., 2024).
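The bandit-style schedulers above can be illustrated with a simple epsilon-greedy policy over languages, where the reward for picking a language is the resulting decrease in dev loss. This is a generic sketch of the idea, not the exact algorithm of any cited paper; the class name and interface are assumptions.

```python
import random
from collections import defaultdict

class BanditScheduler:
    """Epsilon-greedy bandit over languages; reward = dev-loss decrease."""

    def __init__(self, languages, epsilon=0.1, rng=random):
        self.langs = list(languages)
        self.eps = epsilon
        self.rng = rng
        self.value = defaultdict(float)  # running mean reward per language
        self.count = defaultdict(int)

    def choose(self):
        if self.rng.random() < self.eps:
            return self.rng.choice(self.langs)               # explore
        return max(self.langs, key=lambda l: self.value[l])  # exploit

    def update(self, lang, prev_dev_loss, new_dev_loss):
        reward = prev_dev_loss - new_dev_loss  # positive if dev loss fell
        self.count[lang] += 1
        # Incremental running-mean update of the language's value estimate.
        self.value[lang] += (reward - self.value[lang]) / self.count[lang]
```

Deep Q-network variants replace the per-language value table with a learned function over the joint loss state.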

Difficulty metrics can be model-centric (training dynamics: confidence, variability (Christopoulou et al., 2022)) or hand-crafted (POS complexity, linguistic or cognitive acquisition theories (Salhan et al., 2024)).
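A hand-crafted difficulty ordering of the kind listed above might combine sequence length with word rarity. The scoring formula below is illustrative only, assuming whitespace-tokenized input; real systems combine such hand-crafted scores with model-centric signals.

```python
import math
from collections import Counter

def curriculum_order(sentences):
    """Order sentences easy-to-hard by length plus mean word rarity.

    Rarity of a word is its negative log relative frequency in the corpus,
    so sentences with common words and few tokens come first.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())

    def difficulty(s):
        words = s.split()
        rarity = sum(-math.log(counts[w] / total) for w in words) / len(words)
        return len(words) + rarity

    return sorted(sentences, key=difficulty)
```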

3. Model Architectures and Curriculum Integration

Curriculum schedules are typically layered on top of standard multilingual architectures rather than requiring bespoke models: the studies summarized in Section 4 span mBERT/XLM-R encoders, Transformer NMT models, mT5, RNN-T ASR systems, and LLMs such as Gemma and Qwen.

4. Empirical Evaluation and Findings

Empirical studies across tasks and domains have consistently shown that multilingual curricula yield improved zero-shot transfer, faster convergence, and stronger low-resource performance.

A summary of key quantitative improvements:

| Setting | Task/Model | Curriculum Gain | Reference |
| --- | --- | --- | --- |
| Zero-shot dependency parsing | mBERT / XLM-R | +0.8–1.4 / +0.4–0.9 LAS | Lhoneux et al., 2022 |
| Data-to-text generation (noisy alignment) | mT5 | +4 BLEU, +15 pp faithfulness | Hari et al., 2024 |
| NMT (many-to-one; low-resource) | Transformer | +1.8 BLEU (LRLs) | Zhang et al., 2021 |
| Multilingual ASR | RNN-T | 12.5% relative WER | Sun et al., 2023 |
| LLM fusion, low-resource math QA | Gemma-2 + NLLB w/ DoRA | +12.9 pp (AfriMGSM) | Uemura et al., 2025 |
| LLM zero/few-shot language transfer | Qwen 2, Gemma 2 | +4.3–8.8 %p (HRL–LRL) | Yoo et al., 2024 |

Gains are largest in settings of typologically skewed or uneven language coverage, high label noise, or very limited resource conditions.

5. Theoretical and Cognitive Motivations

Several lines of research draw explicit parallels between machine curricula and human strategies for language acquisition:

  • Linguistic-acquisition-theory-inspired phased masking (e.g., GROWING/INWARDS/MMM curricula based on POS/SEM unit unlocking) improves syntactic generalization for small-scale cross-lingual models, with particular efficacy for languages typologically distinct from English (Salhan et al., 2024).
  • Code-switching curriculum learning (CSCL) is motivated by patterns in bilingual human learners, with token- and sentence-level code-switched stages facilitating cross-lingual representation alignment and mitigating catastrophic forgetting during monolingual specialization (Yoo et al., 2024).
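The staged CSCL progression described above (token-level mixing, then sentence-level mixing, then monolingual data) can be sketched as a simple stage schedule. The stage boundaries here are assumed for illustration, not taken from the cited paper.

```python
def cscl_stage(step, total_steps, boundaries=(0.33, 0.66)):
    """Return the code-switching curriculum stage for a training step.

    boundaries: fractions of training at which to advance from token-level
    code-switching to sentence-level, and then to monolingual data.
    """
    frac = step / total_steps
    if frac < boundaries[0]:
        return "token-level"
    if frac < boundaries[1]:
        return "sentence-level"
    return "monolingual"
```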

This suggests that fine-grained, language-specific curricular design—mirroring cognitive acquisition or sociolinguistic transfer effects—can outperform static or one-size-fits-all heuristics.

6. Practical Recommendations and Open Challenges

Best practices distilled from the literature include:

  • Use dynamic, model-centric signals (loss, competence, or training dynamics) to drive curriculum updates across languages or data shards, avoiding static schedules whenever possible (Lhoneux et al., 2022, Zhang et al., 2021, Christopoulou et al., 2022).
  • Schedule the introduction of low-resource or structurally distant languages only when the model demonstrates adequate competency in related HRLs (Zhang et al., 2021, Allemann et al., 2024).
  • For noisy or weakly-aligned data, prioritize curriculum based on explicit quality or faithfulness scores, not surface-level metrics (e.g., length or rarity) (Hari et al., 2024).
  • Where translation or annotation cost prohibits gold data, stage transfer by script or code-switching progression, leveraging code-mixed corpora to bootstrap models in new writing systems (Chandu et al., 2023, Yoo et al., 2024).
  • In model fusion architectures, introduce intermediate alignment or mapping stages to bridge non-English encoder states to monolingual LLM decoders (Uemura et al., 2025).
  • Periodically validate gains on both high- and low-resource subsets, as aggressive curriculum focusing on outliers may slightly degrade already-strong languages (Lhoneux et al., 2022, Zhang et al., 2021).
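The second recommendation above, gating the introduction of a low-resource language on competence in related high-resource languages, can be sketched as a simple threshold check. The relatedness map, thresholds, and function name are all hypothetical illustrations.

```python
# Assumed typological relatedness map (illustrative only):
# each low-resource language lists its related high-resource languages.
RELATED_HRLS = {"gsw": ["de"], "ast": ["es", "pt"]}

def ready_to_add(lrl, hrl_val_losses, thresholds):
    """Admit `lrl` to the training mix only once every related
    high-resource language's validation loss is below its (tuned)
    threshold."""
    return all(
        hrl_val_losses[h] <= thresholds[h] for h in RELATED_HRLS[lrl]
    )
```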

Open challenges remain in automating threshold tuning for competence-based curricula, scaling to hundreds of languages, robustly estimating per-language or per-example curriculum scores in highly heterogeneous data, and unifying curricula across modalities or supervised/unsupervised signals.

7. Notable Applications and Extensions

Multilingual curriculum learning has demonstrated utility in neural machine translation, multilingual ASR, cross-lingual data-to-text generation, zero-shot dependency parsing, and low-resource LLM adaptation via script and code-switching curricula.

These approaches, while varying in their formalization of difficulty, core scheduling algorithms, and target task, all leverage the core idea that not all data is equally useful at all stages, and that cross-lingual or cross-script transfer can be engineered via curriculum-informed training schedules.
