Multilingual Pretraining
- Multilingual pretraining is a foundational approach in NLP that uses objectives like masked language modeling and machine translation to build cross-lingual representations.
- It employs encoder-only, decoder-only, and encoder-decoder architectures trained on a mix of monolingual, parallel, and synthetic data to enhance language generalization.
- Empirical studies show significant gains in tasks such as machine translation, cross-lingual retrieval, and low-resource speech recognition through balanced and aligned multilingual training.
Multilingual pretraining is a foundational approach within NLP and speech processing that enables neural models to acquire representations and capacities spanning multiple languages through unsupervised, supervised, or hybrid objectives. This paradigm has underpinned significant advancements in machine translation, zero-shot transfer, cross-lingual retrieval, multilingual understanding, and low-resource speech recognition. The following sections delineate key methodologies, empirical findings, and best practices for multilingual pretraining across text, speech, and multimodal domains, based strictly on empirical results and methodologies reported in primary research.
1. Core Objectives and Model Architectures
The central objective of multilingual pretraining is to imbue models with both monolingual and cross-lingual generalization ability by training on diverse linguistic corpora. Model architectures fall into encoder-only (e.g., BERT, XLM-R), decoder-only (e.g., GPT-style), and encoder-decoder (e.g., mBART, mT5, DOCmT5, TRIP) classes. The predominant pretraining objectives are:
- Masked Language Modeling (MLM): Mask a fraction of input tokens and train the model to reconstruct them, minimizing the summed negative log-likelihood over the masked positions, $\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$.
- Denoising Autoencoding: Corrupt inputs via span masking and sentence permutation, then reconstruct, as in mBART (Tang et al., 2020) and mT5.
- Multilingual Machine Translation (MT): Sequence-to-sequence architectures are trained to translate between language pairs, using cross-entropy loss over parallel data (Li et al., 2024).
- Hybrid and Multitask Objectives: Recent works interpolate MLM, denoising, MT, dictionary denoising, and bitext translation (Reid et al., 2021), or exploit explicit alignment and code-switching.
Empirical studies establishing best practices (Li et al., 2024) find that architectural type dictates the optimal pretraining loss: encoder-decoder models derive large gains from genuine MT objectives, while encoder-only models favor MLM for downstream tasks like classification and tagging.
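The MLM objective above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: `mask_tokens` corrupts a sequence, and `mlm_loss` sums negative log-likelihoods over only the masked positions; the uniform "model" in the usage example is a stand-in for a real network.

```python
import math
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """Randomly corrupt a sequence: each token is replaced by [MASK] with
    probability mask_prob; returns the corrupted sequence plus the
    position -> original-token targets to reconstruct."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = mask_token
            targets[i] = tok
    return corrupted, targets

def mlm_loss(predict_proba, corrupted, targets):
    """Sum of negative log-likelihoods over the masked positions only."""
    return -sum(math.log(predict_proba(corrupted, i)[tok])
                for i, tok in targets.items())

# Toy usage: a uniform "model" over a 5-word vocabulary, so the loss is
# exactly len(targets) * log(5).
vocab = {"the", "cat", "sat", "on", "mat"}
uniform = lambda seq, i: {t: 1.0 / len(vocab) for t in vocab}
corrupted, targets = mask_tokens("the cat sat on the mat".split(),
                                 mask_prob=0.5, rng=random.Random(0))
loss = mlm_loss(uniform, corrupted, targets)
```

In production systems the 15% masking rate is often split further (e.g., BERT's 80/10/10 mask/random/keep scheme), a detail omitted here for brevity.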
2. Data Regimes: Monolingual, Parallel, Synthetic, and Script Diversity
Monolingual and Synthetic Data
Pretraining often combines large-scale monolingual corpora (e.g., CC100, mC4) for each language, potentially augmented with synthetically generated data—either via machine translation of high-quality English, as in TransWebEdu (Wang et al., 18 Feb 2025, Wang et al., 2024), or by code-mixing and code-switch augmentation (Ni et al., 2020).
Parallel and Trilingual Data
Integration of parallel sentence- and document-level bitext yields marked improvements for translation and cross-lingual NLI (Reid et al., 2021, Lu et al., 2022). TRIP demonstrates that leveraging document-triple corpora with grafting-based trilingual mixing further surpasses bilingual modeling in document-level MT and cross-lingual summarization (Lu et al., 2022).
Script and Typological Coverage
Inclusion of diverse scripts (Latin, Cyrillic, Devanagari, Han, etc.) and low-resource languages is critical for real-world generalization (Liu et al., 2024, Kesen et al., 27 May 2025). Smoothed temperature-based sampling of languages and scripts during pretraining (up-weighting rare languages via a sampling temperature) mitigates capacity dilution and catastrophic forgetting.
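Temperature-smoothed language sampling can be sketched as follows; the exact exponent (often written $p_l^{\alpha}$ with $\alpha < 1$, or equivalently $p_l^{1/T}$ with $T > 1$) varies by paper, and the function name here is illustrative:

```python
def temperature_sampling(token_counts, temperature=5.0):
    """Exponentiate-and-renormalize the empirical language distribution:
    q_l proportional to p_l^(1/T). T > 1 flattens the distribution,
    up-weighting low-resource languages; T = 1 recovers sampling
    proportional to corpus size."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** (1.0 / temperature)
               for lang, n in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# A 100:1 corpus-size imbalance is softened, but ordering is preserved.
probs = temperature_sampling({"en": 1_000_000, "sw": 10_000}, temperature=5.0)
```

Too high a temperature over-samples rare languages and can degrade high-resource performance, which is why the smoothing exponent is treated as a tuned hyperparameter.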
3. Enlarging Language Coverage: Extension, Alignment, and Adaptation
Scaling models to new languages, subwords, or scripts is typically achieved via one of several strategies:
- Vocabulary and Embedding Extension: Add new language/special-ID tokens, randomly initialize their rows, and continue MLM training with mixed data (Tang et al., 2020, Liu et al., 2023).
- Embedding Alignment: Explicitly align lexical and subword embeddings using external multilingual vectors, leveraging bilingual dictionaries or cross-lingual word vectors. The OFA framework interpolates new embeddings via cosine similarity to existing subwords and factorizes the embedding matrix for parameter efficiency, accelerating convergence and reducing CO₂ impact (Liu et al., 2023). Alignment losses (e.g., ALIGN-MLM) directly maximize cosine similarity for translation pairs, outperforming MLM/TLM especially on sequence labeling and cross-script transfer (Tang et al., 2022).
- Transfer and Adaptation: "RAMEN" initializes a strong English PLM, then aligns target-language embeddings either through parallel data or monolingual fastText vectors, followed by dual-language MLM to prevent catastrophic forgetting (Tran, 2020).
- Continued Pretraining: Further training a foundation model on new monolingual or bilingual corpora ("continued pretraining") helps adaptation, but can cause catastrophic interference if not balanced across languages (Gao et al., 2024).
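The similarity-based embedding initialization described above (OFA-style) can be illustrated with a minimal sketch. All names and toy vectors here are hypothetical; the real framework additionally factorizes the embedding matrix, which is omitted:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def init_new_embedding(new_external_vec, external_vecs, model_embeddings, top_k=3):
    """Initialize a new subword's model-space embedding as a
    similarity-weighted average of the model embeddings of its nearest
    neighbours in an external multilingual vector space."""
    neighbours = sorted(external_vecs,
                        key=lambda tok: cosine(new_external_vec, external_vecs[tok]),
                        reverse=True)[:top_k]
    weights = [max(cosine(new_external_vec, external_vecs[tok]), 0.0)
               for tok in neighbours]
    z = sum(weights) or 1.0
    dim = len(next(iter(model_embeddings.values())))
    out = [0.0] * dim
    for tok, w in zip(neighbours, weights):
        for j in range(dim):
            out[j] += (w / z) * model_embeddings[tok][j]
    return out
```

Compared with random initialization of new rows, this warm start gives the new tokens immediately meaningful neighbours in model space, which is what accelerates convergence during continued pretraining.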
4. Evaluation Protocols and Empirical Trends
Benchmarks
Standard evaluation employs multilingual versions of NLI (XNLI), POS tagging (Universal Dependencies), dependency parsing, named entity recognition (WikiANN), sentence retrieval (Tatoeba, Bible), and topic or commonsense reasoning (e.g., ARC, PAWS-X, HellaSwag). For MT, BLEU (typically computed with sacreBLEU) is standard; for summarization, ROUGE.
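In practice one should report BLEU via the sacreBLEU toolkit for comparability; as a self-contained illustration of what the metric computes, here is a toy corpus-level BLEU (uniform 4-gram weights, single reference per hypothesis, whitespace tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Geometric mean of modified n-gram precisions (n = 1..max_n),
    accumulated over the corpus, times a brevity penalty."""
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:          # some n-gram order has no matches
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

This sketch deliberately omits sacreBLEU's standardized tokenization and smoothing options, which are exactly the details that make scores comparable across papers.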
Empirical Outcomes
- Translation/Seq2Seq: Multilingual pretraining via mBART and multilingual finetuning attains +9.8 BLEU over bilingual from-scratch baselines in many-to-English ML50 translation, with even larger gains (up to +15 BLEU) for extremely low-resource languages (Tang et al., 2020). Trilingual objectives and authentic document-level triples further lift document-level MT performance (Lu et al., 2022).
- Classification/NER/POS: Encoder-only models pretrained with MLM outperform causal LMs and translation LMs on fine-tuned sequence labeling; causal LMs are favorable for probing tasks (Li et al., 2024).
- Speech: IPA-guided multilingual pretraining with HuBERT achieves stronger low-resource ASR and reduces data needs by up to 75% versus unsupervised HuBERT or XLSR-53 (Feng et al., 2023).
- Multimodal: M³P fuses 100+ language monolingual MLM, image-text contrastive, and code-switching objectives, setting state-of-the-art cross-lingual vision-language retrieval in non-English (Ni et al., 2020).
A consistent pattern is that mixed, balanced multilingual pretraining yields superior cross-lingual knowledge alignment across performance and output consistency (CLiKA-RA, en-CO), whereas continued monolingual pretraining improves only the target at the expense of others (Gao et al., 2024). However, both strategies leave deep knowledge conductivity essentially unresolved (CLiKA-XRR ≈ 0).
5. Special Strategies: Data Quality, Script/Lang Embeddings, and Clustered Training
Pretraining Data Quality
Quality filtering of web-scale corpora using distilled LLM-based annotators (JQL) substantially boosts both token retention and downstream performance (+6–7% normalized gains) over heuristically filtered sources (Ali et al., 28 May 2025). Best practices involve percentile-based ensemble thresholds and robust multilingual embedding backbones.
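Percentile-based ensemble thresholding can be sketched as follows. The function names and the toy length-based annotator are hypothetical stand-ins for the learned LLM-distilled annotators used in practice:

```python
def percentile_threshold(scores, keep_fraction):
    """Score threshold such that roughly keep_fraction of documents pass."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[k - 1]

def filter_corpus(docs, annotators, keep_fraction=0.1):
    """Ensemble of quality annotators: average their scores per document,
    then keep the top keep_fraction by a percentile threshold."""
    scored = [(sum(a(d) for a in annotators) / len(annotators), d) for d in docs]
    thr = percentile_threshold([s for s, _ in scored], keep_fraction)
    return [d for s, d in scored if s >= thr]

# Toy usage: a single "annotator" that scores documents by length.
kept = filter_corpus(["a", "bb", "ccc", "dddd"], [len], keep_fraction=0.5)
```

Using a per-corpus percentile rather than a fixed score cutoff keeps the retained fraction stable across languages whose raw score distributions differ.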
Language/Script Embedding Architectures
LangSAMP demonstrates that re-introducing language and script embeddings—added only at the MLM decoding head—improves representation neutrality and enables efficient donor language selection for zero-shot transfer. The learned embeddings recover typological structure in PCA space and correlate with transfer efficiency (Liu et al., 2024).
Clustering-Based Training
Clustering languages using representation "sprachbund" vectors (average [CLS] embeddings from a strong mPLM) aligns pretraining corpora to genealogical, typological, and areal groupings. Separate MLM pretraining for each sprachbund outperforms a single model trained on all, especially for low-resource and isolated languages (Fan et al., 2021).
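The sprachbund grouping step amounts to ordinary clustering over one vector per language (e.g., its average [CLS] embedding). A minimal k-means sketch, assuming the per-language vectors have already been extracted from an mPLM:

```python
import random

def kmeans(vectors, k, iters=20, rng=None):
    """Tiny k-means over a {name: vector} dict; returns a cluster id per
    name. Not production-grade: fixed iteration count, random init."""
    rng = rng or random.Random(0)
    keys = list(vectors)
    cents = [list(vectors[key]) for key in rng.sample(keys, k)]
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for key in keys:
            assign[key] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(vectors[key], cents[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[key] for key in keys if assign[key] == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign
```

Each resulting cluster then gets its own MLM pretraining run, trading one giant shared model for several smaller, typologically coherent ones.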
6. Recent Directions and Outstanding Challenges
Synthetic and Machine-Translated Data
Machine translation of high-quality English corpora (e.g., FineWeb-Edu) into multiple languages provides an efficient, scalable alternative to language-specific web crawling. Models pretrained on purely synthetic bitext (TransWebEdu, TransWebLLM) can match or exceed closed-data models on cross-lingual reasoning and understanding, even in low-resource settings, with as little as 6% of the training tokens (Wang et al., 18 Feb 2025, Wang et al., 2024).
Alignment and Beyond
Embedding-level alignment (ALIGN-MLM), translation pair discrimination (TPP), dictionary injection, and bitext denoising are all confirmed to significantly reduce the cross-lingual "transfer gap," especially for script-divergent or syntactically divergent languages (Tang et al., 2022, Mishra et al., 2021). Explicit alignment objectives outperform both MLM and MLM+parallel approaches in systematic transfer settings.
Multilingual Pixel Models
Multilingual pretraining in pixel-based (vision-based) LLMs, using only rendered text images, enables strong, script-agnostic transfer. Injecting just four scripts with balanced data suffices to shift semantic space alignment and improve downstream syntactic, semantic, and morphological probing even in unseen scripts (Kesen et al., 27 May 2025).
Pretraining Dynamics
Time-resolved probing of training dynamics in XLM-R reveals that monolingual syntax is acquired early (≈1.3% of updates), semantics later, and cross-lingual transfer later still, and is highly asymmetric and language-pair-dependent. Late in training, linguistic knowledge propagates to lower layers, while final-layer representations can degrade ("forget") (Blevins et al., 2022).
7. Best Practices and Recommendations
- Use mixed, balanced multilingual data for broad transfer; avoid target-language-only continued pretraining unless domain specificity outweighs cross-lingual needs (Gao et al., 2024).
- Choose pretraining objectives and model architectures in tandem: encoder-decoder models benefit most from explicit MT objectives (Li et al., 2024).
- Explicitly inject alignment (bilingual dictionaries, cross-lingual word vectors, alignment objectives) in vocabulary expansion and adaptation (Liu et al., 2023, Tang et al., 2022).
- Employ data-driven or typologically-aware clustering (sprachbunds) for scalable multilingual pretraining with reduced negative transfer (Fan et al., 2021).
- Invest in multilingual pretraining data filtering via learned LLM-based annotators for web data quality (Ali et al., 28 May 2025).
- For speech, leverage language-universal phonetic (IPA) pseudo-labels and/or continued self-supervised pretraining on even modest target data (Feng et al., 2023, Nowakowski et al., 2023).
- In low-supervision or zero-shot scenarios, supplement with synthetic machine-translated data, avoiding catastrophic imbalance by proper sampling.
- Evaluate cross-lingual knowledge alignment at multiple levels (performance, consistency, conductivity); current architectures and objectives achieve only shallow alignment (high RA and en-CO, but XRR near zero) (Gao et al., 2024).
Multilingual pretraining thus encompasses a suite of specialized methodologies for learning robust, transferable representations spanning language, script, and—in multimodal settings—modality boundaries. Ongoing challenges include achieving deep cross-lingual knowledge synthesis, scaling to typologically rare or resource-constrained languages, and ensuring parameter- and carbon-efficiency at scale.