
Multilingual mBART Model

Updated 13 January 2026
  • Multilingual mBART is a sequence-to-sequence Transformer pretrained via denoising to support tasks like translation, summarization, and speech recognition.
  • It features a 12-layer encoder and decoder with a shared 250K subword vocabulary across up to 50 languages, enabling effective cross-lingual transfer.
  • Fine-tuning and adapter architectures enhance its performance in low-resource settings and extend its support to new languages and data modalities.

The multilingual mBART model (multilingual Bidirectional and Auto-Regressive Transformer) is a sequence-to-sequence Transformer pretrained on large-scale monolingual corpora in many languages via a multilingual denoising objective. It supports a wide range of downstream tasks, including machine translation, cross-lingual transfer, natural language generation, document summarization, and even speech recognition, and is particularly relevant to low-resource and typologically diverse languages.

1. Core Architecture and Multilingual Pretraining

mBART is instantiated as a 12-layer encoder and 12-layer decoder Transformer, with model hidden dimension 1024, feed-forward size 4096, and 16 attention heads per layer. The vocabulary is a shared SentencePiece model spanning 250,000 subwords for up to 50 languages (mBART-50 variant), with special language-indicator tokens prepended to denote source and target languages (Liu et al., 2020, Tang et al., 2020, Chronopoulou et al., 2022).
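The stated dimensions imply a model of roughly 610M parameters, which a back-of-envelope count makes concrete. This is a rough sketch: layer norms and position embeddings are ignored, and the helper name is illustrative rather than taken from any implementation.

```python
# Back-of-envelope parameter count for the architecture described above:
# 12+12 layers, d_model=1024, FFN 4096, 16 heads, 250K shared vocabulary.

def transformer_layer_params(d_model: int, d_ffn: int, cross_attention: bool) -> int:
    """Approximate parameters in one Transformer layer (weights + biases),
    ignoring layer-norm parameters for simplicity."""
    attn = 4 * (d_model * d_model + d_model)       # Q, K, V, and output projections
    ffn = d_model * d_ffn + d_ffn + d_ffn * d_model + d_model
    total = attn + ffn
    if cross_attention:                            # decoder layers also attend to the encoder
        total += attn
    return total

d_model, d_ffn, vocab = 1024, 4096, 250_000
encoder = 12 * transformer_layer_params(d_model, d_ffn, cross_attention=False)
decoder = 12 * transformer_layer_params(d_model, d_ffn, cross_attention=True)
embeddings = vocab * d_model                       # shared input/output embedding table
total = encoder + decoder + embeddings
print(f"~{total / 1e6:.0f}M parameters")
```

Running this gives roughly 609M parameters, consistent with the commonly cited ~610M size of the mBART-large checkpoint; note that the 250K-entry embedding table alone accounts for over 40% of the model.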

During pretraining, mBART applies a denoising auto-encoding objective: the input sentence $x$ undergoes span masking (span lengths drawn with mean ≈3.5), token deletion, and sentence shuffling through a stochastic noising function $q_\epsilon(x)$. The model is trained to reconstruct the original sentence autoregressively by minimizing the loss $L(\theta) = -\mathbb{E}_{x\sim D}\sum_{t=1}^{|x|}\log p_\theta(x_t \mid x_{<t}, q_\epsilon(x))$. This denoising pretraining yields bidirectional source encoding and autoregressive target decoding, enabling robust cross-lingual transfer (Liu et al., 2020).
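The noising function can be sketched as follows. This is a toy illustration, not the fairseq implementation: the masking ratio and span-length distribution are simplified, and sentence permutation is omitted.

```python
import random

MASK = "<mask>"

def noise(tokens, mask_ratio=0.35, mean_span=3, rng=None):
    """Toy version of a span-masking noising function q(x): replace random
    spans (lengths averaging ~mean_span) with a single <mask> token, until
    roughly mask_ratio of the tokens have been masked."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    n_to_mask = int(len(tokens) * mask_ratio)
    masked, out, i = 0, [], 0
    while i < len(tokens):
        if masked < n_to_mask and rng.random() < 0.2:
            span = max(1, min(rng.randint(1, 2 * mean_span - 1), n_to_mask - masked))
            out.append(MASK)            # the whole span collapses to one mask token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Because whole spans collapse to a single `<mask>`, the noised sequence is shorter than the input, and the decoder must recover both the masked content and its length.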

Pretraining draws on the CommonCrawl CC-25 or CC-50 corpora. Language sampling is balanced by temperature-based rescaling, which upsamples low-resource languages and downsamples high-resource ones, yielding the mBART-25 and mBART-50 multilingual variants (Liu et al., 2020, Tang et al., 2020).
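Temperature-based rebalancing is commonly implemented by raising each language's empirical share to the power 1/T and renormalizing; the exact smoothing used for mBART differs in detail, so the sketch below (with made-up corpus sizes) is illustrative only.

```python
def sampling_probs(sizes, temperature=5.0):
    """Temperature-based corpus rebalancing: raise each language's empirical
    share to 1/T and renormalize. T > 1 upsamples low-resource languages and
    downsamples high-resource ones; T = 1 recovers proportional sampling."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# English dominates raw CommonCrawl; after smoothing the gap narrows sharply.
probs = sampling_probs({"en": 55_000, "hi": 1_700, "ne": 400}, temperature=5.0)
```

With T = 5, the English-to-Nepali sampling ratio shrinks from 137:1 to under 3:1 while preserving the relative ordering of corpus sizes.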

2. Multilingual Fine-tuning and Transfer Learning

Fine-tuning converts mBART into many-to-many or many-to-one machine translation systems by augmenting inputs with appropriate language tags and training on parallel bitext $S$ for all supported language pairs. The downstream loss is the standard token-level cross-entropy over the parallel data: $L_{MT} = -\sum_{(x, y)\in S}\sum_{t=1}^{|y|}\log p_\theta(y_t \mid y_{<t}, x)$
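Input assembly for fine-tuning follows mBART's tagging convention: the source-language tag is appended after the end-of-sentence token on the encoder side, and the target-language tag seeds the decoder. The helper below is a simplified sketch of that layout, using mBART-50-style tag names.

```python
def make_example(src_tokens, tgt_tokens, src_lang, tgt_lang, eos="</s>"):
    """Assemble one training pair in mBART's tagged format (simplified):
    the source tag follows the encoder input's EOS, the target tag is the
    decoder's first token, and the labels are the target shifted by one."""
    encoder_input = src_tokens + [eos, src_lang]   # source tag appended after EOS
    decoder_input = [tgt_lang] + tgt_tokens        # target tag starts generation
    labels = tgt_tokens + [eos]
    return encoder_input, decoder_input, labels

enc, dec, labels = make_example(["नमस्ते"], ["Hello"], "hi_IN", "en_XX")
```

At inference time the same target tag is what selects the output language, which is why a single checkpoint can serve all directions.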

Multilingual fine-tuning (ML-FT) outperforms both bilingual fine-tuning and multilingual models trained from scratch, particularly in into-English (N→1) and low-resource settings (+3.6 BLEU over BL-FT), with gains up to +7 BLEU for 4–10k bitext language pairs (Tang et al., 2020). The data efficiency plateaus at ~50,000 parallel sentences; clean, in-domain bitext outperforms large, noisy web-mined data (Lee et al., 2022). Parameter sharing enables substantial gains for typologically similar languages; distant and unseen languages (not covered during pretraining) remain challenging, with open-domain BLEU <3 unless additional monolingual or parallel data is supplied (Lee et al., 2022, Imamura et al., 2022).

3. Extending mBART to New Languages and Data Modalities

Vocabulary expansion to new scripts proceeds by updating the SentencePiece tokenizer: the original subword set is augmented with a set of new subwords for the target language, maintaining the pre-existing entries unchanged. Associated embedding tables in the encoder and decoder are extended by randomly initialized rows, ensuring OOV-free tokenization with moderate sequence lengths (≈2k–8k new subwords recommended) (Imamura et al., 2022). Language-specific tags are always added, preserving positional and language-tag priors.

This enables mBART’s fine-tuning and transfer to previously unsupported orthographies and languages. The recipe generalizes to any low-resource language with a unique script: extract substrings, run EM-based likelihood updates, extend vocab and embeddings, and fine-tune on available bitext (Imamura et al., 2022).
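The embedding-extension step of this recipe can be sketched with a plain array operation. This is a scaled-down illustration; the initialization scale is a common heuristic, not the exact procedure of Imamura et al.

```python
import numpy as np

def extend_embeddings(emb, n_new, rng=None):
    """Extend an embedding table with randomly initialized rows for newly
    added subwords, leaving every pre-existing row untouched (as in the
    vocabulary-expansion recipe described above)."""
    rng = rng or np.random.default_rng(0)
    old_vocab, dim = emb.shape
    new_rows = rng.normal(0.0, emb.std(), size=(n_new, dim))  # match existing scale
    return np.vstack([emb, new_rows])

# Scaled-down table standing in for the 250K x 1024 original.
old = np.random.default_rng(1).normal(size=(2_500, 32))
new = extend_embeddings(old, n_new=2_000)
```

Both encoder and decoder embedding tables (and the tied output projection) must be extended identically so that token IDs stay aligned across the model.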

4. Adapter Architectures and Parameter-Efficient Transfer

Recent developments introduce lightweight adapters atop mBART for low-resource transfer. Language-family adapters—a bottleneck two-layer feed-forward structure with residuals—are inserted after each feed-forward sublayer (and embedding layers when needed) in both encoder and decoder. The main backbone is frozen, ensuring parameter efficiency and reducing catastrophic forgetting (Chronopoulou et al., 2022).
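A minimal sketch of such a bottleneck adapter follows, with a zero-initialized up-projection so the adapter starts as an identity mapping. Biases and layer norm are omitted; this is a common adapter design rather than the exact published configuration.

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: project the hidden state down, apply a
    nonlinearity, project back up, and add a residual connection.
    Inserted after each feed-forward sublayer while the backbone stays frozen."""
    z = np.maximum(h @ W_down, 0.0)      # down-projection + ReLU
    return h + z @ W_up                  # up-projection + residual

d_model, bottleneck = 1024, 64
rng = np.random.default_rng(0)
W_down = rng.normal(0, 0.02, (d_model, bottleneck))
W_up = np.zeros((bottleneck, d_model))   # zero init => adapter is initially the identity
h = rng.normal(size=(5, d_model))
out = adapter(h, W_down, W_up)
```

Each adapter adds only 2 × d_model × bottleneck weights per insertion point, which is where the parameter efficiency comes from.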

BLEU gains average +1.0 over language-pair adapters and +2.7 over language-agnostic adapters for en→xx transfer, demonstrating robust cross-family transfer, especially for unseen languages. The optimal configuration balances specialization and sharing, with ~81M trainable parameters (≈12% of mBART-50), negligible inference overhead, and support for zero- or few-shot extension to unseen languages via additional embedding adapters (Chronopoulou et al., 2022).

5. Specialized Task Adaptations: Interactive MT, Lexical Normalization, ASR

In interactive machine translation (IMT), mBART is adapted for segment-based user feedback: validated (correct) segments from user interaction anchor the output and constrained decoding only regenerates the “gaps” between validated spans. mBART’s denoising objective and bidirectional encoder confer robustness to this segmented protocol, yielding SoTA translation quality and competitive user effort relative to from-scratch NMT models (Navarro et al., 2024).
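The segment-anchored protocol can be illustrated schematically: validated segments are kept verbatim, in order, and only the gaps between them are regenerated. The `generate` callable below is a stand-in for the model's constrained decoder, not part of any published interface.

```python
def fill_gaps(validated, generate):
    """Segment-based interactive MT sketch: keep user-validated segments
    fixed and only (re)generate the gaps between them. `generate(i)` returns
    the tokens filling the gap before the i-th validated segment."""
    out = []
    for i, seg in enumerate(validated):
        out.extend(generate(i))           # model fills the gap before segment i
        out.extend(seg)                   # validated segment is copied verbatim
    out.extend(generate(len(validated)))  # trailing gap after the last segment
    return out

hyp = fill_gaps([["the", "cat"], ["the", "mat"]],
                generate=lambda i: {0: [], 1: ["sat", "on"], 2: []}[i])
```

Constraining decoding to the gaps guarantees that user-validated text survives every regeneration round, which is the property the IMT protocol depends on.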

For lexical normalization, mBART is fine-tuned in monolingual or multilingual settings. Freezing the encoder during fine-tuning best preserves cross-lingual representations, yielding consistent downstream improvements in parsing and sentiment/offensive-language classification despite modest intrinsic error reduction rates (Bucur et al., 2021).

In multilingual speech recognition, mBART-50 decoders are paired with wav2vec 2.0 encoders. Per-language adapters or factorized weights enable efficient language adaptation; word error rate improves by an average of 44% relative to supervised baselines, with the largest improvements in the 10–100h training-data regime (Pham et al., 2022).

6. Zero-shot, Few-shot, and Cross-lingual Generation

Zero-shot cross-lingual transfer with mBART is enabled by the multilingual encoder–decoder backbone and language-indicator tokens. Performance correlates with pretraining data coverage, typological proximity, and data quality. For languages included in pretraining, as few as 10,000–25,000 parallel sentences suffice for reasonable BLEU scores (up to 41 on Hindi) (Lee et al., 2022). For unseen typologically distant languages, performance is limited without further monolingual adaptation.

ZmBART extends mBART for unsupervised cross-lingual NLG. An auxiliary auto-regressive denoising task on monolingual data for the target languages (e.g., En, Hi, Ja) aligns pretraining and fine-tuning objectives, enabling genuine zero-shot and few-shot transfer. Freezing the decoder and shared embeddings during English fine-tuning mitigates catastrophic forgetting, while data augmentation suppresses code-mixing. ZmBART achieves notable improvements over baseline pipelines and backward-compatible mBART variants (Maurya et al., 2021).

Code-Switching Restore (CSR) adds a stage of code-switched sentence restoration before finetuning, further narrowing cross-lingual embedding misalignment and improving BLEU and COMET for both bilingual and zero-shot translation, cross-lingual summarization, and classification (Zan et al., 2022).

7. Evaluation, Limitations, and Task-Specific Insights

mBART routinely achieves state-of-the-art BLEU, chrF2, SacreBLEU, ROUGE, and COMET scores on supervised and unsupervised MT, summarization, text-to-gloss translation, and sequence normalization. In Bangla text-to-gloss, mBART-50 fine-tuned on rule-augmented and synthetic gloss data achieves SacreBLEU=79.53%, compared to 3.95% for multilingual BERT. On PHOENIX-14T gloss translation, mBART-50 achieves BLEU-2=38.07 and COMET=0.624, surpassing recent transformer and RNN models (Abdullah et al., 3 Apr 2025).

For Nepali abstractive summarization, LoRA-adapted and quantized mBART models provide top ROUGE-L and human-selected fluency scores under both 4-bit and 8-bit quantized weights, outperforming mT5 (Dhakal et al., 2024).

However, zero-shot translation and adaptation to unseen languages remain challenging unless the language is covered in pretraining or the tokenizer and embeddings are extended appropriately. For typologically distant languages not seen in pretraining, even 100k parallel sentences yield BLEU <3.0 (Lee et al., 2022). Fine-tuning monolingually is preferable to joint multilingual training for normalization tasks (Bucur et al., 2021). Parameter-efficient methods and cross-family adapters address negative transfer, but cannot wholly mitigate absence of high-quality data (Chronopoulou et al., 2022, Tonja et al., 2023).

Summary Table: mBART Key Features and Outcomes

Dimension | Details/Results | Source
Architecture | 12-layer encoder + 12-layer decoder, d_model 1024, FFN 4096, 16 heads, 250K SentencePiece vocab, language-tag tokens | (Liu et al., 2020)
Pretraining objective | Multilingual denoising: span masking, sentence permutation; reconstruct x from q_ε(x) | (Liu et al., 2020)
Fine-tuning | ML-FT (many-to-many), ML-SC (multilingual from scratch), adapters (LANG-FAM, LANG-PAIR, LANG-AGN) | (Chronopoulou et al., 2022)
Extension/adapters | Add embedding rows for new scripts, freeze backbone, bottleneck adapters per language family | (Imamura et al., 2022; Chronopoulou et al., 2022)
Zero-shot transfer | Genuine zero-shot via decoder freezing and auxiliary denoising task (ZmBART) | (Maurya et al., 2021)
Evaluation | BLEU, SacreBLEU, ROUGE, COMET, chrF2, WSR/KSR/MAR, human quality scores | (Navarro et al., 2024; Abdullah et al., 3 Apr 2025; Dhakal et al., 2024; Tonja et al., 2023)
Task domains | MT, IMT, NLG, text-to-gloss, summarization, normalization, speech recognition | Multiple
Low-resource impact | +3–7 BLEU for ML-FT/adapter methods; 4×–10× data efficiency over from-scratch training | (Chronopoulou et al., 2022; Tang et al., 2020; Lee et al., 2022)

The multilingual mBART model demonstrates that large-scale denoising pretraining, combined with extensible tokenization, parameter-efficient adaptation, and multilingual fine-tuning, provides a flexible and empirically robust platform for cross-lingual generation and understanding. Its strengths are most pronounced in low-resource and typologically diverse settings when coupled with careful extension and adaptation protocols (Liu et al., 2020, Chronopoulou et al., 2022, Imamura et al., 2022, Tang et al., 2020, Maurya et al., 2021, Navarro et al., 2024, Abdullah et al., 3 Apr 2025, Dhakal et al., 2024, Zan et al., 2022, Bucur et al., 2021, Lee et al., 2022).
