
XLM-R Large: Scalable Multilingual Transformer

Updated 20 February 2026
  • XLM-R Large is a transformer-based multilingual model with 24 layers and around 550 million parameters, designed for extensive cross-lingual representation.
  • It is pretrained on 2–2.5 TB of multilingual text using masked language modeling, enabling robust performance across high- and low-resource languages.
  • Fine-tuning strategies and balanced data sampling yield state-of-the-art results on tasks like XNLI, MLQA, and factuality detection, outperforming many baselines.

XLM-R Large (XLM-RoBERTa Large) is a transformer-based multilingual masked language model designed for high-capacity cross-lingual representation learning. Developed by Facebook AI Research, it is part of the XLM-R family and represents a significant advance in the scale and efficacy of unsupervised pretraining for over one hundred languages. XLM-R Large is distinguished by its parameter count, model depth, vocabulary size, and pretraining data volume, enabling state-of-the-art transfer and generalization in both low- and high-resource multilingual NLP tasks (Conneau et al., 2019). Fine-tuned variants such as XLM-RoBERTa-Large have established leading results in challenging evaluation settings, including competitive factuality detection in multilingual scientific text (Rathva et al., 23 Nov 2025).

1. Model Architecture

XLM-R Large employs a 24-layer transformer encoder with the following configuration:

  • Hidden size H = 1024
  • Feed-forward inner dimension H_ff = 4096
  • 16 attention heads per layer
  • SentencePiece vocabulary with 250,000 tokens
  • Maximum sequence length of 256 tokens in some adaptation scenarios
  • A total parameter count of approximately 550–560 million, depending on downstream head modifications

A classification head can be added for specific tasks—typically consisting of a dropout (probability 0.1) after the final [CLS] embedding, followed by a linear layer mapping to task logits (e.g., two logits for factual vs hallucinated classification) (Rathva et al., 23 Nov 2025).
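Such a head can be sketched in a few lines of NumPy. This is illustrative only: the random initialization, `rng`, and input vector below are toy stand-ins, not released model weights; only the hidden size, dropout probability, and logit count follow the configuration above.

```python
import numpy as np

HIDDEN = 1024    # XLM-R Large hidden size
N_CLASSES = 2    # e.g. factual vs. hallucinated
DROPOUT_P = 0.1  # dropout probability from the description above

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(HIDDEN, N_CLASSES))  # toy linear-head weights
b = np.zeros(N_CLASSES)

def classify(cls_embedding: np.ndarray, train: bool = False) -> np.ndarray:
    """Map the final [CLS] embedding to task logits (dropout, then linear)."""
    h = cls_embedding
    if train:  # inverted dropout, applied only during training
        mask = rng.random(HIDDEN) >= DROPOUT_P
        h = h * mask / (1.0 - DROPOUT_P)
    return h @ W + b

logits = classify(rng.normal(size=HIDDEN))
print(logits.shape)  # (2,)
```

At inference time dropout is disabled, so the head reduces to a single affine map from the 1024-dimensional [CLS] vector to the task logits.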

Variant            Layers  Hidden Size  FFN Dim  Attention Heads  Vocab Size  Params (M)
XLM-R Large        24      1024         4096     16               250k        ~550
XLM-R Base         12      768          3072     12               250k        ~270
mBERT (baseline)   12      768          3072     12               110k        ~172

Compared to mBERT, XLM-R Large doubles model depth, increases the hidden size by 33%, and more than doubles the vocabulary size (110k → 250k), resulting in roughly 3× the parameter budget (Conneau et al., 2019).
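The ≈550–560 M total can be sanity-checked from the table with a back-of-the-envelope count. This sketch ignores position and type embeddings and the MLM output head, which add a few million more parameters.

```python
# Rough parameter count for XLM-R Large using the table above.
V, H, FFN, L = 250_000, 1024, 4096, 24

embeddings = V * H                          # token embedding matrix
attention  = 4 * (H * H + H)                # Q, K, V, O projections + biases
ffn        = (H * FFN + FFN) + (FFN * H + H)
layernorms = 2 * (2 * H)                    # two LayerNorms per layer (scale + bias)
per_layer  = attention + ffn + layernorms

total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M parameters")     # 558M parameters
```

Note that the 250k-token embedding matrix alone accounts for roughly 256 M parameters, nearly half the model: the large shared vocabulary is a major driver of XLM-R's size relative to mBERT.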

2. Pretraining Data and Objectives

XLM-R Large is pretrained on filtered CommonCrawl data (CCNet or CC-100), comprising 2–2.5 terabytes of multilingual text. The pretraining corpus covers 100+ languages, from English (300.8 GiB) to low-resource languages such as Assamese (0.1 GiB). Documents are filtered to ensure at least 90% of tokens are in the target language and are deduplicated to reduce redundancy.
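The filtering step can be sketched as follows. `language_of` is a toy stand-in for a real language-identification model (the CC-Net pipeline uses fastText); the 90% token threshold and hash-based deduplication follow the description above.

```python
from hashlib import sha1

def language_of(token: str) -> str:
    """Toy stand-in for a real language-ID model: pretend ASCII-only
    tokens are English and everything else is another language."""
    return "en" if token.isascii() else "other"

def keep_document(doc: str, target: str = "en", threshold: float = 0.9) -> bool:
    """Keep a document only if >= 90% of its tokens are in the target language."""
    tokens = doc.split()
    if not tokens:
        return False
    in_target = sum(language_of(t) == target for t in tokens)
    return in_target / len(tokens) >= threshold

def deduplicate(docs):
    """Drop exact duplicates by content hash."""
    seen, unique = set(), []
    for doc in docs:
        h = sha1(doc.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

corpus = ["plain english text here",
          "plain english text here",
          "texte entièrement français ici"]
print(deduplicate([d for d in corpus if keep_document(d)]))
# → ['plain english text here']
```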

The model employs the standard masked language modeling (MLM) objective:

$$\mathcal{L}_{\mathrm{MLM}} = - \sum_{i \in M} \log P\left(x_i \mid \mathbf{x}_{\backslash M}; \theta\right)$$

where $M$ denotes the set of randomly selected (masked) positions in the input sequence $\mathbf{x}$ to be predicted (Conneau et al., 2019).
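A toy numerical check of this objective (the sequence length, vocabulary, logits, and mask positions below are made up):

```python
import numpy as np

def mlm_loss(log_probs: np.ndarray, targets: np.ndarray, masked: np.ndarray) -> float:
    """L_MLM = - sum over masked positions i of log P(x_i | x_\\M; theta).

    log_probs: (seq_len, vocab) per-position log-probabilities
    targets:   (seq_len,) true token ids
    masked:    indices of the masked positions M
    """
    return -float(log_probs[masked, targets[masked]].sum())

# Toy example: 4 positions, vocabulary of 5, two masked positions.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # log-softmax
targets = np.array([1, 3, 0, 2])
masked = np.array([1, 3])

print(mlm_loss(log_probs, targets, masked))
```

Only the masked positions contribute to the loss; a model that assigns probability 1 to every masked target achieves a loss of exactly zero.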

3. Empirical Performance and Cross-Lingual Generalization

XLM-R Large achieves state-of-the-art transfer results on a range of cross-lingual NLP tasks. Highlights include:

  • XNLI (Natural Language Inference, 15 languages):
    • Zero-shot (fine-tuned on English): 80.9% average accuracy, with English: 89.1%, French: 84.1%, German: 83.9%, Swahili: 73.8%.
    • Translate-train-all: 83.6% average accuracy.
  • MLQA (Extractive QA, zero-shot): 70.7 F1 (vs. mBERT 57.7), English F1: 80.6, Spanish: 74.1, Arabic: 63.1.
  • NER (CoNLL-02/03): Zero-shot average F1 = 80.94, monolingual average F1 = 90.24.
  • GLUE (English): Avg = 91.8 (comparable to RoBERTa Large 92.8, BERT Large 90.2).

Largest absolute transfer gains over mBERT are observed on low-resource languages: Swahili (+6.5% on XNLI), Urdu (+5.9%), and Arabic (+17.4 F1 on MLQA) (Conneau et al., 2019). The model matches or outperforms strong monolingual baselines in select high-resource languages.

In fine-tuned factuality detection (SHROOM-CAP shared task), XLM-RoBERTa-Large delivered competitive performance across nine languages, with zero-shot factuality F1 of 0.5107 (2nd place) in Gujarati and rankings of 4th–6th in other languages, demonstrating robust cross-lingual generalization (Rathva et al., 23 Nov 2025).

4. Training Strategies and Hyperparameters

Fine-tuning configurations for XLM-R Large typically involve:

  • Optimizer: AdamW
  • Learning rate: 2 × 10⁻⁵ with linear warmup over the first 10% of steps
  • Batch size: 32
  • Number of epochs: 3
  • Weight decay: 0.01
  • Weighted cross-entropy loss with customized class weights (e.g., [1.50, 1.00]) to correct for remaining class imbalance in factuality tasks
  • Checkpointing every 5,000 steps; best determined by validation F1 (Rathva et al., 23 Nov 2025)
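Under these settings, the learning-rate schedule can be sketched in plain Python. The linear decay to zero after warmup is a common convention and is an assumption here, as is the illustrative total step count.

```python
PEAK_LR = 2e-5                     # peak learning rate from the configuration above
TOTAL_STEPS = 10_000               # illustrative; depends on dataset and batch size
WARMUP_STEPS = TOTAL_STEPS // 10   # linear warmup over the first 10% of steps

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then (assumed) linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, remaining)

print(learning_rate(0), learning_rate(WARMUP_STEPS), learning_rate(TOTAL_STEPS))
# → 0.0 2e-05 0.0
```

The class weights (e.g. [1.50, 1.00]) enter separately, as per-class multipliers inside the cross-entropy loss rather than in the optimizer schedule.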

In multilingual hallucination detection, a training corpus of 124,821 samples—synthesized and balanced from five sources—allowed stable optimization and reduced model bias towards hallucination prediction. Evaluation utilized F1 and accuracy metrics, with factuality and fluency F1 reported per language (Rathva et al., 23 Nov 2025).

5. Scaling Analysis and Multilingual Trade-Offs

Scaling model capacity and data volume is critical for robust cross-lingual transfer. XLM-R Large outperforms capacity-constrained baselines, but the “curse of multilinguality” remains: as language coverage rises at fixed capacity, positive transfer to low-resource languages improves until interference degrades average accuracy.

Larger architectures mitigate, but do not fully eliminate, this interference. Sampling strategies that modestly upweight low-resource languages improve consistency. A large shared vocabulary enables better tokenization for rare scripts, and increased overall capacity allows higher effective coverage per language and task (Conneau et al., 2019).
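One standard way to implement this upweighting is exponentiated multinomial sampling over per-language corpus sizes, p_i ∝ n_i^α with α < 1 (Conneau et al. use α = 0.3 for XLM-R). A sketch, where the English and Assamese sizes come from the pretraining data described above and the Swahili size is invented for illustration:

```python
def sampling_probs(corpus_sizes: dict, alpha: float = 0.3) -> dict:
    """p_i proportional to n_i^alpha: alpha < 1 upweights low-resource languages."""
    weights = {lang: n ** alpha for lang, n in corpus_sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Corpus sizes in GiB (en and as from the pretraining data above; sw illustrative).
sizes = {"en": 300.8, "sw": 1.6, "as": 0.1}
probs = sampling_probs(sizes)
for lang, p in probs.items():
    raw = sizes[lang] / sum(sizes.values())
    print(f"{lang}: raw share {raw:.4f} -> sampled {p:.4f}")
```

With α = 0.3, Assamese's sampling probability rises by roughly two orders of magnitude over its raw corpus share, while English's share shrinks correspondingly, which is exactly the modest rebalancing the paragraph above describes.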

6. Practical Impact and Data-Centric Approaches

XLM-R Large’s architecture is sufficiently general to be adapted for a variety of multilingual tasks, from natural language inference to specialized scientific text evaluation. Empirical evidence demonstrates that data-centric strategies—specifically, balancing and unifying diverse multilingual datasets—yield significantly greater improvements in factuality detection performance than do architectural modifications. High-quality, balanced training data enable stable cross-lingual generalization even in zero-shot settings. Complex ensembles or translation-based augmentation do not surpass fine-tuned XLM-R Large on abundant, balanced multilingual corpora (Rathva et al., 23 Nov 2025).

A plausible implication is that in extreme multilingual settings, investments in data curation and sampling strategy may offer greater returns than further scaling of model depth or parameter count, particularly for factual consistency and hallucination detection in scientific LLM outputs.

7. Comparative Analysis and Recommendations

XLM-R Large achieves competitive or superior performance relative to both multilingual and monolingual baselines across multiple tasks and languages. On GLUE and XNLI, it closely matches or exceeds monolingual transformers such as RoBERTa Large and BERT Large. For extraction-based QA and NER, performance meets or surpasses specialized CRF-based and SQuAD-adapted models in several high-resource languages.

Scaling both model size and pretraining corpus is essential for robust performance in low-resource languages. Practitioners are advised to maintain proportionality between the number of covered languages and model capacity, exploit large SentencePiece vocabularies for rare scripts, and apply balanced data sampling to resist capacity dilution effects. Emerging areas include further scaling (beyond 24 layers), adaptive per-language mixture-of-experts architectures, and improved corpus sampling (Conneau et al., 2019).


The XLM-R Large model exemplifies a data- and capacity-driven paradigm for scalable multilingual representation learning. Its ability to generalize across typologically and script-diverse languages, without sacrificing per-language performance, underpins its widespread adoption for cross-lingual NLP research and deployment (Conneau et al., 2019, Rathva et al., 23 Nov 2025).
