- The paper introduces XLM-R, a scalable unsupervised model trained on CommonCrawl data across 100 languages, establishing new state-of-the-art results on various cross-lingual benchmarks.
- It employs architectural modifications such as SentencePiece tokenization and a shared 250k token vocabulary to enhance multilingual representation.
- The study explores trade-offs in vocabulary size, model capacity, and language sampling, revealing performance limitations due to cross-lingual interference.
Unsupervised Cross-lingual Representation Learning at Scale: XLM-R
Introduction
This paper conducts an exhaustive study of unsupervised cross-lingual representation learning at scale and proposes XLM-R, a Transformer-based multilingual masked language model trained on raw CommonCrawl data in 100 languages. The work addresses long-standing limitations in scaling multilingual representation models along two axes: the pretraining data distribution and model capacity. It systematically explores the trade-offs involved in increasing language coverage, vocabulary size, and training-set heterogeneity. The empirical results span cross-lingual understanding, question answering, and sequence labeling, with a focus on the XNLI and MLQA benchmarks, multilingual NER, and the GLUE benchmark for English.
Model Design and Corpus Construction
XLM-R largely retains the RoBERTa architecture and training recipe, switching from BPE to SentencePiece tokenization and adopting a significantly larger shared vocabulary (250k tokens) to make modeling 100 languages feasible. A central contribution is the construction of the CC-100 dataset: a massive collection filtered from CommonCrawl and designed to provide substantial coverage for both high- and low-resource languages. Extensive corpus statistics show orders-of-magnitude increases in available data for low-resource languages compared to previous Wikipedia-based datasets.
Two model scales are studied: XLM-R (550M parameters) and XLM-R_base (270M parameters); in both, the 250k-token vocabulary makes the embedding layers significantly larger than those of comparable monolingual models. Ablation studies examine the consequences of vocabulary size, language sampling strategies, and model capacity for multilingual transfer.
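A quick back-of-the-envelope calculation shows why the 250k-token vocabulary dominates the parameter budget. The hidden sizes (768 for XLM-R_base, 1024 for XLM-R) are from the paper; the percentage breakdown is a sketch, not an official accounting.

```python
# Back-of-the-envelope: token-embedding parameters for a 250k vocabulary.
VOCAB_SIZE = 250_000

def embedding_params(hidden_size: int) -> int:
    """Parameters in the token-embedding matrix alone."""
    return VOCAB_SIZE * hidden_size

base_emb = embedding_params(768)    # XLM-R_base: ~270M params total
large_emb = embedding_params(1024)  # XLM-R:      ~550M params total
print(f"XLM-R_base embeddings: {base_emb / 1e6:.0f}M (~{base_emb / 270e6:.0%} of total)")
print(f"XLM-R embeddings:      {large_emb / 1e6:.0f}M (~{large_emb / 550e6:.0%} of total)")
```

The embedding matrix alone accounts for well over half of XLM-R_base and nearly half of XLM-R, which is why vocabulary size appears as a first-class variable in the ablations.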
Empirical Results
XLM-R establishes new state-of-the-art results on XNLI, MLQA, and CoNLL-2002/2003 NER benchmarks, demonstrating superior average cross-lingual transfer performance over both monolingual and previous multilingual models. Notably, in the TRANSLATE-TRAIN-ALL setting on XNLI, XLM-R achieves an average accuracy of 83.6% over 15 languages—a substantial improvement relative to prior methods (arXiv:1911.02116).
On the MLQA zero-shot cross-lingual QA task, XLM-R's F1/EM scores lead on all seven tested languages, with improvements ranging from +6 to +12 points over mBERT and significant gains over XLM-15. For NER, XLM-R also outperforms previous approaches and mBERT, especially under cross-lingual transfer.
In the monolingual English setting (GLUE dev), XLM-R matches or closely tracks large monolingual models such as XLNet and RoBERTa, demonstrating competitive capacity despite its multilinguality. The experiments make a strong case that a single multilingual model pretrained without supervision can match or approach specialized monolingual models on large-scale English benchmarks.
Analysis of Scaling Laws and Transfer
The study systematically quantifies several phenomena:
- Curse of multilinguality: For a fixed parameter budget, adding languages initially improves performance—especially for low-resource languages—through positive transfer, but beyond a point per-language capacity is diluted, producing cross-lingual interference and performance degradation across the board.
- Role of capacity: Enhanced model capacity partially offsets the curse of multilinguality, but interference effects persist at extreme scales.
- Vocabulary size: Larger SentencePiece vocabularies mitigate rare-word fragmentation in low-resource languages but can induce learning inefficiencies if expanded beyond what the model's capacity supports.
- Language/data sampling: Adjusted batch-language sampling improves low-resource language performance without substantially degrading high-resource languages.
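The sampling strategy in the last bullet can be sketched as exponential smoothing of the empirical language distribution, p_i ∝ q_i^α, with α = 0.3 as reported in the paper. The sentence counts below are invented for illustration.

```python
import random

def sampling_probs(counts: dict[str, int], alpha: float = 0.3) -> dict[str, float]:
    """Exponentially smoothed language sampling: p_i ∝ (n_i / N) ** alpha.

    alpha < 1 upsamples low-resource languages relative to their raw share.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Invented corpus sizes (in sentences) for three languages.
counts = {"en": 1_000_000, "ur": 20_000, "sw": 5_000}
probs = sampling_probs(counts)

# Pick the language for the next training batch.
random.seed(0)
batch_lang = random.choices(list(probs), weights=list(probs.values()))[0]
```

With α = 0.3, the low-resource languages receive a far larger share of training batches than their raw corpus proportion, which is the mechanism behind the improved low-resource performance noted above.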
Practical and Theoretical Implications
The results strongly support the proposition that large-scale unsupervised pretraining over massive heterogeneous data and many languages yields models with robust emergent cross-lingual transfer. The findings challenge assumptions about the necessity of parallel data or aligned corpora for high-quality cross-lingual representations, positioning domain coverage and unsupervised scale as the dominant factors.
Practically, XLM-R's public release sets a new standard for accessible, high-quality cross-lingual models, closing the gap for low-resource languages in many NLP tasks. Theoretically, the work quantifies the limits of parameter sharing and documents the onset of interference, laying a foundation for further research into dynamic capacity allocation and adaptive tokenization. Results suggest unsupervised objectives scale remarkably well with data size and heterogeneity, but future work must address the efficient use of capacity and the mitigation of language interference, especially as model sizes become computationally prohibitive.
Conclusion
XLM-R substantiates the hypothesis that unsupervised multilingual pretraining, when scaled appropriately in both data and parameters, can produce models with unprecedented transfer learning performance across 100 languages. The release of CC-100 and the extensive empirical studies provide a new baseline for the field. However, the curse of multilinguality and interference trade-offs remain unsolved at extreme scale, motivating further work on scalable architectures and adaptive modeling strategies for truly universal cross-lingual representations.