The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining

Published 1 Jan 2026 in cs.CL (arXiv:2601.00364v1)

Abstract: Multilingual LLMs achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.

Summary

  • The paper demonstrates that a minute fraction of parallel bilingual documents is critical, as its inclusion restores up to 91% of BLEU performance lost without such data.
  • Experimental results reveal that code-switched documents yield only marginal BLEU score improvements, while monolingual pretraining preserves robust sentence-level semantic alignment.
  • The findings imply that curating high-quality parallel corpora is more impactful than abundant code-switched content, guiding resource allocation in multilingual LLM pretraining.

The Role of Mixed-Language Documents in Multilingual LLM Pretraining

Motivation and Problem Statement

The multilingual capabilities of LLMs pretrained on large-scale web corpora have been widely demonstrated, yet the specific mechanisms and data requirements underlying cross-lingual transfer remain inadequately disentangled. In particular, the community has typically assumed that the presence of mixed-language documents—spanning parallel translations, code-switching, and miscellaneous bilingual text—uniformly supports cross-lingual competence, especially for transfer and machine translation. However, the actual contributions of these different types of bilingual exposure during pretraining have not been systematically quantified.

This work provides a granular analysis of the role of mixed-language documents by constructing both a monolingual corpus (MonoWeb) with all multilingual documents filtered out and controlled bilingual reintroductions into this baseline. The pretraining and evaluation protocol isolates the impact of parallel and code-switched documents on machine translation, cross-lingual QA, and multilingual understanding/reasoning tasks.

Pretraining Corpus Construction and Categorization of Bilingual Data

The authors build their corpus by sampling 60B tokens per language from FineWeb-Edu (English) and FineWeb2 (German, Spanish, French), resulting in a 240B-token multilingual corpus. Bilingual documents are identified through a two-stage process: entropy-based rule filtering across sentence-level fastText language predictions, followed by a Llama-3.3-70B-Instruct classifier for semantic categorization.
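The first stage of this detection pipeline can be sketched as an entropy score over per-sentence language labels: a monolingual document scores 0, while mixed documents score higher. The function names, the 0.5-bit threshold, and the input format below are illustrative assumptions, not the paper's exact procedure (which uses fastText predictions upstream of this step).

```python
import math
from collections import Counter

def language_entropy(sentence_langs):
    """Shannon entropy (in bits) of the language distribution over a
    document's sentences. A purely monolingual document has entropy 0;
    the more evenly a document mixes languages, the higher the score."""
    counts = Counter(sentence_langs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_bilingual(sentence_langs, threshold=0.5):
    """Flag a document whose sentence-level language labels are mixed.

    `sentence_langs` would come from a language identifier such as
    fastText; `threshold` is an illustrative value, not the paper's."""
    return language_entropy(sentence_langs) >= threshold
```

Documents flagged this way would then go to the LLM classifier for semantic categorization into parallel, code-switching, or miscellaneous.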

Bilingual documents are subdivided into:

  • Parallel: Segment-level alignment across languages (typically translations), ~14% of the bilingual subset.
  • Code-switching: Natural alternation of languages within discourse, ~72%.
  • Miscellaneous: Boilerplate or noise without functional semantic alignment, ~14%.

Notably, bilingual documents account for just 2% of the corpus, and most are user-generated, code-switched content, while parallel data are concentrated in academic, dictionary, and professional sources.

Experimental Design

Multilingual LLMs (1.35B parameters, Llama-2 tokenizer) are pretrained from scratch under four conditions for each language pair (en-fr, en-de, en-es):

  1. FineWeb: Complete corpus including all bilingual documents.
  2. MonoWeb: All bilingual documents removed.
  3. MonoWeb+Parallel: Only parallel documents reintroduced.
  4. MonoWeb+CodeSwitch: Only code-switched documents reintroduced.
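Conceptually, the four conditions amount to filtering one pool of category-labeled documents. A minimal sketch, where the field name and condition strings are assumptions for illustration rather than the released pipeline:

```python
def build_corpus(documents, condition):
    """Assemble a pretraining corpus variant from labeled documents.

    Each document is a dict with a "category" field in
    {"mono", "parallel", "code_switch", "misc"}; the condition names
    mirror the paper's four ablation settings."""
    keep = {
        "FineWeb":            {"mono", "parallel", "code_switch", "misc"},
        "MonoWeb":            {"mono"},
        "MonoWeb+Parallel":   {"mono", "parallel"},
        "MonoWeb+CodeSwitch": {"mono", "code_switch"},
    }[condition]
    return [d for d in documents if d["category"] in keep]
```

Holding everything else fixed and varying only which bilingual categories survive filtering is what lets the ablations attribute downstream differences to a specific document type.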

A downstream suite covers: machine translation (WMT, FLORES), cross-lingual QA (XQuAD, MLQA), and general reasoning (XNLI, PAWS-X, HellaSwag, ARC, TruthfulQA, XStoryCloze, XWinograd).

Main Empirical Findings

Machine Translation: Critical Sensitivity to Parallel Data

Removing all bilingual data (MonoWeb) results in a 56% drop in BLEU score for machine translation (e.g., from 22.3 to 9.8 BLEU in aggregate), revealing an acute dependence on parallel documents despite their being less than 0.3% of the entire corpus. The reintroduction of parallel data (MonoWeb+Parallel) restores 91% of original BLEU performance (20.2 BLEU), while code-switching documents provide only marginal improvement (MonoWeb+CodeSwitch, 12.4 BLEU)—even though they are the majority of the bilingual subset.

Figure 1: Machine translation performance on WMT14 is almost entirely determined by the inclusion of parallel data in pretraining.
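The reported percentages follow directly from the aggregate BLEU numbers quoted in this summary; a small arithmetic check:

```python
def pct_drop(baseline, ablated):
    """Relative performance drop, as a percentage of the baseline."""
    return 100 * (baseline - ablated) / baseline

def pct_restored(baseline, restored):
    """Fraction of baseline performance recovered, as a percentage."""
    return 100 * restored / baseline

# Aggregate BLEU figures from the summary:
# FineWeb 22.3, MonoWeb 9.8, MonoWeb+Parallel 20.2
drop = pct_drop(22.3, 9.8)        # ≈ 56.1, the reported 56% drop
restored = pct_restored(22.3, 20.2)  # ≈ 90.6, the reported ~91% restoration
```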

Further analysis demonstrates profound degradation along two axes: output-language fidelity (without parallel data, models fail to produce text in the target language in 55% of cases) and semantic adequacy (lexical choices become severely underspecified even when the core meaning is preserved). This failure mode indicates that translation collapse stems not merely from an inability to model the target language, but from the absence of robust lexical alignment.

Cross-Lingual QA and Reasoning: Minimal Sensitivity to Bilingual Documents

In stark contrast, the removal of all mixed-language data produces only a moderate (≤10%) decline in cross-lingual QA (XQuAD), with MLQA and all tested reasoning tasks (XNLI, PAWS-X, HellaSwag, ARC, etc.) showing negligible or no loss relative to the full bilingual exposure setup.

Figure 2: XQuAD (a) shows some separation by dataset, but MLQA (b) and HellaSwag (d) show nearly overlapping results, indicating limited dependence on bilingual pretraining data.

This cross-task asymmetry is underpinned by representational analyses: removing mixed-language documents disrupts lexical-level cross-lingual alignment (cosine similarity on mapped word pairs), but sentence-level semantics remain robustly aligned from monolingual data alone. This explains why high-level understanding and reasoning transfer is immune to the collapse observed in machine translation.
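A lexical-alignment probe of the kind described above can be sketched as the mean cosine similarity over a bilingual dictionary of translation word pairs; the embedding lookup and pair list here are illustrative, not the paper's exact setup.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def lexical_alignment(emb, word_pairs):
    """Mean cosine similarity between embeddings of translation pairs.

    `emb` maps a word to its vector; `word_pairs` is a bilingual
    dictionary of (source, target) pairs. A drop in this score after
    an ablation signals weakened token-level cross-lingual alignment."""
    return sum(cosine(emb[s], emb[t]) for s, t in word_pairs) / len(word_pairs)
```

Under this kind of probe, the paper's result is that MonoWeb models lose lexical-level alignment while an analogous sentence-level score stays close to the baseline.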

Implications and Theoretical/Practical Consequences

These findings invalidate the assumption that simple code-switched exposure or general mixed-language noise provides significant gains for translation capacity or for cross-lingual transfer more broadly. Instead, explicit token-level alignment (parallel data) is necessary and sufficient for robust machine translation, whereas most cross-lingual tasks relying on semantics, abstraction, or reasoning do not require any bilingual documents in pretraining.

From an engineering standpoint, this indicates that efforts and resources for multilingual LLM pretraining should focus on curating high-quality parallel corpora (even at small scale) to boost translation performance, rather than augmenting with more code-switched or incidental bilingual content. For interpretability, these results reinforce the notion that sentence-level representations are shaped by higher-level task objectives and can be aligned even without explicit cross-lingual supervision.

Future Directions

Open questions remain on how these effects scale to larger models (e.g., 7B+ parameters and more general pretraining scaling laws), to typologically distant or low-resource languages, and to finer-grained control of document type, domain, and noise. Extending this work to broader language families, synthetic alignment data, or other pretraining objectives could further delineate the conditions required for cross-lingual transfer in LLMs.

Conclusion

This study provides a decisive quantification of the functional roles of mixed-language documents for multilingual LLM pretraining. It demonstrates that accurate machine translation is critically dependent on a minute fraction of parallel bilingual documents, while higher-level cross-lingual understanding and reasoning tasks remain largely agnostic to bilingual exposure. The implications span both the design of multilingual corpora and the theoretical understanding of alignment phenomena in LLMs, establishing the necessity of explicit parallel data for translation while showing the surprising sufficiency of monolingual data for semantic transfer.
