EuroLLM: European Multilingual LLM Suite

Updated 6 February 2026
  • EuroLLM is a family of dense, decoder-only Transformer models explicitly designed for robust multilingual tasks across all 24 EU languages and additional regional languages.
  • The models employ advanced techniques such as grouped-query attention, SwiGLU activation, and rotary positional embeddings, scaling from 1.7B to 22B parameters.
  • Evaluations show strong performance in translation, reasoning, and entity preservation, outperforming many open LLM baselines on key multilingual benchmarks.

EuroLLM refers to a family of LLMs architected, trained, and evaluated explicitly to provide strong language modeling, reasoning, and translation capabilities across all 24 official languages of the European Union (EU), as well as several additional regional and global languages. The EuroLLM suite is positioned as the most comprehensive and open European LLM initiative, targeting the historic underrepresentation of European languages in both proprietary and open-weight LLMs. The project encompasses a series of models of increasing scale (EuroLLM-1.7B, EuroLLM-9B, EuroLLM-22B) and releases all models, tokenizers, pre- and post-training data, and evaluation scripts under permissive open-source licenses (Martins et al., 2024, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).

1. Model Architecture and Scaling

EuroLLM models adopt a dense, decoder-only Transformer architecture, integrating state-of-the-art design trends established in the LLM community. Each generation advances both scale and architectural complexity:

| Model | Layers | Hidden Dim. | FFN Size | Attn. Heads / KV Heads | Params (B) | Max Context | RoPE θ |
|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 24 | 2,048 | 5,632 | 16 / 8 (GQA) | 1.657 | 4,096 | 10,000 |
| EuroLLM-9B | 42 | 4,096 | 12,288 | 32 / 8 (GQA) | 9.153 | 4,096 | 10,000 |
| EuroLLM-22B | 54 | 6,144 | 16,384 | 48 / 8 (GQA) | 22.639 | 32,768 | 10⁶ |

Key architectural features include grouped-query attention (GQA) for memory-efficient inference, SwiGLU activations, pre-layer RMSNorm, rotary positional embeddings (RoPE), and untied input/output embeddings, with no explicit dropout or weight decay beyond the stabilizing effect of normalization. Across scales, the models share a 128,000-subword multilingual SentencePiece tokenizer and progressively extended context windows, culminating in EuroLLM-22B supporting 32,768-token contexts with a correspondingly scaled RoPE base (Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026, Martins et al., 2024).
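The memory benefit of GQA is easy to quantify from the table above: the KV cache scales with the number of KV heads, not query heads. A minimal sketch using the EuroLLM-9B shape (42 layers, hidden 4,096, 32 query / 8 KV heads); the bfloat16 element size and batch size of 1 are illustrative assumptions:

```python
# Sketch: KV-cache size under grouped-query attention (GQA) vs. full
# multi-head attention (MHA), using the EuroLLM-9B shape from the table.
# Element size (2 bytes, bfloat16) and batch size 1 are assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Total bytes for the K and V caches across all layers (batch size 1)."""
    # 2 tensors (K and V), each [n_kv_heads, seq_len, head_dim] per layer
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

n_layers, hidden, n_heads, n_kv = 42, 4096, 32, 8
head_dim = hidden // n_heads  # 128

mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len=4096)
gqa = kv_cache_bytes(n_layers, n_kv, head_dim, seq_len=4096)
print(f"MHA cache: {mha / 2**20:.0f} MiB, GQA cache: {gqa / 2**20:.0f} MiB")
print(f"Reduction factor: {mha // gqa}x")  # 32/8 = 4x
```

With 8 KV heads shared among 32 query heads, the cache shrinks by exactly the head ratio (4x), which is what makes long-context serving of the 22B model tractable.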

2. Data Collection, Filtering, and Curriculum

The EuroLLM pretraining corpus comprises approximately 4 trillion tokens per model, sourced to maximize both quality and linguistic diversity:

Data breakdown:

  • Web data: FineWeb-Edu, RedPajama-Data-v2, HPLT, MADLAD-400, CulturaX, mC4
  • Parallel corpora: Europarl v8, ParaCrawl v9/7.1, CCMatrix/Aligned, OPUS family, MultiCCAligned, WikiMatrix, UN datasets, Tatoeba, and domain-specific document-level bitext (Europarl, ParaDocs)
  • High-quality data: Wikipedia (all languages), ArXiv, BooksCorpus, Apollo, US-PD-Books
  • Code/math: The Stack, Algebraic-stack, OpenWebMath, Python-Edu, GSM8K, synthetic math Q&A from Qwen2.5, SlimOrca, Gemma2, MetaMathQA

Filtering is stringent, combining deduplication, language identification, perplexity filtering (KenLM), and quality scoring via Bicleaner, CometKiwi-22, and EuroFilter (a multilingual, mDeBERTa-based classifier fine-tuned on cross-lingual web quality scores). Token distributions are tuned dynamically in a three-phase curriculum, emphasizing English and code/math early, then increasingly upsampling low-resource European languages and parallel/textbook-quality data in the final 10% of tokens ("annealing") (Ramos et al., 5 Feb 2026, Martins et al., 4 Jun 2025, Martins et al., 2024).
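The three-phase curriculum can be sketched as a progress-dependent sampling mixture. The phase boundaries and weights below are illustrative assumptions, not the published EuroLLM mixture; only the overall shape (English/code-heavy early, parallel and high-quality data upweighted in the final 10%) follows the description above:

```python
# Sketch of a three-phase data-mixture curriculum. Weights and phase
# boundaries are illustrative assumptions, not the published mixture.

def mixture_weights(progress: float) -> dict:
    """Sampling weight per data source at training progress in [0, 1]."""
    if progress < 0.5:    # phase 1: English web and code/math heavy
        return {"en_web": 0.50, "multi_web": 0.25, "code_math": 0.15,
                "parallel": 0.05, "high_quality": 0.05}
    elif progress < 0.9:  # phase 2: upsample non-English web data
        return {"en_web": 0.35, "multi_web": 0.35, "code_math": 0.10,
                "parallel": 0.10, "high_quality": 0.10}
    else:                 # phase 3 ("annealing"): final 10% of tokens
        return {"en_web": 0.20, "multi_web": 0.30, "code_math": 0.05,
                "parallel": 0.20, "high_quality": 0.25}

for p in (0.1, 0.7, 0.95):
    w = mixture_weights(p)
    assert abs(sum(w.values()) - 1.0) < 1e-9  # weights form a distribution
```

A data loader would consult `mixture_weights` at each step to decide which source to draw the next batch from.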

3. Tokenizer Design and Multilingual Fertility

EuroLLM models employ a byte-fallback BPE SentencePiece tokenizer trained over the entire multilingual pretraining corpus, with a 128,000-subword vocabulary. This design achieves balanced "fertility" (average subwords per word) across all EU languages, with median fertility typically 1.2–1.4 per word—comparable to other high-capacity multilingual models but with significant embedding-size savings relative to 256k-vocab tokenizers. No language-specific preprocessing is imposed beyond Unicode normalization and byte fallback, enabling direct modeling of orthographically diverse European and adjacent languages (Martins et al., 2024, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
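Fertility as defined above is straightforward to measure given any tokenizer. A minimal sketch; the `toy_tokenize` function is a stand-in assumption for the real SentencePiece model (e.g. `sentencepiece`'s `encode_as_pieces`), used here only so the example is self-contained:

```python
# Sketch: measuring tokenizer "fertility" (average subwords per
# whitespace-separated word) over a sample corpus. `toy_tokenize` is a
# stand-in for the real 128k-vocabulary SentencePiece tokenizer.

def fertility(tokenize, sentences):
    """Total subwords divided by total whitespace words."""
    n_words = sum(len(s.split()) for s in sentences)
    n_subwords = sum(len(tokenize(s)) for s in sentences)
    return n_subwords / n_words

# Toy tokenizer assumption: splits each word into chunks of <= 4 chars.
def toy_tokenize(text):
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]

sample = ["the quick brown fox", "Übersetzung ist schwierig"]
print(f"fertility = {fertility(toy_tokenize, sample):.2f}")
```

On real corpora, a fertility near 1 means most words survive as single tokens; values much above the 1.2–1.4 range quoted above indicate a language is over-segmented and effectively pays more compute per word.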

4. Training Methodologies and Scaling Laws

Training uses the Adam optimizer in bfloat16 precision, with large effective batch sizes (e.g., 12M tokens per batch on up to 400 H100 GPUs for EuroLLM-9B) and a "trapezoid" (warmup–constant–decay) or three-phase cosine learning-rate schedule. No explicit dropout or weight decay is applied; RMSNorm provides sufficient training stability. The models are trained to the full 4T-token budget, with later phases tightening the curriculum and upweighting valuable low-resource data.
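The trapezoid schedule mentioned above can be sketched in a few lines. The peak learning rate and the warmup/decay fractions here are illustrative assumptions, not the published EuroLLM settings:

```python
# Sketch of a "trapezoid" (warmup-stable-decay) learning-rate schedule.
# Peak LR and phase fractions are illustrative assumptions.

def trapezoid_lr(step, total_steps, peak_lr=3e-4,
                 warmup_frac=0.01, decay_frac=0.10):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:               # short linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                # long constant plateau
        return peak_lr
    # linear decay to zero over the final fraction of training
    return peak_lr * (total_steps - step) / decay_steps

total = 100_000
assert trapezoid_lr(0, total) == 0.0          # start of warmup
assert trapezoid_lr(50_000, total) == 3e-4    # plateau
assert trapezoid_lr(total, total) == 0.0      # end of decay
```

Unlike a cosine schedule, the plateau means the decay phase can be restarted from any mid-training checkpoint, which pairs naturally with the annealing phase of the data curriculum.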

Training decisions—parallel data allocation, repetition of high-quality sources, and language weighting—are quantitatively guided by the multilingual scaling law:

L(N, p) = f(p)\,\beta\,N^{-\alpha} + L_\infty, \qquad f(p) = p + c_1\,p^{c_2}(1 - p)^{c_3}

where N is the number of non-embedding parameters, p the fraction of parallel data, and L(N, p) the expected cross-entropy loss. Empirical fitting on small-scale runs informs full-scale pipeline configuration; parallel fractions near 20%, repetition of high-quality sources, and language balancing are all justified by observed loss improvements (Martins et al., 2024, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
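The scaling law above is cheap to evaluate once fitted. A minimal sketch; all constants (α, β, L∞, c₁–c₃) are illustrative assumptions, not the fitted EuroLLM values:

```python
# Sketch: evaluating the multilingual scaling law quoted above. All
# constants are illustrative assumptions, not the fitted EuroLLM values.

def f(p, c1=0.5, c2=0.4, c3=0.6):
    """Data-mixture term: f(p) = p + c1 * p^c2 * (1 - p)^c3."""
    return p + c1 * p**c2 * (1 - p)**c3

def expected_loss(N, p, alpha=0.3, beta=50.0, L_inf=1.5):
    """L(N, p) = f(p) * beta * N^(-alpha) + L_inf."""
    return f(p) * beta * N**(-alpha) + L_inf

# Loss falls with model size and approaches L_inf for very large N.
for N in (1.7e9, 9e9, 22e9):
    print(f"N = {N:.1e}: L = {expected_loss(N, p=0.2):.4f}")
```

In practice the constants are fitted on small-scale runs, and the fitted law is then queried at the target N to compare candidate parallel-data fractions p before committing to a full 4T-token run.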

5. Instruction Tuning and Synthetic Datasets

Instruction tuning is performed as single-stage supervised fine-tuning (SFT) on the EuroBlocks-SFT dataset—now >10M instruction–response pairs—spanning a wide topical and linguistic range. EuroBlocks comprises both human and synthetic examples: high-quality chat/dialogue (Magpie, Aya, LMSYS-Chat-1M), code/math/STEM tasks (OpenMath-2, Nemotron), and Tower-translated SFT prompt/answer pairs. Synthetic data is generated with Llama-3 and early EuroLLM checkpoints, using contextually structured prompts and retrieval-augmented answer logic. Filtering is enforced by a reward model (ArmoRM) and LLM-as-a-judge validation steps. Post-training yields substantial performance gains, particularly for low-resource and instructional tasks (Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
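The reward-model filtering step can be sketched as a simple threshold over pair scores. The `toy_score` function and the threshold are stand-in assumptions for the real reward model and its tuning:

```python
# Sketch of reward-model filtering for SFT data: keep only
# instruction-response pairs whose score clears a threshold. The scorer
# and threshold here are stand-ins for the real reward model.

def filter_sft_pairs(pairs, score, threshold=0.7):
    """Return the (prompt, response) pairs the reward model accepts."""
    return [(p, r) for p, r in pairs if score(p, r) >= threshold]

# Toy scorer assumption: longer, more substantive responses score higher.
def toy_score(prompt, response):
    return min(1.0, len(response) / 50)

pairs = [("Translate 'hello' to German.", "Hallo!"),
         ("Explain RoPE.", "Rotary positional embeddings encode positions "
                           "by rotating query/key vectors.")]
kept = filter_sft_pairs(pairs, toy_score)
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

An LLM-as-a-judge pass would then apply a second, independent accept/reject decision to the surviving pairs.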

6. Evaluation and Benchmarking

EuroLLM models are evaluated on a spectrum of multilingual and reasoning benchmarks:

  • General QA and reasoning: ARC-Easy/Challenge, HellaSwag, MMLU, MMLU-Pro, BBH, TruthfulQA, IFEval, EU20-Benchmarks (all in translated or multilingual settings).
  • Machine translation: WMT24++, FLORES-200, WMT23/24, all assessed via COMET-22.
  • Multilingual entity transfer: Detailed in the "Do Not Change Me" study, EuroLLM-9B achieves a macro-average accuracy of 95.89% at preserving entities unmodified across nine categories (URLs, IBANs, emails, IP addresses, ISBNs, phone numbers, social media handles, emojis, and alphanumeric codes) and the translation directions EN, DE, PL, UK—exceeding Google Translate on most entity types and outperforming other open LLMs by 10–50 pp depending on entity type and language.
| Model | Macro-avg Entity Copy Acc. | MT COMET-22 (EN→XX / XX→EN) | HellaSwag / MMLU (EU langs) | FLORES-200 / ARC / IFEval |
|---|---|---|---|---|
| EuroLLM-9B | 95.89% | 84.19 / 83.94 | ties or leads Gemma-2-9B | leads EU open baselines |
| EuroLLM-22B | >95% on entities | stable, competitive EU MT | +4–10 pt over 9B on QA/STEM | best open European model |

Instruction tuning further boosts performance, particularly on structured QA and instruction-following, while dedicated prompting and entity tokenization techniques can raise entity transfer accuracy by up to 6pp on smaller models (Wisniewski et al., 9 May 2025, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026, Martins et al., 2024).
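The entity-preservation metric above amounts to checking that entities detected in the source appear verbatim in the translation. A minimal sketch in that spirit; only two entity types are shown, and the regexes are simplified assumptions rather than the study's detectors:

```python
# Sketch: entity copy accuracy in the spirit of the "Do Not Change Me"
# evaluation. Only URLs and emails are covered; regexes are simplified.
import re

ENTITY_PATTERNS = {
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def entity_copy_accuracy(source, translation):
    """Fraction of source entities reproduced verbatim in the translation."""
    entities = [m for pat in ENTITY_PATTERNS.values()
                for m in pat.findall(source)]
    if not entities:
        return 1.0  # vacuously perfect when nothing must be copied
    kept = sum(e in translation for e in entities)
    return kept / len(entities)

src = "Contact support@example.org or visit https://example.org/help"
hyp = ("Kontaktieren Sie support@example.org oder besuchen Sie "
       "https://example.org/help")
print(entity_copy_accuracy(src, hyp))  # 1.0
```

The reported macro-average is then the mean of per-category accuracies, so rare entity types weigh as much as common ones.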

7. Insights from Longitudinal and Causal Analysis

Causal interpretability studies using activation patching demonstrate that EuroLLM models develop language-agnostic concept spaces early in their training—within the first 10% of tokens. These shared spaces continually refine in response to multi-aligned parallel data and drive strong cross-lingual alignment, especially for high-resource target languages. Fine-grained analysis reveals that part of the observed improvements in translation and entity transfer reflect behavioral shifts (e.g., sense disambiguation in polysemous words; suppression of cross-lingual homograph copying) rather than naive lexical matching.

The alignment progression is language-dependent: English and other high-resource languages achieve alignment gains 10–20pp above baseline by early checkpoints, while low-resource cases require multi-way parallel datasets to achieve similar gains. Methodologically, activation patching is most informative for simple, concept-level interventions; current approaches show limitations for complex or multi-sentence settings (Körner et al., 30 Jan 2026).
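The core mechanic of activation patching is simple: cache an intermediate activation from a run on one input, then substitute it into a run on another input and observe the change in output. A toy sketch under heavy assumptions—the two-layer "model" below is an illustrative stand-in, not an LLM:

```python
# Sketch of activation patching: cache an activation from a source run,
# then swap it into a target run to measure its causal effect on the
# output. The two-layer toy "model" is an illustrative stand-in.

def layer1(x):
    return [v * 2 for v in x]

def layer2(h):
    return sum(h)

def run(x, patch_after_layer1=None):
    h = layer1(x)
    if patch_after_layer1 is not None:
        h = patch_after_layer1      # causal intervention: swap activation
    return layer2(h)

source, target = [1, 2, 3], [4, 5, 6]
cached = layer1(source)             # cache activation from the source run
patched_out = run(target, patch_after_layer1=cached)
print(run(target), "->", patched_out)  # 30 -> 12
```

In the studies above, the source and target inputs are the same concept expressed in different languages; if patching transfers the behavior, the patched layer carries a language-agnostic representation of that concept.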

8. Open Release, Limitations, and Outlook

All EuroLLM models, datasets (EuroBlocks, EuroWeb), tokenizers, training code (Megatron-LM/Axolotl), and evaluation pipelines are open-sourced under permissive MIT-like licenses (Hugging Face/utter-project). Documented limitations include the gap to the largest open LLMs on certain tasks, the need for additional high-quality web and domain-specific data, and as-yet-unexplored RLHF or further reward modeling for alignment. Context-length extensions open new research avenues in retrieval and long-form modeling. Planned future work targets scaling beyond 33B parameters, improving low-resource coverage, and expanding to additional European dialects and modalities (Ramos et al., 5 Feb 2026, Martins et al., 4 Jun 2025, Martins et al., 2024).
