EuroLLM: European Multilingual LLM Suite
- EuroLLM is a family of dense, decoder-only Transformer models explicitly designed for robust multilingual tasks across all 24 EU languages and additional regional languages.
- The models employ advanced techniques such as grouped-query attention, SwiGLU activation, and rotary positional embeddings, scaling from 1.7B to 22B parameters.
- Evaluations show strong performance in translation, reasoning, and entity preservation, outperforming many open LLM baselines on key multilingual benchmarks.
EuroLLM refers to a family of LLMs architected, trained, and evaluated explicitly to provide strong language modeling, reasoning, and translation capabilities across all 24 official languages of the European Union (EU), as well as several additional regional and global languages. The EuroLLM suite is positioned as the most comprehensive and open European LLM initiative, targeting the historic underrepresentation of European languages in both proprietary and open-weight LLMs. The project encompasses a series of models of increasing scale (EuroLLM-1.7B, EuroLLM-9B, EuroLLM-22B) and releases all models, tokenizers, pre- and post-training data, and evaluation scripts under permissive open-source licenses (Martins et al., 2024, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
1. Model Architecture and Scaling
EuroLLM models adopt a dense, decoder-only Transformer architecture, integrating state-of-the-art design trends established in the LLM community. Each generation advances both scale and architectural complexity:
| Model | Layers | Hidden Dim. | FFN Size | Attn. Heads / KV Heads | Params (B) | Max Context | RoPE θ |
|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 24 | 2,048 | 5,632 | 16 / 8 (GQA) | 1.657 | 4,096 | 10,000 |
| EuroLLM-9B | 42 | 4,096 | 12,288 | 32 / 8 (GQA) | 9.153 | 4,096 | 10,000 |
| EuroLLM-22B | 54 | 6,144 | 16,384 | 48 / 8 (GQA) | 22.639 | 32,768 | 10⁶ |
Key architectural features include grouped-query attention (GQA) for memory-efficient inference, SwiGLU activations, pre-layer RMSNorm, rotary positional embeddings (RoPE), untied input and output embeddings, and no explicit dropout. All model sizes share a 128,000-entry multilingual SentencePiece vocabulary, and the context window grows with scale, culminating in EuroLLM-22B's 32,768-token context with a correspondingly increased RoPE base of θ = 10⁶ (Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026, Martins et al., 2024).
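The configurations in the table can be sanity-checked with a back-of-the-envelope parameter count. The sketch below assumes head_dim = hidden/heads, three SwiGLU projection matrices, untied embeddings, and per-layer RMSNorm weights; under those assumptions it reproduces the table's parameter counts to within roughly 0.002B.

```python
# Rough parameter-count estimator for the EuroLLM configurations above.
# Assumes head_dim = hidden // n_heads, SwiGLU (three FFN matrices),
# untied input/output embeddings, and per-layer RMSNorm weights.

def estimate_params(layers, hidden, ffn, n_heads, n_kv_heads, vocab=128_000):
    head_dim = hidden // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * hidden * hidden          # Q and O projections
    attn += 2 * hidden * kv_dim         # K and V projections (GQA: fewer KV heads)
    swiglu = 3 * hidden * ffn           # gate, up, and down matrices
    norms = 2 * hidden                  # pre-attention and pre-FFN RMSNorm
    per_layer = attn + swiglu + norms
    embeddings = 2 * vocab * hidden     # untied input + output embeddings
    return layers * per_layer + embeddings + hidden  # + final RMSNorm

for name, cfg in {
    "EuroLLM-1.7B": (24, 2048, 5632, 16, 8),
    "EuroLLM-9B":   (42, 4096, 12288, 32, 8),
    "EuroLLM-22B":  (54, 6144, 16384, 48, 8),
}.items():
    print(name, round(estimate_params(*cfg) / 1e9, 3))
    # → 1.657, 9.152, 22.637 — within ~0.002B of the table figures
```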
2. Data Collection, Filtering, and Curriculum
The EuroLLM pretraining corpus comprises approximately 4 trillion tokens per model, sourced to maximize both quality and linguistic diversity:
Data breakdown:
- Web data: FineWeb-Edu, RedPajama-Data-v2, HPLT, MADLAD-400, CulturaX, mC4
- Parallel corpora: Europarl v8, ParaCrawl v9/7.1, CCMatrix/Aligned, OPUS family, MultiCCAligned, WikiMatrix, UN datasets, Tatoeba, and domain-specific document-level bitext (Europarl, ParaDocs)
- High-quality data: Wikipedia (all languages), ArXiv, BooksCorpus, Apollo, US-PD-Books
- Code/math: The Stack, AlgebraicStack, OpenWebMath, Python-Edu, GSM8K, synthetic math Q&A from Qwen2.5, SlimOrca, Gemma2, MetaMathQA
Filtering is stringent, combining deduplication, language identification, perplexity filtering (KenLM), and quality scoring via Bicleaner, CometKiwi-22, and EuroFilter (a multilingual, mDeBERTa-based classifier fine-tuned on cross-lingual web quality scores). Token distributions are tuned dynamically in a three-phase curriculum, emphasizing English and code/math early, then increasingly upsampling low-resource European languages and parallel/textbook-quality data in the final 10% of tokens ("annealing") (Ramos et al., 5 Feb 2026, Martins et al., 4 Jun 2025, Martins et al., 2024).
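The multi-stage filter described above can be sketched as a short rejection pipeline. The scorer callables and thresholds below are stand-ins for illustration, not the project's actual KenLM, Bicleaner, CometKiwi-22, or EuroFilter models.

```python
# Illustrative sketch of a staged document filter: language ID, then a
# perplexity gate, then a quality-classifier threshold. All scorers and
# cutoffs here are placeholders, not the real EuroLLM filtering models.

def keep_document(doc, lang_id, perplexity, quality, *,
                  expected_lang, max_ppl=1_000.0, min_quality=0.5):
    """Apply the filter stages in order; reject on the first failure."""
    if lang_id(doc) != expected_lang:   # language identification
        return False
    if perplexity(doc) > max_ppl:       # KenLM-style perplexity gate
        return False
    return quality(doc) >= min_quality  # classifier quality score

# Toy scorers standing in for the real models:
docs = ["Bonjour le monde", "zzz xqj 123"]
kept = [d for d in docs if keep_document(
    d,
    lang_id=lambda s: "fr" if "Bonjour" in s else "und",
    perplexity=lambda s: 50.0 if "Bonjour" in s else 5_000.0,
    quality=lambda s: 0.9 if "Bonjour" in s else 0.1,
    expected_lang="fr",
)]
print(kept)  # → ['Bonjour le monde']
```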
3. Tokenizer Design and Multilingual Fertility
EuroLLM models employ a byte-fallback BPE SentencePiece tokenizer trained over the entire multilingual pretraining corpus, with a 128,000-subword vocabulary. This design achieves balanced "fertility" (average subwords per word) across all EU languages, with median fertility typically 1.2–1.4 per word—comparable to other high-capacity multilingual models but with significant embedding-size savings relative to 256k-vocab tokenizers. No language-specific preprocessing is imposed beyond Unicode normalization and byte fallback, enabling direct modeling of orthographically diverse European and adjacent languages (Martins et al., 2024, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
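Fertility can be computed directly from tokenizer output as the ratio of subword tokens to whitespace words. The sketch below uses an invented stand-in tokenizer rather than the real 128k SentencePiece model.

```python
# Median fertility (subword tokens per whitespace word) over a corpus.
# The toy tokenizer here is a stand-in: it splits any word longer than
# four characters into two pieces.
from statistics import median

def fertility(sentences, tokenize):
    ratios = []
    for s in sentences:
        words = s.split()
        ratios.append(len(tokenize(s)) / len(words))
    return median(ratios)

def toy_tokenize(s):
    return [p for w in s.split()
            for p in ([w] if len(w) <= 4 else [w[:4], w[4:]])]

sents = ["the cat sat", "internationalization matters greatly"]
print(fertility(sents, toy_tokenize))  # → 1.5
```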
4. Training Methodologies and Scaling Laws
Training uses the Adam optimizer in bfloat16 precision, large effective batch sizes (e.g., 12M tokens per batch, on up to 400 H100 GPUs for EuroLLM-9B), and a "trapezoid" (warmup–stable–decay) or three-phase cosine learning-rate schedule. No explicit dropout or weight decay is required, owing to the stability conferred by RMSNorm. Models are trained to the full 4T-token budget, with later phases tightening the curriculum and upweighting valuable low-resource data.
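The trapezoid schedule mentioned above can be sketched as a piecewise-linear function of the step count. The warmup, plateau, and decay lengths and the peak learning rate below are illustrative values, not EuroLLM's actual settings.

```python
# Trapezoid (warmup-stable-decay) learning-rate schedule. All step counts
# and the peak LR are illustrative placeholders.

def trapezoid_lr(step, *, peak=3e-4, warmup=2_000, stable=80_000, decay=8_000):
    if step < warmup:                    # linear warmup
        return peak * step / warmup
    if step < warmup + stable:           # constant plateau
        return peak
    if step < warmup + stable + decay:   # linear decay ("annealing" phase)
        done = step - warmup - stable
        return peak * (1 - done / decay)
    return 0.0
```

The final decay leg corresponds to the annealing phase where low-resource and textbook-quality data are upweighted.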
Training decisions—parallel data allocation, repetition of high-quality sources, and language weighting—are quantitatively guided by the multilingual scaling law:
L(N, f) = E + A·N^(−α) + B·(f·D)^(−β)
where N is the non-embedding parameter count, f a language's fraction of the total token budget D, L the expected cross-entropy loss, and E, A, B, α, β fitted constants. Empirical fits on small-scale runs inform the full-scale pipeline configuration; parallel fractions near 20%, repetition of high-quality sources, and language balancing are all justified by observed loss improvements (Martins et al., 2024, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
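A minimal numeric sketch of how such a law can guide data allocation: assume a Chinchilla-style form L(N, f) = E + A·N^(−α) + B·(f·D)^(−β) and sweep the fraction of the budget given to one data source. The functional form and every coefficient below are assumptions for illustration, not the fitted EuroLLM values.

```python
# Chinchilla-style loss predictor; E, A, B, alpha, beta are hypothetical
# placeholders, not the fitted EuroLLM coefficients.

def predicted_loss(N, f, D, *, E=1.7, A=400.0, alpha=0.33, B=1_000.0, beta=0.28):
    return E + A * N ** -alpha + B * (f * D) ** -beta

# With a fixed token budget D split between two sources, sweep the fraction
# f given to the second (e.g. parallel) source and pick the loss-minimizing
# split; the 0.8/0.2 weighting of the two loss terms is also illustrative.
N, D = 9e9, 4e12
best_f = min((f / 100 for f in range(5, 96)),
             key=lambda f: 0.8 * predicted_loss(N, 1 - f, D)
                         + 0.2 * predicted_loss(N, f, D))
print(best_f)
```

The optimal split depends entirely on the (here invented) coefficients; the point is only that the fitted law turns allocation choices into a closed-form optimization.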
5. Instruction Tuning and Synthetic Datasets
Instruction tuning is performed as single-stage supervised fine-tuning (SFT) on the EuroBlocks-SFT dataset (now over 10M instruction–response pairs) spanning a wide topical and linguistic range. EuroBlocks comprises both human and synthetic examples: high-quality chat/dialogue (Magpie, Aya, LMSYS-Chat-1M), code/math/STEM tasks (OpenMath-2, Nemotron), and Tower-translated SFT prompt–answer pairs. Synthetic data is generated with Llama-3 and early EuroLLM checkpoints, using contextually structured prompts and retrieval-augmented answers. Filtering is enforced with reward models (ArmoRM) and LLM-as-a-judge validation steps. Post-training yields substantial gains, particularly on low-resource and instruction-following tasks (Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026).
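Reward-model filtering of synthetic pairs can be sketched as a simple threshold gate. The scoring function and the 0.7 cutoff below are placeholders standing in for ArmoRM or an LLM judge.

```python
# Threshold gate over (prompt, response) pairs; score_fn stands in for a
# reward model or LLM-as-a-judge, and the threshold is illustrative.

def filter_sft(pairs, score_fn, threshold=0.7):
    """Keep only the pairs the reward model rates at or above threshold."""
    return [(p, r) for p, r in pairs if score_fn(p, r) >= threshold]

pairs = [("What is 2+2?", "4"), ("What is 2+2?", "Paris")]
kept = filter_sft(pairs, score_fn=lambda p, r: 0.9 if r == "4" else 0.1)
print(kept)  # → [('What is 2+2?', '4')]
```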
6. Evaluation and Benchmarking
EuroLLM models are evaluated on a spectrum of multilingual and reasoning benchmarks:
- General QA and reasoning: ARC-Easy/Challenge, HellaSwag, MMLU, MMLU-Pro, BBH, TruthfulQA, IFEval, EU20-Benchmarks (all in translated or multilingual settings).
- Machine translation: WMT24++, FLORES-200, WMT23/24, all assessed via COMET-22.
- Multilingual entity preservation: in the "Do Not Change Me" study, EuroLLM-9B achieves a macro-average accuracy of 95.89% for verbatim entity preservation across nine categories (URLs, IBANs, emails, IP addresses, ISBNs, phone numbers, social-media handles, emojis, and alphanumeric codes) and the EN, DE, PL, and UK directions, exceeding Google Translate on most entity types and outperforming other open LLMs by 10–50 pp depending on entity type and language.
| Model | Macro-avg Entity Copy Acc. | MT COMET-22 (EN→XX/XX→EN) | HellaSwag/MMLU (EU langs) | FLORES-200/ARC/IFEval |
|---|---|---|---|---|
| EuroLLM-9B | 95.89% | 84.19/83.94 | ties or leads Gemma-2-9B | leads EU open baselines |
| EuroLLM-22B | >95% on entities | Stable, competitive EU MT | +4–10pt over 9B on QA/STEM | best open European model |
Instruction tuning further boosts performance, particularly on structured QA and instruction-following, while dedicated prompting and entity tokenization techniques can raise entity transfer accuracy by up to 6pp on smaller models (Wisniewski et al., 9 May 2025, Martins et al., 4 Jun 2025, Ramos et al., 5 Feb 2026, Martins et al., 2024).
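A minimal version of such an entity-copy check can be sketched with regular expressions: extract entities from the source and verify they appear verbatim in the translation. The patterns below cover only two of the nine categories, and the example sentences and addresses are invented for illustration.

```python
# Entity-copy accuracy: fraction of source-side entities that survive
# translation unchanged. Simplified to emails and URLs; the study covers
# nine categories across four language directions.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "url": re.compile(r"https?://\S+"),
}

def entity_copy_accuracy(pairs):
    total = copied = 0
    for src, tgt in pairs:
        for pat in PATTERNS.values():
            for ent in pat.findall(src):
                total += 1
                copied += ent in tgt  # verbatim substring match
    return copied / total if total else 1.0

pairs = [
    ("Write to ana@example.com today.", "Schreiben Sie heute an ana@example.com."),
    ("See https://example.org for details.", "Siehe https://example.net."),  # URL altered
]
print(entity_copy_accuracy(pairs))  # → 0.5
```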
7. Insights from Longitudinal and Causal Analysis
Causal interpretability studies using activation patching demonstrate that EuroLLM models develop language-agnostic concept spaces early in their training—within the first 10% of tokens. These shared spaces continually refine in response to multi-aligned parallel data and drive strong cross-lingual alignment, especially for high-resource target languages. Fine-grained analysis reveals that part of the observed improvements in translation and entity transfer reflect behavioral shifts (e.g., sense disambiguation in polysemous words; suppression of cross-lingual homograph copying) rather than naive lexical matching.
The alignment progression is language-dependent: English and other high-resource languages achieve alignment gains 10–20pp above baseline by early checkpoints, while low-resource cases require multi-way parallel datasets to achieve similar gains. Methodologically, activation patching is most informative for simple, concept-level interventions; current approaches show limitations for complex or multi-sentence settings (Körner et al., 30 Jan 2026).
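The activation-patching mechanic can be illustrated on a toy two-layer model: cache a hidden activation from a counterfactual run, splice it into the base run, and observe how the output changes. This is a schematic of the intervention only, not the study's actual setup.

```python
# Toy activation patching on a two-layer pure-Python "model": swap one
# hidden activation from a counterfactual input into the base forward pass.

def layer1(x):       # toy "encoder": squares the input
    return x * x

def layer2(h):       # toy "decoder": shifts the hidden state
    return h + 1

def forward(x, patch_h=None):
    h = layer1(x) if patch_h is None else patch_h  # optionally patch layer-1 output
    return layer2(h)

base, counterfactual = 3, 5
cached_h = layer1(counterfactual)          # cache activation from counterfactual run
patched = forward(base, patch_h=cached_h)  # replay base run with patched activation
print(forward(base), patched)              # prints: 10 26
```

The difference between the clean and patched outputs localizes how much of the behavior flows through the patched activation, which is the same logic applied to EuroLLM's hidden states at scale.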
8. Open Release, Limitations, and Outlook
All EuroLLM models, datasets (EuroBlocks, EuroWeb), tokenizers, training code (Megatron-LM/Axolotl), and evaluation pipelines are open-sourced under permissive MIT-like licenses (Hugging Face/utter-project). Documented limitations include the gap to the largest open LLMs on certain tasks, the need for additional high-quality web and domain-specific data, and as-yet-unexplored RLHF and further reward modeling for alignment. Context-length extensions open new research avenues in retrieval and long-form modeling. Planned future work targets scaling beyond 33B parameters, improving low-resource coverage, and expanding to additional European dialects and modalities (Ramos et al., 5 Feb 2026, Martins et al., 4 Jun 2025, Martins et al., 2024).