EuroLLM-22B: Open Multilingual LLM
- EuroLLM-22B is a large, multilingual language model with 22B parameters trained from scratch to support 35 diverse languages.
- It features a unidirectional, decoder-only Transformer architecture with 54 layers and grouped query attention for efficient long-context handling.
- The model is pre-trained on 4 trillion tokens over a three-phase curriculum, achieving state-of-the-art performance in multilingual reasoning, translation, and instruction following.
EuroLLM-22B is a large, multilingual, open-access LLM comprising approximately 22 billion parameters. It is trained from scratch to natively support all twenty-four official languages of the European Union, in addition to eleven further high- and low-resource languages. The model is optimized to address the persistent underrepresentation of European languages in existing open LLMs. EuroLLM-22B demonstrates strong performance in multilingual reasoning, instruction following, and translation benchmarks, and all model checkpoints, data artifacts, and codebases are released under fully open terms (Ramos et al., 5 Feb 2026).
1. Tokenizer and Subword Vocabulary
EuroLLM-22B employs a Byte-Pair Encoding (BPE) tokenizer, inherited from the previous EuroLLM releases at 1.7B and 9B scale. The tokenizer operates directly on raw text, without requiring language-specific pretokenization, and is designed to produce subword units that balance granularity between whole words and individual characters. The vocabulary contains 128,000 tokens, a size selected to efficiently capture the inflectional morphology and named entities typical of the 35 target languages while keeping embedding size and tokenized sequence length manageable.
Token coverage in the pretraining corpus is measured by cumulative frequency:

$$C(V) = \frac{\sum_{i=1}^{V} f_i}{\sum_{j=1}^{N} f_j},$$

where $f_i$ denotes the frequency of the $i$-th most frequent subword and $N$ is the total number of observed subword types. At the chosen vocabulary size $V$, coverage reaches approximately 99.3% of observed byte-pair tokens. This vocabulary design enables efficient handling of both high- and low-resource European languages.
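The coverage statistic above can be sketched in a few lines; the toy frequencies below are illustrative, not the paper's corpus:

```python
from collections import Counter

def cumulative_coverage(token_freqs, vocab_size):
    """Fraction of corpus tokens covered by the `vocab_size` most frequent subwords."""
    freqs = sorted(token_freqs.values(), reverse=True)
    return sum(freqs[:vocab_size]) / sum(freqs)

# Toy corpus: frequencies of hypothetical subword units.
freqs = Counter({"the": 50, "un": 20, "der": 15, "rep": 10, "##z": 5})
print(cumulative_coverage(freqs, 3))  # top-3 subwords cover 85/100 tokens -> 0.85
```

At EuroLLM's scale the same computation runs over the full subword frequency table of the pretraining corpus.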
2. Model Architecture
EuroLLM-22B is constructed as a unidirectional, decoder-only Transformer with 54 layers and a hidden size of 6,144. The total parameter count is 22.639 billion, of which 21.067B are non-embedding parameters; the remainder sits in the untied input embedding and output projection matrices. The feed-forward network uses a hidden size of 16,384. The attention mechanism implements 48 attention heads with Grouped Query Attention (GQA) using eight key/value heads. Rotary Position Embedding (RoPE) scaling is used, with the final training phase raising the rotary base to $\theta = 10^6$ to enable sequence lengths up to 32K tokens.
Key architectural features include:
- Grouped Query Attention (GQA) with eight key/value heads for inference efficiency and preserved quality.
- Pre-layer normalization via RMSNorm.
- SwiGLU activations in the feed-forward paths.
- RoPE with extended scaling for long-context operation.
- No weight tying between embeddings and output heads, maintaining flexibility in representation.
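A minimal NumPy sketch of the GQA pattern listed above, in which each group of six query heads (48 / 8) shares one key/value head; shapes and the masking style are illustrative, not the training implementation:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads=48, n_kv_heads=8):
    """Minimal causal GQA. q: (T, n_heads, d_h); k, v: (T, n_kv_heads, d_h)."""
    T, _, d_h = q.shape
    group = n_heads // n_kv_heads                      # 6 query heads per KV head
    k = np.repeat(k, group, axis=1)                    # broadcast KV to all heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d_h)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)  # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True)) # softmax over key positions
    w /= w.sum(-1, keepdims=True)
    return np.einsum("hts,shd->thd", w, v)

T, d_h = 4, 16
out = grouped_query_attention(
    np.random.randn(T, 48, d_h), np.random.randn(T, 8, d_h), np.random.randn(T, 8, d_h))
print(out.shape)  # (4, 48, 16)
```

Storing 8 rather than 48 key/value heads shrinks the KV cache six-fold, which is the inference-efficiency gain the GQA bullet refers to.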
The parameter count decomposes as:

$$P_{\text{total}} = P_{\text{emb}} + \sum_{\ell=1}^{L} P_{\ell},$$

where the first term covers the token embedding and output layers and the second term sums over the $L = 54$ Transformer layers, each contributing $P_{\ell}$ parameters.
| | 1.7B | 9B | 22B |
|---|---|---|---|
| Layers | 24 | 42 | 54 |
| Hidden size | 2048 | 4096 | 6144 |
| FFN size | 5632 | 12288 | 16384 |
| Attn heads | 16 | 32 | 48 |
| KV (GQA) | 8 | 8 | 8 |
| Rotary θ | 10⁴ | 10⁴ | 10⁶ |
| Total params | 1.657B | 9.153B | 22.639B |
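The table's configurations can be cross-checked with a rough parameter-count sketch (norms and biases are ignored, and the 128,000-token vocabulary is taken from the tokenizer section; this is an approximation, not the paper's exact accounting):

```python
def transformer_params(L, d, d_ff, n_heads, n_kv, vocab, tied=False):
    """Approximate parameters of a decoder-only model with GQA and SwiGLU."""
    d_h = d // n_heads
    attn = 2 * d * d + 2 * d * (n_kv * d_h)  # Q and O projections, shared K/V
    ffn = 3 * d * d_ff                       # SwiGLU uses three weight matrices
    emb = vocab * d * (1 if tied else 2)     # untied input/output embeddings
    return emb + L * (attn + ffn)

configs = {
    "9B":  dict(L=42, d=4096, d_ff=12288, n_heads=32, n_kv=8),
    "22B": dict(L=54, d=6144, d_ff=16384, n_heads=48, n_kv=8),
}
for name, cfg in configs.items():
    print(name, round(transformer_params(vocab=128_000, **cfg) / 1e9, 2))
# 9B 9.15
# 22B 22.64
```

Both estimates land within rounding distance of the table's totals (9.153B and 22.639B), confirming the per-layer arithmetic.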
3. Pre-training Data and Quality Control
Pre-training data totals approximately 4 trillion tokens, organized as a three-phase curriculum (coarse, medium, fine) with successively stricter quality filtering.
Phase 1 (3.6T tokens):
Includes English documents from "FineWeb-Edu" with educational score ≥2 and multilingual web data filtered using perplexity and additional heuristics.
Phase 2 (~200B tokens):
Leverages Nemotron-CC splits, document-level parallel corpora (Europarl, ParaDocs), and STEM/code resources.
Phase 3 (~200B tokens):
Incorporates heavily filtered, high-quality web content, synthetic math data (Qwen-2.5-Math), document-level translation data, and long-context resources (books, code) upsampled for 32K context handling.
For major EU languages (German, French, Spanish, Italian), data derives from RedPajama-Data-v2 and is filtered sequentially on KenLM perplexity, minimum character length, exclusion of synthetic or code-dominated content, uppercase ratio, symbol-to-word ratio, and non-alphabetic fraction. Across all languages, an mDeBERTa-based EuroFilter classifier assigns quality scores (0–5), creating three quality tiers.
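A sequential filter of the kind described above can be sketched as follows; every threshold here is an illustrative placeholder, not the paper's value:

```python
def passes_filters(doc: str, perplexity: float) -> bool:
    """Apply document-level heuristics in sequence; reject on first failure.
    All cutoffs are hypothetical stand-ins for the paper's tuned thresholds."""
    words = doc.split()
    if perplexity > 1000:                      # KenLM perplexity cutoff
        return False
    if len(doc) < 200:                         # minimum character length
        return False
    upper = sum(c.isupper() for c in doc) / max(len(doc), 1)
    if upper > 0.4:                            # uppercase ratio
        return False
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha < 0.5:                            # non-alphabetic fraction
        return False
    symbols = sum(not c.isalnum() and not c.isspace() for c in doc)
    if symbols / max(len(words), 1) > 0.5:     # symbol-to-word ratio
        return False
    return True

clean = "Die Europäische Union hat vierundzwanzig Amtssprachen, " * 5
print(passes_filters(clean, perplexity=350.0))           # True
print(passes_filters("#### $$$$ ////", perplexity=350.0))  # False
```

Ordering cheap checks after the perplexity gate mirrors the sequential filtering the section describes, where each stage only sees documents that survived the previous one.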
Parallel data are extracted from over 40 public corpora (e.g., Europarl v8, ParaCrawl v9, CCMatrix v1, WikiMatrix v1, OPUS100 v1) and processed with Bifixer, Bicleaner (with a higher threshold for Portuguese), and CometKiwi quality-estimation filtering to ensure translation accuracy.
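Threshold-based bitext filtering of this kind reduces to scoring each sentence pair and keeping those above a cutoff. The sketch below uses a toy length-ratio scorer as a crude stand-in for a quality-estimation model, and the 0.7 threshold is a placeholder, not the paper's cutoff:

```python
def filter_bitext(pairs, score_fn, threshold=0.7):
    """Keep only sentence pairs whose quality score clears the threshold."""
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]

def toy_score(src, tgt):
    """Crude stand-in for a QE model: penalize large length mismatch."""
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt)

pairs = [("the cat sat", "le chat était assis"),
         ("hello", "bonjour tout le monde et bienvenue")]
print(filter_bitext(pairs, toy_score))  # keeps only the first pair
```

In the actual pipeline, `score_fn` would be a learned quality-estimation model applied after Bifixer/Bicleaner cleaning.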
4. Training Methodology
Training is performed using Megatron-LM, with comprehensive data and model parallelism distributed across EuroHPC clusters. The objective is causal language modeling with cross-entropy loss, optimized by Adam. The learning-rate schedule consists of three phases:
- Linear warmup to the peak learning rate over the first 10% of tokens.
- Constant at the peak rate over the next 2T tokens.
- Linear decay over 400B tokens, then annealing to zero.
Context window increases from 4K to 32K tokens during phase 3 via RoPE scaling. Checkpoints are saved every 10B tokens, with the final model selected for post-training.
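The warmup/constant/decay/anneal schedule above can be sketched as a single function of tokens seen; the peak and floor learning-rate values are placeholders, since the paper's exact rates are not given here:

```python
def lr_schedule(tokens_seen, total=4.0e12, peak=3e-4, warmup_frac=0.10,
                constant=2.0e12, decay=4.0e11, floor=3e-5):
    """Warmup -> constant -> linear decay -> anneal-to-zero (rates are placeholders)."""
    warmup = warmup_frac * total
    if tokens_seen < warmup:                              # linear warmup
        return peak * tokens_seen / warmup
    if tokens_seen < warmup + constant:                   # constant phase
        return peak
    if tokens_seen < warmup + constant + decay:           # linear decay to floor
        frac = (tokens_seen - warmup - constant) / decay
        return peak + frac * (floor - peak)
    remaining = total - (warmup + constant + decay)       # final anneal to zero
    frac = (tokens_seen - warmup - constant - decay) / remaining
    return max(floor * (1 - frac), 0.0)

for t in (0, 2e11, 1e12, 2.6e12, 3.9e12):
    print(f"{t:.1e} tokens -> lr {lr_schedule(t):.2e}")
```

With a 4T-token budget, warmup covers 400B tokens, the constant phase runs to 2.4T, decay to 2.8T, and the remaining 1.2T anneals to zero.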
5. Evaluation and Benchmarking
EuroLLM-22B and its instruction-tuned variant, EuroLLM-22B-Instruct, are evaluated on a suite of multilingual and reasoning-focused benchmarks:
- Instruction following: IFEval
- General knowledge: HellaSwag, MMLU, MMLU-ProX, BBH
- STEM: ARC-C, GPQA-Diamond, GSM8K, MATH-500, HumanEval
- Multilingual: mHellaSwag, MMMLU, MMLU-ProX
- Multilingual STEM: mARC-C, MGSM
- Translation: FLORES-200, WMT24++, WMT25
English Results (Table 3):
| Model | IFEval | HellaSwag | MMLU | MMLU-Pro | ARC-C | GSM8K | MATH-500 | HumanEval |
|---|---|---|---|---|---|---|---|---|
| EuroLLM-9B | 62.4 | 53.0 | 65.5 | 42.3 | 85.9 | 74.6 | 36.9 | 50.8 |
| EuroLLM-22B | 67.2 | 69.7 | 69.8 | 50.8 | 89.8 | 85.5 | 54.5 | 53.9 |
| Apertus-8B | 59.1 | 58.1 | 57.3 | 32.7 | 75.5 | 67.7 | 26.9 | 39.0 |
| Apertus-70B | 61.2 | 74.6 | 67.9 | 41.9 | 84.7 | 80.0 | 42.3 | 44.5 |
Multilingual Results (Table 4) — Aggregated over 24 EU languages:
| Model | HellaSwag | MMMLU | ARC-C | FLORES | WMT24++ |
|---|---|---|---|---|---|
| EuroLLM-9B | 49.9 | 61.5 | 80.7 | 88.9 | 83.6 |
| EuroLLM-22B | 62.6 | 65.6 | 84.1 | 88.9 | 83.9 |
| Apertus-8B | 50.9 | 54.0 | 71.0 | 87.8 | 81.5 |
| Apertus-70B | 68.6 | 61.7 | 79.6 | 85.1 | 76.0 |
Ablation with the updated post-training (EuroBlocks-SFT-2512) yields consistent +5–10 point improvements in reasoning and instruction-following tasks, with translation quality unaffected.
6. Released Resources and Reproducibility
All models, training corpora, and evaluation code are released openly on HuggingFace under the “EuroLLM” collection. Specifically, releases include:
- Base and instruction-tuned models (EuroLLM-22B, EuroLLM-22B-Instruct, improved 9B variants)
- EuroWeb multilingual web pretraining data (three quality tiers)
- EuroBlocks-SFT-2512 instruction data (approximately 10.6 million multilingual examples)
- Megatron-LM fork for pretraining and the eurollm-eval evaluation suite
These releases emphasize full reproducibility for the multilingual LLM community and support extensibility to further European and non-European languages.
7. Significance and Impact
EuroLLM-22B establishes state-of-the-art performance among fully-open European multilingual LLMs at this scale, exceeding prior open models in general reasoning, STEM, and instruction-following benchmarks across English and 24 EU languages. The integration of a balanced BPE vocabulary, rigorous multi-phase data filtering, scale-efficient attention mechanisms, and long-context support addresses longstanding gaps in high-quality, openly accessible multilingual LLM resources for Europe. The openly released data, models, and codebases are explicitly designed to foster research and technology development serving the full linguistic diversity of Europe (Ramos et al., 5 Feb 2026).