EuroLLM-22B: Open Multilingual LLM
- EuroLLM-22B is a large, multilingual language model with 22B parameters trained from scratch to support 35 diverse languages.
- It features a unidirectional, decoder-only Transformer architecture with 54 layers and grouped query attention for efficient long-context handling.
- The model is pre-trained on 4 trillion tokens over a three-phase curriculum, achieving state-of-the-art performance in multilingual reasoning, translation, and instruction following.
EuroLLM-22B is a large, multilingual, open-access LLM comprising approximately 22 billion parameters. It is trained from scratch to natively support all twenty-four official languages of the European Union, in addition to eleven further high- and low-resource languages. The model is optimized to address the persistent underrepresentation of European languages in existing open LLMs. EuroLLM-22B demonstrates strong performance in multilingual reasoning, instruction following, and translation benchmarks, and all model checkpoints, data artifacts, and codebases are released under fully open terms (Ramos et al., 5 Feb 2026).
1. Tokenizer and Subword Vocabulary
EuroLLM-22B employs a Byte-Pair Encoding (BPE) tokenizer, inherited from the previous EuroLLM releases at 1.7B and 9B scale. The tokenizer operates directly on raw text, without requiring language-specific pretokenization, and is designed to produce subword units that balance granularity between whole words and individual characters. The vocabulary contains 128,000 tokens, a size selected to efficiently capture the inflectional morphology and named entities typical of the 35 target languages while keeping embedding size and tokenized sequence length manageable.
Token coverage in the pretraining corpus is measured by cumulative frequency:

$$C(V) = \frac{\sum_{i=1}^{V} f_i}{\sum_{j=1}^{N} f_j},$$

where $f_i$ denotes the frequency of the $i$-th most frequent subword and $N$ is the total number of observed subword types. At the chosen vocabulary size $V$, coverage reaches approximately 99.3% of observed byte-pair tokens. This vocabulary design enables efficient handling of both high- and low-resource European languages.
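The coverage statistic above can be sketched in a few lines; the toy frequencies below are illustrative, not the paper's corpus:

```python
from collections import Counter

def cumulative_coverage(token_freqs, vocab_size):
    """Fraction of corpus tokens covered by the `vocab_size` most frequent subwords."""
    freqs = sorted(token_freqs.values(), reverse=True)
    return sum(freqs[:vocab_size]) / sum(freqs)

# Toy corpus: frequencies of hypothetical subword units.
freqs = Counter({"the": 50, "un": 20, "der": 15, "rep": 10, "##z": 5})
print(cumulative_coverage(freqs, 3))  # top-3 subwords cover 85/100 tokens -> 0.85
```

At EuroLLM's scale the same computation runs over the full subword frequency table of the pretraining corpus.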
2. Model Architecture
EuroLLM-22B is constructed as a unidirectional, decoder-only Transformer with 54 layers and a hidden size of 6,144. The total parameter count is 22.639 billion, of which 21.067B are non-embedding parameters; the remainder sits in the untied input embedding and output projection matrices. The feed-forward network uses a hidden size of 16,384. The attention mechanism implements 48 attention heads with Grouped Query Attention (GQA) using eight key/value heads. Rotary Position Embedding (RoPE) scaling is used, with the final training phase raising the rotary base to $\theta = 10^6$ to enable sequence lengths up to 32K tokens.
Key architectural features include:
- Grouped Query Attention (GQA) with eight key/value heads for inference efficiency and preserved quality.
- Pre-layer normalization via RMSNorm.
- SwiGLU activations in the feed-forward paths.
- RoPE with extended scaling for long-context operation.
- No weight tying between embeddings and output heads, maintaining flexibility in representation.
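A minimal NumPy sketch of the GQA pattern listed above, in which each group of six query heads (48 / 8) shares one key/value head; shapes and the masking style are illustrative, not the training implementation:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_heads=48, n_kv_heads=8):
    """Minimal causal GQA. q: (T, n_heads, d_h); k, v: (T, n_kv_heads, d_h)."""
    T, _, d_h = q.shape
    group = n_heads // n_kv_heads                      # 6 query heads per KV head
    k = np.repeat(k, group, axis=1)                    # broadcast KV to all heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(d_h)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)  # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True)) # softmax over key positions
    w /= w.sum(-1, keepdims=True)
    return np.einsum("hts,shd->thd", w, v)

T, d_h = 4, 16
out = grouped_query_attention(
    np.random.randn(T, 48, d_h), np.random.randn(T, 8, d_h), np.random.randn(T, 8, d_h))
print(out.shape)  # (4, 48, 16)
```

Storing 8 rather than 48 key/value heads shrinks the KV cache six-fold, which is the inference-efficiency gain the GQA bullet refers to.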
The parameter count decomposes as:

$$P_{\text{total}} = P_{\text{emb}} + \sum_{\ell=1}^{L} P_{\ell},$$

where the first term covers the token embedding and output layers and the second term sums over the $L = 54$ Transformer layers, each contributing $P_{\ell}$ parameters.
| | 1.7B | 9B | 22B |
|---|---|---|---|
| Layers | 24 | 42 | 54 |
| Hidden size | 2048 | 4096 | 6144 |
| FFN size | 5632 | 12288 | 16384 |
| Attn heads | 16 | 32 | 48 |
| KV (GQA) | 8 | 8 | 8 |
| Rotary θ | 10⁴ | 10⁴ | 10⁶ |
| Total params | 1.657B | 9.153B | 22.639B |
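The table's configurations can be cross-checked with a rough parameter-count sketch (norms and biases are ignored, and the 128,000-token vocabulary is taken from the tokenizer section; this is an approximation, not the paper's exact accounting):

```python
def transformer_params(L, d, d_ff, n_heads, n_kv, vocab, tied=False):
    """Approximate parameters of a decoder-only model with GQA and SwiGLU."""
    d_h = d // n_heads
    attn = 2 * d * d + 2 * d * (n_kv * d_h)  # Q and O projections, shared K/V
    ffn = 3 * d * d_ff                       # SwiGLU uses three weight matrices
    emb = vocab * d * (1 if tied else 2)     # untied input/output embeddings
    return emb + L * (attn + ffn)

configs = {
    "9B":  dict(L=42, d=4096, d_ff=12288, n_heads=32, n_kv=8),
    "22B": dict(L=54, d=6144, d_ff=16384, n_heads=48, n_kv=8),
}
for name, cfg in configs.items():
    print(name, round(transformer_params(vocab=128_000, **cfg) / 1e9, 2))
# 9B 9.15
# 22B 22.64
```

Both estimates land within rounding distance of the table's totals (9.153B and 22.639B), confirming the per-layer arithmetic.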
3. Pre-training Data and Quality Control
Pre-training data totals approximately 4 trillion tokens, organized as a three-phase curriculum (coarse, medium, fine) with successively stricter quality filtering.
Phase 1 (3.6T tokens):
Includes English documents from "FineWeb-Edu" with educational score ≥2 and multilingual web data filtered using perplexity and additional heuristics.
Phase 2 (~200B tokens):
Leverages Nemotron-CC splits, document-level parallel corpora (Europarl, ParaDocs), and STEM/code resources.
Phase 3 (~200B tokens):
Incorporates heavily filtered, high-quality web content, synthetic math data (Qwen-2.5-Math), document-level translation data, and long-context resources (books, code) upsampled for 32K context handling.
For major EU languages (German, French, Spanish, Italian), data derives from RedPajama-Data-v2 and is filtered sequentially on KenLM perplexity, minimum character length, exclusion of synthetic or code-dominated content, uppercase ratio, symbol-to-word ratio, and non-alphabetic fraction. Across all languages, an mDeBERTa-based EuroFilter classifier assigns quality scores (0–5), creating three quality tiers.
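A sequential filter of the kind described above can be sketched as follows; every threshold here is an illustrative placeholder, not the paper's value:

```python
def passes_filters(doc: str, perplexity: float) -> bool:
    """Apply document-level heuristics in sequence; reject on first failure.
    All cutoffs are hypothetical stand-ins for the paper's tuned thresholds."""
    words = doc.split()
    if perplexity > 1000:                      # KenLM perplexity cutoff
        return False
    if len(doc) < 200:                         # minimum character length
        return False
    upper = sum(c.isupper() for c in doc) / max(len(doc), 1)
    if upper > 0.4:                            # uppercase ratio
        return False
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    if alpha < 0.5:                            # non-alphabetic fraction
        return False
    symbols = sum(not c.isalnum() and not c.isspace() for c in doc)
    if symbols / max(len(words), 1) > 0.5:     # symbol-to-word ratio
        return False
    return True

clean = "Die Europäische Union hat vierundzwanzig Amtssprachen, " * 5
print(passes_filters(clean, perplexity=350.0))           # True
print(passes_filters("#### $$$$ ////", perplexity=350.0))  # False
```

Ordering cheap checks after the perplexity gate mirrors the sequential filtering the section describes, where each stage only sees documents that survived the previous one.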
Parallel data are extracted from over 40 public corpora (e.g., Europarl v8, ParaCrawl v9, CCMatrix v1, WikiMatrix v1, OPUS100 v1) and processed with Bifixer, Bicleaner (with a higher threshold for Portuguese), and CometKiwi quality-estimation filtering to ensure translation accuracy.
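Threshold-based bitext filtering of this kind reduces to scoring each sentence pair and keeping those above a cutoff. The sketch below uses a toy length-ratio scorer as a crude stand-in for a quality-estimation model, and the 0.7 threshold is a placeholder, not the paper's cutoff:

```python
def filter_bitext(pairs, score_fn, threshold=0.7):
    """Keep only sentence pairs whose quality score clears the threshold."""
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]

def toy_score(src, tgt):
    """Crude stand-in for a QE model: penalize large length mismatch."""
    ls, lt = len(src.split()), len(tgt.split())
    return min(ls, lt) / max(ls, lt)

pairs = [("the cat sat", "le chat était assis"),
         ("hello", "bonjour tout le monde et bienvenue")]
print(filter_bitext(pairs, toy_score))  # keeps only the first pair
```

In the actual pipeline, `score_fn` would be a learned quality-estimation model applied after Bifixer/Bicleaner cleaning.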
4. Training Methodology
Training is performed using Megatron-LM, with comprehensive data and model parallelism distributed across EuroHPC clusters. The objective is causal language modeling with cross-entropy loss, optimized by Adam. The learning-rate schedule consists of three phases:
- Linear warmup to the peak learning rate over the first 10% of tokens.
- Constant at the peak rate over the next 2T tokens.
- Linear decay over 400B tokens, then annealing to zero.
Context window increases from 4K to 32K tokens during phase 3 via RoPE scaling. Checkpoints are saved every 10B tokens, with the final model selected for post-training.
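The warmup/constant/decay/anneal schedule above can be sketched as a single function of tokens seen; the peak and floor learning-rate values are placeholders, since the paper's exact rates are not given here:

```python
def lr_schedule(tokens_seen, total=4.0e12, peak=3e-4, warmup_frac=0.10,
                constant=2.0e12, decay=4.0e11, floor=3e-5):
    """Warmup -> constant -> linear decay -> anneal-to-zero (rates are placeholders)."""
    warmup = warmup_frac * total
    if tokens_seen < warmup:                              # linear warmup
        return peak * tokens_seen / warmup
    if tokens_seen < warmup + constant:                   # constant phase
        return peak
    if tokens_seen < warmup + constant + decay:           # linear decay to floor
        frac = (tokens_seen - warmup - constant) / decay
        return peak + frac * (floor - peak)
    remaining = total - (warmup + constant + decay)       # final anneal to zero
    frac = (tokens_seen - warmup - constant - decay) / remaining
    return max(floor * (1 - frac), 0.0)

for t in (0, 2e11, 1e12, 2.6e12, 3.9e12):
    print(f"{t:.1e} tokens -> lr {lr_schedule(t):.2e}")
```

With a 4T-token budget, warmup covers 400B tokens, the constant phase runs to 2.4T, decay to 2.8T, and the remaining 1.2T anneals to zero.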
5. Evaluation and Benchmarking
EuroLLM-22B and its instruction-tuned variant, EuroLLM-22B-Instruct, are evaluated on a suite of multilingual and reasoning-focused benchmarks:
- Instruction following: IFEval
- General knowledge: HellaSwag, MMLU, MMLU-ProX, BBH
- STEM: ARC-C, GPQA-Diamond, GSM8K, MATH-500, HumanEval
- Multilingual: mHellaSwag, MMMLU, MMLU-ProX
- Multilingual STEM: mARC-C, MGSM
- Translation: FLORES-200, WMT24++, WMT25
English Results (Table 3):
| Model | IFEval | HellaSwag | MMLU | MMLU-Pro | ARC-C | GSM8K | MATH-500 | HumanEval |
|---|---|---|---|---|---|---|---|---|
| EuroLLM-9B | 62.4 | 53.0 | 65.5 | 42.3 | 85.9 | 74.6 | 36.9 | 50.8 |
| EuroLLM-22B | 67.2 | 69.7 | 69.8 | 50.8 | 89.8 | 85.5 | 54.5 | 53.9 |
| Apertus-8B | 59.1 | 58.1 | 57.3 | 32.7 | 75.5 | 67.7 | 26.9 | 39.0 |
| Apertus-70B | 61.2 | 74.6 | 67.9 | 41.9 | 84.7 | 80.0 | 42.3 | 44.5 |
Multilingual Results (Table 4) — Aggregated over 24 EU languages:
| Model | HellaSwag | MMMLU | ARC-C | FLORES | WMT24++ |
|---|---|---|---|---|---|
| EuroLLM-9B | 49.9 | 61.5 | 80.7 | 88.9 | 83.6 |
| EuroLLM-22B | 62.6 | 65.6 | 84.1 | 88.9 | 83.9 |
| Apertus-8B | 50.9 | 54.0 | 71.0 | 87.8 | 81.5 |
| Apertus-70B | 68.6 | 61.7 | 79.6 | 85.1 | 76.0 |
Ablation with the updated post-training (EuroBlocks-SFT-2512) yields consistent +5–10 point improvements in reasoning and instruction-following tasks, with translation quality unaffected.
6. Released Resources and Reproducibility
All models, training corpora, and evaluation code are released openly on HuggingFace under the “EuroLLM” collection. Specifically, releases include:
- Base and instruction-tuned models (EuroLLM-22B, EuroLLM-22B-Instruct, improved 9B variants)
- EuroWeb multilingual web pretraining data (three quality tiers)
- EuroBlocks-SFT-2512 instruction data (approximately 10.6 million multilingual examples)
- Megatron-LM fork for pretraining and the eurollm-eval evaluation suite
These releases emphasize full reproducibility for the multilingual LLM community and support extensibility to further European and non-European languages.
7. Significance and Impact
EuroLLM-22B establishes state-of-the-art performance among fully-open European multilingual LLMs at this scale, exceeding prior open models in general reasoning, STEM, and instruction-following benchmarks across English and 24 EU languages. The integration of a balanced BPE vocabulary, rigorous multi-phase data filtering, scale-efficient attention mechanisms, and long-context support addresses longstanding gaps in high-quality, openly accessible multilingual LLM resources for Europe. The openly released data, models, and codebases are explicitly designed to foster research and technology development serving the full linguistic diversity of Europe (Ramos et al., 5 Feb 2026).