
EuroLLM-22B: Open Multilingual LLM

Updated 6 February 2026
  • EuroLLM-22B is a large, multilingual language model with 22B parameters trained from scratch to support 35 diverse languages.
  • It features a unidirectional, decoder-only Transformer architecture with 54 layers and grouped query attention for efficient long-context handling.
  • The model is pre-trained on 4 trillion tokens over a three-phase curriculum, achieving state-of-the-art performance in multilingual reasoning, translation, and instruction following.

EuroLLM-22B is a large, multilingual, open-access LLM comprising approximately 22 billion parameters. It is trained from scratch to natively support all twenty-four official languages of the European Union, in addition to eleven further high- and low-resource languages. The model is optimized to address the persistent underrepresentation of European languages in existing open LLMs. EuroLLM-22B demonstrates strong performance in multilingual reasoning, instruction following, and translation benchmarks, and all model checkpoints, data artifacts, and codebases are released under fully open terms (Ramos et al., 5 Feb 2026).

1. Tokenizer and Subword Vocabulary

EuroLLM-22B employs a Byte-Pair Encoding (BPE) tokenizer inherited from the previous EuroLLM releases at 1.7B and 9B scale. The tokenizer operates directly on raw text, with no language-specific pretokenization, and produces subword units that balance granularity between whole words and individual characters. The vocabulary contains V = 128,000 tokens, a size chosen to capture the inflectional morphology and named entities typical of the 35 target languages while keeping the embedding-matrix size and tokenized sequence lengths manageable.

Token coverage in the pretraining corpus is measured by cumulative frequency:

\mathrm{Coverage}(K) = \frac{\sum_{i=1}^{K} f_i}{\sum_{i=1}^{V} f_i}

where f_i denotes the frequency of the i-th subword, ranked by frequency. At K = 50,000, coverage reaches approximately 99.3% of observed byte-pair tokens. This vocabulary design enables efficient handling of both high- and low-resource European languages.
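The coverage statistic can be computed directly from a ranked frequency table. The sketch below is illustrative, using a toy frequency list rather than actual corpus counts:

```python
def coverage(freqs, k):
    """Cumulative-frequency coverage of the top-k subwords: Coverage(K)."""
    freqs = sorted(freqs, reverse=True)  # rank subwords by frequency
    return sum(freqs[:k]) / sum(freqs)

# Toy Zipf-like distribution: a few common subwords dominate the counts.
toy_freqs = [1000, 500, 250, 100, 50, 25, 10, 5]
print(round(coverage(toy_freqs, 4), 3))  # share of token occurrences covered by the top 4
```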

2. Model Architecture

EuroLLM-22B is constructed as a unidirectional, decoder-only Transformer with 54 layers and a hidden size of 6,144. The total parameter count is 22.639 billion, divided between 0.768B embedding parameters and 21.067B non-embedding parameters. The feed-forward network utilizes a hidden size d_ff of 16,384. The attention mechanism implements 48 attention heads with Grouped Query Attention (GQA) using eight key/value heads. Rotary Position Embedding (RoPE) scaling is used, with the final training phase employing θ = 10⁶ to enable sequence lengths up to 32K tokens.

Key architectural features include:

  • Grouped Query Attention (GQA) with eight key/value heads for inference efficiency and preserved quality.
  • Pre-layer normalization via RMSNorm.
  • SwiGLU activations in the feed-forward paths.
  • RoPE with extended scaling for long-context operation.
  • No weight tying between embeddings and output heads, maintaining flexibility in representation.
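As an illustration of the GQA trade-off (the figures below are derived from the stated architecture, not reported in the release), the key/value cache at inference shrinks in proportion to the ratio of query heads to KV heads:

```python
def kv_cache_bytes(n_kv_heads, head_dim, n_layers, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence; 2x for keys and values, fp16/bf16 elements."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

head_dim = 6144 // 48  # 128, from the hidden size and head count
mha = kv_cache_bytes(48, head_dim, 54, 32_768)  # hypothetical full multi-head attention
gqa = kv_cache_bytes(8, head_dim, 54, 32_768)   # GQA with eight KV heads, as in EuroLLM-22B
print(mha // gqa)  # cache shrinks by 48/8 = 6x at 32K context
```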

The parameter count PP is given by:

P = 2\,(d_{\text{model}} \times V) + N_\ell \times P_\ell

where the first term covers the token embedding and untied output layers, and the second term is the sum across N_ℓ Transformer layers, each with P_ℓ parameters.

| | 1.7B | 9B | 22B |
|---|---|---|---|
| Layers | 24 | 42 | 54 |
| Hidden size | 2048 | 4096 | 6144 |
| FFN size | 5632 | 12288 | 16384 |
| Attn heads | 16 | 32 | 48 |
| KV heads (GQA) | 8 | 8 | 8 |
| Rotary θ | 10⁴ | 10⁴ | 10⁶ |
| Total params | 1.657B | 9.153B | 22.639B |
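The closed-form count can be sanity-checked against the 22B configuration. The sketch below is an approximation that ignores the small RMSNorm vectors, which is why it lands marginally under the reported 22.639B:

```python
d, V, N = 6144, 128_000, 54          # hidden size, vocab, layers
d_ff, n_heads, n_kv = 16_384, 48, 8  # FFN size, query heads, KV heads
head_dim = d // n_heads              # 128
kv_dim = n_kv * head_dim             # 1024 with GQA

emb = 2 * d * V                           # input embeddings + untied output head
attn = d * d + 2 * d * kv_dim + d * d    # Q, K, V, O projections
ffn = 3 * d * d_ff                       # SwiGLU: gate, up, and down matrices
per_layer = attn + ffn

total = emb + N * per_layer
print(round(total / 1e9, 2))  # ~22.64B, close to the reported 22.639B
```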

3. Pre-training Data and Quality Control

Pre-training data totals approximately 4 trillion tokens and is sourced across a three-phase curriculum: coarse, medium, and fine, with successively greater data quality.

Phase 1 (3.6T tokens):

Includes English documents from "FineWeb-Edu" with educational score ≥2 and multilingual web data filtered using perplexity and additional heuristics.

Phase 2 (~200B tokens):

Leverages Nemotron-CC splits, document-level parallel corpora (Europarl, ParaDocs), and STEM/code resources.

Phase 3 (~200B tokens):

Incorporates fine-tuned high-quality web content, synthetic math data (Qwen-2.5-Math), document-level translation data, and long-context resources (books, code) upsampled for 32K context handling.

For major EU languages (German, French, Spanish, Italian), data derives from RedPajama-Data-v2 and is filtered sequentially on KenLM perplexity, minimum character length, exclusion of synthetic or code-dominated content, uppercase ratio, symbol-to-word ratio, and non-alphabetic fraction. Across all languages, an mDeBERTa-based EuroFilter classifier assigns quality scores (0–5), creating three quality tiers.
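A minimal sketch of such a sequential heuristic chain is shown below; the thresholds are illustrative placeholders, not the values used in EuroLLM's actual filtering pipeline:

```python
def passes_heuristics(doc, min_chars=200, max_upper=0.3, max_symbol=0.5):
    """Apply document-quality filters in sequence; thresholds are illustrative."""
    # Minimum character length
    if len(doc) < min_chars:
        return False
    letters = [c for c in doc if c.isalpha()]
    if not letters:
        return False
    # Uppercase ratio (rejects shouty or boilerplate-heavy pages)
    if sum(c.isupper() for c in letters) / len(letters) > max_upper:
        return False
    # Symbol-to-word ratio (rejects code-dominated or markup-heavy content)
    words = doc.split()
    symbols = sum(not c.isalnum() and not c.isspace() for c in doc)
    if symbols / max(len(words), 1) > max_symbol:
        return False
    return True
```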

Parallel data are extracted from over 40 public corpora (e.g., Europarl v8, ParaCrawl v9, CCMatrix v1, WikiMatrix v1, OPUS100 v1) and processed with Bifixer, Bicleaner (higher threshold for Portuguese), and CometKiwi (score ≥ 0.7) to ensure translation accuracy.

4. Training Methodology

Training is performed using Megatron-LM, with comprehensive data and model parallelism distributed across EuroHPC clusters. The objective is causal language modeling with cross-entropy loss, optimized by Adam. The learning-rate schedule consists of three phases:

  1. Linear warmup to η_max = 1.5 × 10⁻⁴ over 10% of tokens.
  2. Constant η_max over the next ~2T tokens.
  3. Linear decay to η_min = 0.1 · η_max over 400B tokens, then to zero.
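The steps above can be sketched as a piecewise function of tokens seen. The phase boundaries here are approximate readings of the description (10% warmup of a 4T-token budget, ~2T constant, 400B decay), not exact values from the paper:

```python
def lr(tokens, total=4.0e12, eta_max=1.5e-4):
    """Three-phase schedule: linear warmup, constant, then two linear decays."""
    warmup = 0.1 * total              # warmup over 10% of the token budget
    const_end = warmup + 2.0e12       # ~2T tokens held at eta_max
    decay1_end = const_end + 0.4e12   # decay to 0.1*eta_max over 400B tokens
    eta_min = 0.1 * eta_max
    if tokens < warmup:
        return eta_max * tokens / warmup
    if tokens < const_end:
        return eta_max
    if tokens < decay1_end:
        frac = (tokens - const_end) / (decay1_end - const_end)
        return eta_max - frac * (eta_max - eta_min)
    # Final stretch: decay from eta_min to zero by end of training
    frac = (tokens - decay1_end) / max(total - decay1_end, 1)
    return max(eta_min * (1 - frac), 0.0)
```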

Context window increases from 4K to 32K tokens during phase 3 via RoPE scaling. Checkpoints are saved every 10B tokens, with the final model selected for post-training.
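Raising the rotary base θ stretches the positional wavelengths, which is what makes the longer context usable. The sketch below applies the standard RoPE frequency formula with the model's head dimension of 128 (6144 / 48); it illustrates the scaling rather than reproducing the paper's exact scheme:

```python
import math

def rope_wavelengths(theta, head_dim=128):
    """Rotary period (in tokens) for each frequency pair: 2*pi*theta^(2i/d)."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

short = rope_wavelengths(1.0e4)  # base theta used in the earlier 4K-context phases
long_ = rope_wavelengths(1.0e6)  # theta = 1e6 used for the 32K-context phase
# The lowest-frequency band stretches by roughly (1e6/1e4)^((d-2)/d) ~ 93x,
# while the highest-frequency band (i = 0) is unchanged.
print(round(long_[-1] / short[-1]))
```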

5. Evaluation and Benchmarking

EuroLLM-22B and its instruction-tuned variant, EuroLLM-22B-Instruct, are evaluated on a suite of multilingual and reasoning-focused benchmarks:

  • Instruction following: IFEval
  • General knowledge: HellaSwag, MMLU, MMLU-ProX, BBH
  • STEM: ARC-C, GPQA, GSM8K, MATH-500, HumanEval
  • Multilingual: mHellaSwag, MMMLU, MMLU-ProX
  • Multilingual STEM: mARC-C, MGSM
  • Translation: FLORES-200, WMT24++, WMT25

English Results (Table 3):

| Model | IFEval | HellaSwag | MMLU | MMLU-Pro | ARC-C | GSM8K | MATH-500 | HumanEval |
|---|---|---|---|---|---|---|---|---|
| EuroLLM-9B | 62.4 | 53.0 | 65.5 | 42.3 | 85.9 | 74.6 | 36.9 | 50.8 |
| EuroLLM-22B | 67.2 | 69.7 | 69.8 | 50.8 | 89.8 | 85.5 | 54.5 | 53.9 |
| Apertus-8B | 59.1 | 58.1 | 57.3 | 32.7 | 75.5 | 67.7 | 26.9 | 39.0 |
| Apertus-70B | 61.2 | 74.6 | 67.9 | 41.9 | 84.7 | 80.0 | 42.3 | 44.5 |

Multilingual Results (Table 4) — Aggregated over 24 EU languages:

| Model | HellaSwag | MMMLU | ARC-C | FLORES | WMT24++ |
|---|---|---|---|---|---|
| EuroLLM-9B | 49.9 | 61.5 | 80.7 | 88.9 | 83.6 |
| EuroLLM-22B | 62.6 | 65.6 | 84.1 | 88.9 | 83.9 |
| Apertus-8B | 50.9 | 54.0 | 71.0 | 87.8 | 81.5 |
| Apertus-70B | 68.6 | 61.7 | 79.6 | 85.1 | 76.0 |

Ablation with the updated post-training (EuroBlocks-SFT-2512) yields consistent +5–10 point improvements in reasoning and instruction-following tasks, with translation quality unaffected.

6. Released Resources and Reproducibility

All models, training corpora, and evaluation code are released openly on HuggingFace under the “EuroLLM” collection. Specifically, releases include:

  • Base and instruction-tuned models (EuroLLM-22B, EuroLLM-22B-Instruct, improved 9B variants)
  • EuroWeb multilingual web pretraining data (three quality tiers)
  • EuroBlocks-SFT-2512 instruction data (approximately 10.6 million multilingual examples)
  • Megatron-LM fork for pretraining and the eurollm-eval evaluation suite

This release emphasizes full reproducibility for the multilingual LLM community and supports extension to further European and non-European languages.

7. Significance and Impact

EuroLLM-22B establishes state-of-the-art performance among fully-open European multilingual LLMs at this scale, exceeding prior open models in general reasoning, STEM, and instruction-following benchmarks across English and 24 EU languages. The integration of a balanced BPE vocabulary, rigorous multi-phase data filtering, scale-efficient attention mechanisms, and long-context support addresses longstanding gaps in high-quality, openly accessible multilingual LLM resources for Europe. The openly released data, models, and codebases are explicitly designed to foster research and technology development serving the full linguistic diversity of Europe (Ramos et al., 5 Feb 2026).
