EuroBlocks Instruction Datasets
- EuroBlocks instruction datasets are large-scale, multilingual supervised fine-tuning corpora offering 10.6 million instruction-response pairs across general, coding, mathematical, and STEM domains.
- The dataset employs automated filtering, reward model scoring, and deduplication to ensure high-quality, diverse instruction data across all 24 official EU languages and key non-EU languages.
- When applied in tuning models like EuroLLM-22B, EuroBlocks delivers significant performance gains, including up to +10 pp improvements on instruction following, STEM, and multilingual benchmarks.
EuroBlocks instruction datasets are large-scale, multilingual supervised fine-tuning (SFT) corpora developed for post-training instruction tuning of LLMs, explicitly designed to cover all 24 official European Union (EU) languages plus a set of strategically selected non-EU languages. They serve as the centerpiece of the EuroLLM-22B instruction-tuning pipeline, targeting diverse, high-quality instruction–response data across general, coding, mathematical, and STEM domains. These datasets underpin improvements in cross-lingual instruction following and domain robustness, with substantial quantitative impact on benchmarks relevant to reasoning, translation, and STEM performance (Ramos et al., 5 Feb 2026).
1. Motivation and Design Goals
The EuroBlocks datasets address structural underrepresentation of European languages and highly multilingual contexts in existing open LLM instruction corpora. Their primary objective is to construct a high-quality, broad-coverage instruction–response pairing resource that balances linguistic diversity, domain breadth, and instructional fidelity. Specific aims include:
- Comprehensive coverage of the 24 official EU languages, ensuring no language is left in the “long tail.”
- Supplementation with additional globally significant non-EU languages to enhance generalizability and functional reach.
- Multidomain coverage, balancing general conversational data with targeted code, mathematical, and STEM task instructions.
- Maximization of instructional utility through automated filtering and quality selection techniques.
2. Dataset Composition and Language Distribution
The released EuroBlocks-SFT-2512 dataset comprises approximately 10.6 million instruction–response pairs. Its composition distributes by category as follows:
- ~60% English “general” assistant conversation and Q&A.
- ~20% non-English multilingual general instruction pairs.
- ~20% code/math/STEM-focused examples.
Language distribution within the “general” instruction subset reflects both a broad base and intentional diversification: Chinese (10.24%), Spanish (9.40%), French (9.26%), German (8.37%), and Italian (8.03%) lead, with the remaining languages forming a long statistical tail that includes the official EU languages and thirteen globally important others. The full language set is detailed in Table 1.
| Category | Languages | Coverage / Share |
|---|---|---|
| Official EU | Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish | All covered |
| Non-EU / key others | Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, Ukrainian | 13 additional covered |
| Largest “general” shares | Chinese, Spanish, French, German, Italian | 10.24%–8.03% each |
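For rough planning, the reported category shares can be converted into approximate example counts. The sketch below is illustrative only; the ~10.6M total and the ~60/20/20 split come from this section, while the helper and category names are assumptions:

```python
# Approximate per-category example counts in EuroBlocks-SFT-2512,
# derived from the reported ~60/20/20 split of ~10.6M pairs.
TOTAL_PAIRS = 10_600_000

CATEGORY_SHARES = {
    "english_general": 0.60,       # English assistant conversation / Q&A
    "multilingual_general": 0.20,  # non-English general instructions
    "code_math_stem": 0.20,        # code / math / STEM examples
}

def approximate_counts(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Convert fractional shares into rounded example counts."""
    return {name: round(total * share) for name, share in shares.items()}

counts = approximate_counts(TOTAL_PAIRS, CATEGORY_SHARES)
print(counts)
# {'english_general': 6360000, 'multilingual_general': 2120000, 'code_math_stem': 2120000}
```

These counts are back-of-the-envelope estimates; the released dataset's exact per-category sizes may differ from the rounded percentages.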
3. Data Sources and Curation Pipeline
EuroBlocks are built from a blend of carefully curated public instruction sources and high-quality synthetic data:
- Public instruction datasets: OpenHermes-2.5 (Teknium), Aya (Singh et al. 2024), Magpie (Xu et al. 2024), Hermes-3 (Teknium 2024), Tülu 3 (Lambert et al. 2025), Nemotron-V1 and V2 (NVIDIA).
- Synthetic data: Model-generated responses via a suite of strong open LLMs (DeepSeek-V3, Qwen2.5-Math, Qwen2.5-Instruct, Llama3.1-70B, Gemma2-27B, and others), with each instruction prompt receiving multiple candidate answers.
- Reward model filtering: Skywork-Gemma2-27B serves as the reward model, scoring each (prompt, answer) pair to automatically select the highest-quality response.
- Large-scale STEM expansions: Two million STEM questions from Nemotron-V1 and additional math questions from Qwen2.5-Math, automatically judged and filtered by Qwen2.5-32B.
No further human annotation or manual validation is applied post-sampling; instead, quality control is delegated to reward-model scoring and deduplication algorithms.
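The best-of-n selection step above can be sketched as follows. The `score_fn` callable is a placeholder for the actual reward model (Skywork-Gemma2-27B in the pipeline); the toy scorer at the bottom is purely illustrative:

```python
from typing import Callable

def select_best_response(
    prompt: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Pick the candidate answer with the highest reward-model score.

    score_fn(prompt, answer) stands in for a real reward model such as
    Skywork-Gemma2-27B; any callable returning a scalar score works here.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda answer: score_fn(prompt, answer))

# Toy scorer: prefer longer answers (a real reward model scores quality).
toy_score = lambda prompt, answer: len(answer)
best = select_best_response(
    "Explain RMSNorm.",
    ["Short.", "A longer, fuller answer."],
    toy_score,
)
print(best)  # "A longer, fuller answer."
```

In practice the reward model would be batched over all candidate answers per prompt, but the selection logic reduces to this argmax.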
4. Dataset Structure, Formatting, and Deduplication
The EuroBlocks dataset is released as a JSON-lines file suitable for efficient ingestion and streaming. Each record minimally contains:
- `instruction`: user prompt or task description (string).
- `response`: automatically selected model-generated answer (string).
- `lang` (optional): ISO 639-1 language code of the instruction.
Example:
```json
{
  "instruction": "Summarize the causes of World War I in three bullet points.",
  "response": "- The assassination of Archduke Franz Ferdinand.\n- Rising nationalism and militarism among European powers.\n- Complex web of alliances drawing countries into conflict.",
  "lang": "en"
}
```
- Explicit removal of chain-of-thought or reasoning traces from both prompt and response; only final-answer style pairs are retained.
- Instruction-level deduplication to ensure unique prompts.
- Automatic discard of samples with formatting errors or missing fields.
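The validation and deduplication steps above can be sketched over a JSONL stream. The normalization choice (lowercasing and whitespace collapsing before hashing) is an illustrative assumption, not the pipeline's documented algorithm:

```python
import json
import hashlib

REQUIRED_FIELDS = ("instruction", "response")

def clean_jsonl(lines):
    """Yield valid, instruction-deduplicated records from JSONL lines.

    Mirrors the filtering described above: records with formatting errors
    or missing fields are discarded, and duplicate instructions (after a
    simple whitespace/case normalization, assumed here) are kept only once.
    """
    seen = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # formatting error -> discard
        if not all(record.get(field) for field in REQUIRED_FIELDS):
            continue  # missing field -> discard
        key = hashlib.sha256(
            " ".join(record["instruction"].lower().split()).encode()
        ).hexdigest()
        if key in seen:
            continue  # duplicate instruction -> discard
        seen.add(key)
        yield record

raw = [
    '{"instruction": "Say hi", "response": "Hi!", "lang": "en"}',
    '{"instruction": "say  HI", "response": "Hello!"}',  # duplicate after normalization
    '{"instruction": "Broken json",',                    # formatting error
]
print(len(list(clean_jsonl(raw))))  # 1
```

At the 10.6M-example scale, the `seen` set would typically be replaced by an on-disk or sharded structure, but the per-record logic is the same.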
5. Usage in Instruction Tuning and Benchmark Impact
EuroBlocks-SFT-2512 is central to instruction tuning of EuroLLM-22B. The fine-tuning configuration includes a 32,768-token context window, mixed-precision (bfloat16) training, sequence packing, a cosine learning-rate schedule with 125 warmup steps, and loss computed only on target tokens. The Axolotl and Liger-Kernel codebases provide efficient kernel implementations (RoPE, RMSNorm, SwiGLU, fused linear layers).
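Computing the loss only on target tokens is usually implemented by masking prompt positions in the label sequence. A minimal sketch, assuming the -100 ignore index used by cross-entropy in common training frameworks and toy token IDs:

```python
# The -100 ignore index follows the cross-entropy convention used by
# common frameworks (e.g. PyTorch's CrossEntropyLoss ignore_index).
IGNORE_INDEX = -100

def build_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids, masking the prompt span so loss covers only the response."""
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len out of range")
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

input_ids = [5, 8, 13, 21, 34, 55]  # toy IDs: prompt = first 3 tokens, response = rest
labels = build_labels(input_ids, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 34, 55]
```

With sequence packing, the same masking is applied per packed segment so that no prompt token in any segment contributes to the loss.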
Empirical ablation studies show that models instruction-tuned with EuroBlocks exhibit:
- +8–10 percentage point (pp) gain on IFEval (English instruction following).
- +5–10 pp improvement on MMLU and STEM-specific benchmarks.
- +5–7 pp improvement on multilingual knowledge and STEM tasks relative to earlier EuroLLM instruction checkpoints (Ramos et al., 5 Feb 2026).
6. Automatic Translation Approaches for Instruction Datasets
Instruction datasets for underrepresented languages may be extended or constructed using advanced translation frameworks, notably InstaTrans (INSTruction-Aware TRANSlation) (Kim et al., 2024). InstaTrans utilizes a two-phase process:
- GPT-4–powered, prompt-engineered seed translation with function-calling, ensuring full preservation of instructional content and strict separation of translation from answer generation.
- Further fine-tuning of strong open-source instruction-tuned LLMs (≥7B parameters) on the seed set, optimizing for completeness and instruction-awareness (informativeness), as measured by automatic GPT-4 scoring.
This approach mitigates long-tail data scarcity by scaling high-quality target-language data, enforcing token-level completeness, and sustaining fine-grained instruction-attribute fidelity. Quantitative evaluation shows that translators fine-tuned via InstaTrans achieve BLEU, COMET, and GEMBA scores within 2–4 points of GPT-4 translations for Korean, and gains of +5–10 BLEU and +5–8 COMET over commercial baselines for European languages, with an instruction-awareness ratio exceeding 90%.
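The strict separation of translation from answer generation in InstaTrans's first phase can be illustrated with a field-wise prompt that forbids the translator from answering. The wording and JSON schema below are hypothetical sketches, not the paper's actual prompt; InstaTrans itself uses GPT-4 with function-calling for this step:

```python
import json

def build_translation_request(instruction: str, response: str, target_lang: str) -> str:
    """Build a field-wise translation prompt that keeps translation separate
    from answer generation, in the spirit of InstaTrans's first phase.

    The prompt wording and payload keys here are illustrative assumptions.
    """
    payload = {"instruction": instruction, "response": response}
    return (
        f"Translate every field of the JSON object below into {target_lang}. "
        "Translate only; do not answer the instruction or add content. "
        "Return JSON with the same keys.\n"
        + json.dumps(payload, ensure_ascii=False)
    )

req = build_translation_request(
    "Name three EU capitals.", "Paris, Rome, Berlin.", "German"
)
print("do not answer" in req)  # True
```

Keeping the instruction and response as separate fields lets completeness and instruction-awareness be checked per field after translation, rather than over a single mixed string.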
7. Significance and Broader Context
EuroBlocks instruction datasets set a precedent in multilingual LLM supervision by prioritizing both the breadth of linguistic coverage and the vertical depth of task types. They enable robust, competitive performance on multilingual benchmarks and specialized STEM/knowledge tasks, validating the use of high-quality, reward-filtered synthetic instruction generation at unprecedented scale. The dataset architecture and curation choices illustrate a broader trend toward automated, scoring-based dataset pipelines that minimize manual annotation, relying instead on alignment models and large-scale open-source generation capabilities.
The integration of automated translation frameworks like InstaTrans further extends the methodology for rapid, instruction-aware dataset creation in additional languages, reinforcing the utility and adaptability of the EuroBlocks model and laying groundwork for the expansion of high-performance LLMs to all linguistic domains represented within and beyond the EU (Kim et al., 2024, Ramos et al., 5 Feb 2026).