
EuroBlocks Instruction Datasets

Updated 6 February 2026
  • EuroBlocks instruction datasets are large-scale, multilingual supervised fine-tuning corpora offering 10.6 million instruction-response pairs across general, coding, mathematical, and STEM domains.
  • The dataset employs automated filtering, reward-model scoring, and deduplication to ensure high quality, diversity, and instruction fidelity across all 24 official EU languages and key non-EU languages.
  • When applied in tuning models like EuroLLM-22B, EuroBlocks delivers significant performance gains, including up to +10 pp improvements on instruction following, STEM, and multilingual benchmarks.

EuroBlocks instruction datasets are large-scale, multilingual supervised fine-tuning (SFT) corpora developed for use in post-training instruction tuning of LLMs, with explicit design to cover all 24 official European Union (EU) languages and a set of strategically selected non-EU languages. They serve as the centerpiece of the EuroLLM-22B instruct-tuning pipeline, targeting diverse, high-quality instruction–response data across general, coding, mathematical, and STEM domains. These datasets underpin improvements in cross-lingual instruction following and domain robustness, with substantial quantitative impacts on benchmarks relevant to reasoning, translation, and STEM performance (Ramos et al., 5 Feb 2026).

1. Motivation and Design Goals

The EuroBlocks datasets address structural underrepresentation of European languages and highly multilingual contexts in existing open LLM instruction corpora. Their primary objective is to construct a high-quality, broad-coverage instruction–response pairing resource that balances linguistic diversity, domain breadth, and instructional fidelity. Specific aims include:

  • Comprehensive coverage of the 24 official EU languages, ensuring no language is left in the “long tail.”
  • Supplementation with additional globally significant non-EU languages to enhance generalizability and functional reach.
  • Multidomain coverage, balancing general conversational data with targeted code, mathematical, and STEM task instructions.
  • Maximization of instructional utility through automated filtering and quality selection techniques.

2. Dataset Composition and Language Distribution

The released EuroBlocks-SFT-2512 dataset comprises approximately 10.6 million instruction–response pairs, distributed by category as follows:

  • ~60% English “general” assistant conversation and Q&A.
  • ~20% non-English multilingual general instruction pairs.
  • ~20% code/math/STEM-focused examples.
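Given the ~10.6 million total, the stated percentages imply roughly the following per-category counts (a back-of-envelope sketch; exact counts are not published here):

```python
# Rough split of the ~10.6M pairs by the stated category percentages.
TOTAL_PAIRS = 10_600_000  # approximate released size

SPLIT = {
    "english_general": 0.60,       # English assistant conversation / Q&A
    "multilingual_general": 0.20,  # non-English general instruction pairs
    "code_math_stem": 0.20,        # code / math / STEM examples
}

counts = {name: round(TOTAL_PAIRS * frac) for name, frac in SPLIT.items()}
for name, n in counts.items():
    print(f"{name}: ~{n:,}")
```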

Language distribution within the “general” instruction subset reflects both a broad base and intentional diversification: Chinese (10.24%), Spanish (9.40%), French (9.26%), German (8.37%), Italian (8.03%), with the remaining languages forming a long statistical tail that includes the EU official languages and eleven globally important others. The full language set is detailed in Table 1.

Table 1: Language coverage in EuroBlocks-SFT-2512.

| Category | Languages | Coverage |
| --- | --- | --- |
| Official EU | Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish | All 24 covered |
| Non-EU / key others | Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, Ukrainian | 11 additional covered |
| Largest “general” shares | Chinese, Spanish, French, German, Italian | 10.24%–8.03% each |

3. Data Sources and Curation Pipeline

EuroBlocks are built from a blend of carefully curated public instruction sources and high-quality synthetic data:

  • Public instruction datasets: OpenHermes-2.5 (Teknium), Aya (Singh et al. 2024), Magpie (Xu et al. 2024), Hermes-3 (Teknium 2024), Tülu 3 (Lambert et al. 2025), Nemotron-V1 and V2 (NVIDIA).
  • Synthetic data: Model-generated responses via a suite of strong open LLMs (DeepSeek-V3, Qwen2.5-Math, Qwen2.5-Instruct, Llama3.1-70B, Gemma2-27B, and others), with each instruction prompt receiving multiple candidate answers.
  • Reward-model filtering: Skywork-Gemma2-27B serves as the reward model, scoring each (prompt + answer) pair to automatically select the highest-quality response.
  • Large-scale STEM expansions: Two million STEM questions from Nemotron-V1 and additional math questions from Qwen2.5-Math, automatically judged and filtered by Qwen2.5-32B.

No further human annotation or manual validation is applied post-sampling; instead, quality control is delegated to reward-model scoring and deduplication algorithms.
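The best-of-n selection step described above can be sketched as follows. This is a minimal illustration: `score` stands in for the reward model (Skywork-Gemma2-27B in the actual pipeline), and the toy scorer in the usage line is purely hypothetical:

```python
from typing import Callable, List

def select_best_response(
    prompt: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Pick the candidate answer with the highest reward-model score.

    `score` abstracts the reward model, which in the actual pipeline
    scores the concatenated (prompt + answer) pair.
    """
    if not candidates:
        raise ValueError("no candidate responses to select from")
    return max(candidates, key=lambda ans: score(prompt, ans))

# Usage with a toy scoring function (illustrative only):
toy_score = lambda p, a: len(a)  # stand-in: prefer longer answers
best = select_best_response("Explain RoPE.", ["short", "a longer answer"], toy_score)
```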

4. Dataset Structure, Formatting, and Deduplication

The EuroBlocks dataset is released as a JSON-lines file suitable for efficient ingestion and streaming. Each record minimally contains:

  • instruction: User prompt or task description (string).
  • response: Automatically selected model-generated answer (string).
  • lang: (Optional) ISO 639-1 language code of the instruction.

Example:

```json
{
  "instruction": "Summarize the causes of World War I in three bullet points.",
  "response": "- The assassination of Archduke Franz Ferdinand.\n- Rising nationalism and militarism among European powers.\n- Complex web of alliances drawing countries into conflict.",
  "lang": "en"
}
```

The cleaning and filtering procedure includes:

  • Explicit removal of chain-of-thought or reasoning traces from both prompt and response; only final-answer style pairs are retained.
  • Instruction-level deduplication to ensure unique prompts.
  • Automatic discard of samples with formatting errors or missing fields.
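The deduplication and discard rules above can be sketched as a simple JSON-lines filter (an illustrative reconstruction, not the released pipeline code):

```python
import json

REQUIRED_FIELDS = ("instruction", "response")

def clean_records(lines):
    """Filter JSON-lines records: drop malformed or incomplete samples
    and deduplicate on the instruction text, keeping the first occurrence."""
    seen = set()
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # discard samples with formatting errors
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            continue  # discard samples with missing fields
        key = rec["instruction"].strip()
        if key in seen:
            continue  # instruction-level deduplication
        seen.add(key)
        yield rec

# Usage on a toy sample:
sample = [
    '{"instruction": "a", "response": "x"}',
    '{"instruction": "a", "response": "y"}',  # duplicate instruction
    'not valid json',                          # formatting error
    '{"instruction": "b"}',                    # missing response field
]
cleaned = list(clean_records(sample))
```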

5. Usage in Instruction Tuning and Benchmark Impact

EuroBlocks-SFT-2512 is central to instruction tuning in EuroLLM-22B. The fine-tuning configuration includes a 32,768-token context window, mixed-precision (bfloat16) training, sequence packing, cosine learning-rate scheduling (peak $\mathrm{lr} = 1 \times 10^{-5}$, 125 warmup steps), and loss computed only on the target tokens. The Axolotl and Liger-Kernel codebases enable efficient kernel utilization (RoPE, RMSNorm, SwiGLU, fused linear layers).
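The schedule above (linear warmup into cosine decay) can be sketched as follows; `total_steps` and `min_lr` are illustrative assumptions not specified in the source:

```python
import math

def cosine_lr(step, peak_lr=1e-5, warmup_steps=125,
              total_steps=10_000, min_lr=0.0):
    """Cosine learning-rate schedule with linear warmup.

    peak_lr and warmup_steps match the reported configuration;
    total_steps and min_lr are assumed values for illustration.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```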

Empirical ablation studies show that models instruction-tuned with EuroBlocks exhibit:

  • +8–10 percentage point (pp) gain on IFEval (English instruction following).
  • +5–10 pp improvement on MMLU and STEM-specific benchmarks.
  • +5–7 pp improvement on multilingual knowledge and STEM tasks relative to earlier EuroLLM instruction checkpoints (Ramos et al., 5 Feb 2026).

6. Automatic Translation Approaches for Instruction Datasets

Instruction datasets for underrepresented languages may be extended or constructed using advanced translation frameworks, notably InstaTrans (INSTruction-Aware TRANSlation) (Kim et al., 2024). InstaTrans utilizes a two-phase process:

  1. GPT-4–powered, prompt-engineered seed translation with function-calling, ensuring full preservation of instructional content and strict separation of translation from answer generation.
  2. Further fine-tuning of strong open-source instruction-tuned LLMs (≥7B parameters) on the seed set, optimizing for “completeness” ($C$) and “instruction-awareness” (informativeness, $I$), as measured by automatic GPT-4 scoring.
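Phase 1's strict separation of translation from answer generation can be illustrated with a prompt-construction sketch; the wording here is a hypothetical approximation, not the published InstaTrans prompt:

```python
def build_translation_prompt(instruction: str, target_lang: str) -> str:
    """Build an instruction-aware translation prompt.

    The key constraint, in the spirit of InstaTrans phase 1, is that the
    model must translate the instruction verbatim into the target
    language rather than answer it. The exact phrasing is an assumption.
    """
    return (
        f"Translate the following instruction into {target_lang}. "
        "Preserve all instructional content (constraints, formats, examples) "
        "and do NOT answer the instruction; return only the translation.\n\n"
        f"Instruction:\n{instruction}"
    )
```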

This approach mitigates long-tail effects by scaling high-quality target-language data, enforcing token-level completeness, and preserving fine-grained instruction attributes. Quantitative evaluation shows that translators fine-tuned via InstaTrans achieve BLEU, COMET, and GEMBA scores within 2–4 points of GPT-4 translations for Korean, gains of +5–10 BLEU and +5–8 COMET over commercial baselines for European languages, and an instruction-awareness ratio exceeding 90%.

7. Significance and Broader Context

EuroBlocks instruction datasets set a precedent in multilingual LLM supervision by prioritizing both the breadth of linguistic coverage and the vertical depth of task types. They enable robust, competitive performance on multilingual benchmarks and specialized STEM/knowledge tasks, validating the use of high-quality, reward-filtered synthetic instruction generation at unprecedented scale. The dataset architecture and curation choices illustrate a broader trend toward automated, scoring-based dataset pipelines that minimize manual annotation, relying instead on alignment models and large-scale open-source generation capabilities.

The integration of automated translation frameworks like InstaTrans further extends the methodology for rapid, instruction-aware dataset creation in additional languages, reinforcing the utility and adaptability of the EuroBlocks model and laying groundwork for the expansion of high-performance LLMs to all linguistic domains represented within and beyond the EU (Kim et al., 2024, Ramos et al., 5 Feb 2026).
