- The paper introduces a translate-and-tune pipeline that uses an FP8-quantized AR↔EN translator to bootstrap high-fidelity Arabic instruction data from strong English sources.
- It demonstrates consistent performance gains across parameter scales, with Hala models outperforming their base models by up to +5.1 points on Arabic benchmarks.
- The study highlights a resource-efficient training strategy: roughly \$1,000 of training compute on 8×H100 GPUs plus about \$500 for dataset translation.
Hala: Scalable Arabic-Centric Instruction and Translation Models
Motivation and Context
The Hala technical report addresses the persistent underrepresentation of Arabic in LLM research, particularly in instruction tuning and translation. While multilingual LLMs have achieved broad coverage, depth and cultural alignment for specific languages—especially morphologically rich and diglossic languages like Arabic—remain limited. The scarcity of high-quality Arabic instruction data constrains both the performance and scaling of Arabic-centric models. Hala proposes a language-centric approach, leveraging efficient translation pipelines and scalable fine-tuning strategies to construct and train robust Arabic instruction and translation models across a range of parameter scales.
Translate-and-Tune Pipeline
The core methodology is a translate-and-tune pipeline that systematically bootstraps high-fidelity Arabic instruction data from strong English sources. The process begins with the quantization of a high-capacity AR↔EN translator to FP8, yielding a ∼2× increase in inference throughput without measurable quality loss. This quantized model is used to translate large-scale English instruction datasets into Arabic, forming the basis for subsequent instruction tuning.
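The FP8 step can be pictured with a small numpy sketch of per-tensor E4M3 "fake quantization": scale the tensor into the ±448 representable range, round each value to 4 significant bits, and dequantize. In the report this is done on the translator's weights with LLM Compressor; the function below is a simplified illustration (it ignores subnormals and per-channel scales, and the helper name is not from the paper).

```python
import numpy as np

def quantize_fp8_e4m3(x: np.ndarray):
    """Simulate per-tensor FP8 E4M3 fake quantization: pick one scale so the
    tensor fits in the E4M3 range, round each value to 4 significant bits
    (1 implicit + 3 mantissa bits), and return the dequantized tensor plus
    the scale. Subnormals and per-channel scales are deliberately ignored."""
    FP8_E4M3_MAX = 448.0  # largest finite E4M3 value
    scale = max(float(np.abs(x).max()), 1e-12) / FP8_E4M3_MAX
    scaled = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mant, exp = np.frexp(scaled)          # mant in [0.5, 1), value = mant * 2**exp
    mant = np.round(mant * 16.0) / 16.0   # keep 4 significant bits of mantissa
    return np.ldexp(mant, exp) * scale, scale
```

Applied weight by weight, this halves memory traffic relative to FP16, which is where most of the reported ∼2× throughput gain comes from; real FP8 inference additionally uses native FP8 matmul kernels.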
Figure 2: Cross-lingual translation and fine-tuning pipeline for Liquid 1.2B, illustrating the teacher phase, bootstrapped translation, and final instruction-tuned model creation.
The pipeline consists of the following stages:
- Teacher Phase: A strong AR↔EN model (CohereLabs/command-a-translate-08-2025) is quantized to FP8 using LLM Compressor, maintaining translation quality while doubling throughput.
- Bootstrapped Translator Phase: The quantized teacher translates the Open-Orca dataset; the resulting bilingual data is then used to fine-tune a lightweight model (LiquidAI/LFM2-1.2B) for AR↔EN translation specialized to instruction-style data.
- Corpus Construction: The lightweight translator is used to translate multiple high-quality English instruction datasets (e.g., Hermes 3, SCP-116K, ReAlign-Alpaca, LaMini, Tulu 3, Synthetic Instruct-GPT-J Pairwise) into Arabic, yielding a corpus of approximately 4.5M instruction pairs.
- Model Fine-Tuning and Merging: Arabic-centric models are fine-tuned at 350M, 700M, 1.2B, and 9B parameters. Spherical linear interpolation (slerp) merging is applied to balance Arabic specialization with base model generality.
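The corpus-construction step above has to translate each field of an instruction pair separately and then stitch the Arabic outputs back into pairs. A minimal sketch of that bookkeeping (the helper names and prompt template are illustrative, not from the paper):

```python
def build_translation_requests(examples):
    """Flatten instruction/response pairs into one translation request per
    field, keyed by (row index, field) so outputs can be re-assembled."""
    return [{"key": (i, field),
             "prompt": f"Translate the following text to Arabic:\n{ex[field]}"}
            for i, ex in enumerate(examples)
            for field in ("instruction", "response")]

def assemble_pairs(examples, translations):
    """Re-assemble Arabic instruction pairs from a {(row, field): text} map
    produced by running the requests through the translator."""
    return [{"instruction": translations[(i, "instruction")],
             "response": translations[(i, "response")]}
            for i in range(len(examples))]
```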
Data Construction and Quality Control
A critical aspect of the pipeline is the construction of a high-quality bilingual AR↔EN corpus. The process involves:
- Translating 405K instruction–response pairs from Open-Orca into Arabic using the FP8 translator, yielding paired bilingual tuples.
- Augmenting with a filtered subset of OPUS-100, where a compact judge model (Qwen2.5-3B-Instruct) enforces strict fidelity via binary accept/reject prompts, resulting in 439,592 high-quality pairs from 1M candidates.
- Combining these sources to fine-tune the lightweight AR↔EN translator, which is then used for large-scale dataset translation.
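The judge-based filtering step can be sketched as a simple accept/reject loop. The prompt wording and the `judge` callable below are hypothetical stand-ins for the Qwen2.5-3B-Instruct judge described above:

```python
def filter_pairs(pairs, judge):
    """Keep only (en, ar) pairs the judge accepts. `judge` is any callable
    that maps a prompt to a verdict string; in the paper this role is
    played by Qwen2.5-3B-Instruct with a binary accept/reject prompt."""
    template = ("Source (EN): {en}\nTranslation (AR): {ar}\n"
                "Is the translation faithful? Answer accept or reject.")
    kept = []
    for en, ar in pairs:
        verdict = judge(template.format(en=en, ar=ar))
        if verdict.strip().lower().startswith("accept"):
            kept.append((en, ar))
    return kept
```

The strict binary verdict (rather than a graded score) is what drives the heavy attrition reported above: roughly 44% of the 1M OPUS-100 candidates survive.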
This approach ensures that the resulting Arabic instruction corpus maintains semantic fidelity and is well-aligned with the style and complexity of modern instruction-tuning datasets.
Model Training and Merging Strategies
Hala models are trained across four parameter scales (350M, 700M, 1.2B, 9B), using LiquidAI and FANAR architectures as bases. After fine-tuning on the translated Arabic instruction corpus, MergeKit is employed to merge the instruction-tuned checkpoints with their respective base models using slerp at t=0.5. This merging strategy is empirically shown to preserve general capabilities while enhancing Arabic instruction-following performance.
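The slerp merge at t=0.5 interpolates between the two checkpoints along the sphere rather than the straight chord, preserving parameter norms better than plain averaging. A minimal numpy sketch of the operation (MergeKit applies it tensor by tensor; flattened vectors here for brevity):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float = 0.5, eps: float = 1e-8):
    """Spherical linear interpolation between two flattened weight tensors.
    Falls back to linear interpolation when the vectors are near-colinear."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))  # angle between them
    if np.sin(omega) < eps:                  # near-colinear: slerp is ill-defined
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

At t=0.5 the merge weights the instruction-tuned and base checkpoints equally, which matches the report's goal of balancing Arabic specialization against base-model generality.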
Evaluation and Results
Hala models are evaluated on a suite of Arabic-centric benchmarks, including AlGhafa, AraTrust, ArabicMMLU, ArbMMLU-HT, EXAMS, and MadinahQA, using the LightEval framework with vLLM for efficient inference. The evaluation protocol aligns with the Open-Arabic-LLM-Leaderboard (OALL) task selection.
Key findings include:
- Nano Regime (≤2B): Hala-1.2B achieves the highest average score in its class, outperforming its LiquidAI base by +5.1 points. Hala-350M and Hala-700M also show consistent improvements over their respective bases.
- Small Regime (7B–9B): Hala-9B surpasses the previous state-of-the-art (FANAR-1-9B-Instruct) by +0.7 points on the average metric, while maintaining competitive scores on individual tasks.
- Translation Quality: The FP8-quantized teacher matches or slightly exceeds the FP16 baseline on BLEU, ROUGE-L, and chrF++ for EN→AR translation of MMLU questions. The lightweight Hala LFM2-1.2B translator achieves a BLEU of 48.2 (vs. 16.0 for the base), ROUGE-L of 25.1 (+5.9), and chrF++ of 64.2 (+21.0), indicating substantial gains in translation fidelity for instruction-style data.
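For intuition on the chrF++ numbers above: chrF-style scores reward character n-gram overlap, which is forgiving of Arabic's rich morphology. A simplified pure-Python sketch of the character part of the metric (the reported chrF++ additionally mixes in word n-grams; the max_n=6, β=2 constants follow common defaults, not values stated in the paper):

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and
    recall, scaled to 0-100. Whitespace is stripped before counting."""
    def ngrams(text, n):
        text = text.replace(" ", "")
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        if not hyp or not ref:
            continue  # no n-grams of this order in one side
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```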
Resource Efficiency
All models were trained within a budget of \$1,000 on 8×H100-SXM GPUs, with dataset translation performed on 12×A100 GPUs at an additional cost of \$500. This demonstrates the feasibility of scaling language-centric models under constrained compute budgets.
Implications and Future Directions
The Hala report demonstrates that language-centric modeling, when combined with efficient translation pipelines and robust merging strategies, can yield state-of-the-art Arabic instruction and translation models across a range of parameter scales. The approach is particularly effective in low-resource and compute-constrained settings, providing a practical alternative to breadth-first multilingual scaling.
The open release of models, data, and training recipes is likely to catalyze further research in Arabic NLP, enabling reproducibility and downstream applications. The methodology is extensible to other underrepresented languages, provided that high-fidelity translation pipelines and quality control mechanisms are in place.
Future work may explore:
- Enhanced dialectal coverage and cultural alignment through targeted data augmentation and reward modeling.
- Integration of multimodal and domain-specific data to further expand the capabilities of Arabic-centric models.
- Systematic analysis of merging strategies and their impact on cross-lingual transfer and generalization.
Conclusion
Hala establishes a scalable, efficient framework for building Arabic-centric instruction and translation models, achieving consistent improvements over strong baselines in both nano and small parameter regimes. The translate-and-tune pipeline, underpinned by FP8 quantization, robust data filtering, and slerp-based merging, provides a reproducible recipe for advancing language-centric LLMs. The results substantiate the claim that targeted, high-fidelity instruction tuning is a practical and effective complement to broad multilingual approaches, with significant implications for the development of sovereign AI technologies in underrepresented languages.