Aya-Expanse 8B: Scalable Multilingual LLM
- Aya-Expanse 8B is a multilingual, decoder-only Transformer model with 8 billion parameters and 32 layers, setting new benchmarks across 23 languages.
- It integrates innovations such as SwiGLU activation, RoPE embeddings, and FlashAttention to optimize memory efficiency and boost inference performance.
- The model employs advanced pretraining, instruction fine-tuning, and post-training techniques like data arbitrage, DPO, and model merging to enhance multilingual adaptation and safety.
Aya-Expanse 8B is a multilingual, decoder-only Transformer-based LLM that has established new benchmarks for performance and versatility across diverse natural language processing tasks in 23 languages. It integrates architectural innovations, advanced multilingual pretraining, preference optimization, and model merging strategies to surpass contemporaneous models in its parameter class. The model serves as a foundation for multilingual research in safety, fairness, compression, and adaptation, spanning translation, grammatical error correction, and knowledge unlearning.
1. Model Architecture
Aya-Expanse 8B is instantiated as a decoder-only Transformer network comprising approximately 8 billion parameters, with 32 transformer layers each containing 32 self-attention heads (Dang et al., 2024, Farashah et al., 9 Jan 2026, Moslem et al., 26 Oct 2025). The hidden dimension is 6,144, and the feed-forward inner dimension is 24,576. SwiGLU activation and grouped-query attention are employed for parameter efficiency and memory reduction at long context windows. Rotary position embeddings (RoPE) are integrated for context extrapolation (Dang et al., 2024). Vocabulary size is approximately 128,000 subword tokens via SentencePiece tokenization, supporting multilingual coverage.
Layer normalization is applied in every block. The architecture incorporates FlashAttention for efficient inference (Kovalchuk et al., 18 Sep 2025). Model weights utilize FP16/bfloat16 precision, and the model fits into ~32 GB of GPU memory for inference at an 8K-token context length.
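Two of the architectural components named above, SwiGLU and rotary position embeddings, are compact enough to sketch directly. The following NumPy snippet (toy dimensions, not the production kernels) shows the SwiGLU feed-forward gating and the pairwise rotation that RoPE applies to query/key features:

```python
import numpy as np

def swiglu(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: down-project the product of a
    SiLU-gated branch and a linear branch."""
    silu = lambda z: z / (1.0 + np.exp(-z))  # SiLU (swish) activation
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

def rope(x, positions, base=10000.0):
    """Rotary position embedding: rotate consecutive feature pairs by
    position-dependent angles, enabling relative-position attention."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Toy dimensions: hidden=8, ffn=16, seq=4
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = swiglu(x, rng.standard_normal((8, 16)),
              rng.standard_normal((8, 16)),
              rng.standard_normal((16, 8)))
q = rope(x, np.arange(4))
print(y.shape, q.shape)  # (4, 8) (4, 8)
```

Because RoPE is a pure rotation, it preserves vector norms and acts as the identity at position 0, which is what makes it extrapolate to longer contexts than those seen in training.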
2. Pretraining and Instruction Fine-Tuning
Aya-Expanse 8B is pretrained on a balanced mixture of multilingual corpora, including CommonCrawl-derived datasets (mC4, CC100), Wikipedia, filtered web text, GitHub code, books, and regional news corpora. Coverage spans over 100 languages during initial pretraining, with special attention to low-resource languages via multilingual arbitrage—oversampling and reward-based selection from teacher model pools (Dang et al., 2024).
Instruction fine-tuning leverages the Aya Dataset, which encompasses over 1 million prompt–response pairs across more than 50 languages and covers diverse instruction types such as QA, summarization, and reasoning (Farashah et al., 9 Jan 2026). The AdamW optimizer is used with a peak learning rate of 1e-4 for pretraining and 2e-5 for fine-tuning, batch sizes of 1,024 at sequence length 512, and training distributed over hundreds of A100/TPUv4 nodes using mixed precision.
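The optimizer settings above can be bundled into a learning-rate schedule sketch. Only the peak rates come from the reported setup; the linear-warmup/cosine-decay shape and the warmup length are assumptions for illustration:

```python
import math

def lr_schedule(step, total_steps, peak_lr, warmup_steps=1000):
    """Linear warmup to peak_lr, then cosine decay to zero.
    (Warmup length and decay shape are illustrative, not documented.)"""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak rates reported for Aya-Expanse 8B:
# 1e-4 (pretraining), 2e-5 (fine-tuning)
print(lr_schedule(1000, 100_000, 1e-4))  # peak of the schedule: 1e-04
```

The schedule hits its peak exactly at the end of warmup and decays smoothly to zero at `total_steps`.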
3. Post-Training Techniques: Arbitrage, Preference Optimization, and Model Merging
Aya-Expanse employs three post-training innovations for multilingual robustness (Dang et al., 2024):
- Multilingual Data Arbitrage: Uses multiple “teacher” models to generate candidate completions per language; an internal reward model selects the most plausible for each prompt, thus improving synthetic data quality and supporting low-resource languages.
- Iterative Multilingual Preference Training: Direct Preference Optimization (DPO) aligns model outputs to human-rating preferences in multiple languages via repeated offline and online iterations. This involves scalar reward heads and temperature-tuned sigmoidal preference losses.
- Model Merging: Weighted interpolation of distinct model checkpoints—across different language clusters—enhances robustness and consistency. Linear weight averaging outperforms more complex strategies, and merging is scheduled after arbitrage and each DPO iteration.
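The DPO objective in the second bullet can be written compactly: it pushes the policy's log-probability margin between chosen and rejected responses above the reference model's margin, scaled by a temperature β and squashed through a sigmoid. A minimal single-pair sketch (β value illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    beta is the temperature of the sigmoidal preference loss."""
    margin = (logp_chosen - logp_rejected) - (ref_logp_chosen - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Policy prefers the chosen response more than the reference does -> loss
# below log(2); a zero margin gives exactly log(2).
print(dpo_loss(-1.0, -5.0, -2.0, -3.0))
```

In the iterative scheme described above, each round regenerates preference pairs with the current policy (online) or reuses held-out pairs (offline) and re-optimizes this loss.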
Each innovation demonstrated incremental improvements in multilingual evaluation: arbitrage (+9.1pp win-rate), merging (+5.1pp), and iterative DPO (+7.1pp) (Dang et al., 2024).
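Linear weight averaging, reported above to outperform more complex merging schemes, is just a per-parameter weighted mean over checkpoints. A sketch over plain dicts of arrays (real checkpoints would be framework state dicts with matching keys):

```python
import numpy as np

def merge_checkpoints(checkpoints, weights=None):
    """Linear model merging: theta = sum_i w_i * theta_i.
    `checkpoints` is a list of {param_name: array} dicts, identical keys."""
    if weights is None:  # default to uniform averaging
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged

# Two toy "language cluster" checkpoints, merged uniformly
a = {"w": np.array([1.0, 2.0])}
b = {"w": np.array([3.0, 4.0])}
print(merge_checkpoints([a, b])["w"])  # [2. 3.]
```

Non-uniform `weights` allow up-weighting checkpoints from language clusters that matter more for the target deployment.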
4. Evaluation and Benchmarking
Aya-Expanse 8B is evaluated on the m-ArenaHard dataset, which translates Arena-Hard-Auto prompts into 23 languages for challenging instruction-following and knowledge tasks (Dang et al., 2024). Pairwise win rates are assessed via GPT-4o as the judge in head-to-head comparisons with major models:
| Comparison | Wins (%) | Losses (%) | Ties (%) |
|---|---|---|---|
| Aya 8B vs Llama 3.1 8B | 70.6 | 27.7 | 1.7 |
| Aya 8B vs Gemma 2 9B | 60.4 | 39.1 | 0.5 |
| Aya 8B vs Mistral 8B | 55.7 | 42.7 | 1.6 |
| Aya 8B vs Qwen 2.5 7B | 42.7 | 35.0 | 22.3 |
On academic benchmarks (discriminative tasks, MMLU, MGSM math, machine translation via FLORES-200), Aya-Expanse 8B matches or outperforms peers in the 8B parameter class (Dang et al., 2024).
5. Efficient Adaptation and Compression
Aya-Expanse 8B supports efficient domain adaptation and model compression. In multilingual GEC tasks (OmniGEC corpus, 11 languages), LoRA adapters with rank=8 and α=16 fine-tune only the attention/FFN projections and achieve state-of-the-art GLEU scores, especially in low-resource languages (e.g., Estonian: +8.25 GLEU over baseline) (Kovalchuk et al., 18 Sep 2025).
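LoRA keeps each frozen base projection W and learns a low-rank update scaled by α/r, so only r·(d_in + d_out) parameters train per projection instead of d_in·d_out. A sketch with the rank and α reported above (dimensions and init are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, rank=8, alpha=16):
    """y = x W + (alpha/rank) * x A B, with W frozen and only A, B trained.
    A: (d_in, rank), B: (rank, d_out)."""
    return x @ W + (alpha / rank) * (x @ A @ B)

d_in, d_out, rank = 64, 64, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out))   # frozen base weight
A = rng.standard_normal((d_in, rank))    # trainable down-projection
B = np.zeros((rank, d_out))              # zero init: update starts as a no-op
x = rng.standard_normal((2, d_in))
print(np.allclose(lora_forward(x, W, A, B), x @ W))  # True at init
```

Here the adapter trains 8·(64+64) = 1,024 parameters per projection versus 4,096 for the full matrix, which is why LoRA scales to fine-tuning 8B-parameter models on modest hardware.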
For translation tasks (Czech→German, English→Egyptian Arabic), iterative layer pruning via leave-one-out chrF++ degradation enables a reduction from 8.03B to 6.28B parameters while retaining ≥98% of baseline COMET quality and delivering a 28% speedup. Unexpectedly, pruning down to 16 layers (4.54B parameters) outperformed the baseline for Eng→ARZ after strong fine-tuning (Moslem et al., 26 Oct 2025). Layer pruning, combined with vLLM inference and 4-bit quantization, further improves resource usage and throughput.
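The leave-one-out pruning loop can be sketched generically: repeatedly evaluate the model with each remaining layer removed, drop the layer whose removal degrades the quality metric (chrF++ in the paper) least, and stop at a target depth. The `score_fn` below is a stand-in for a real translation-quality evaluation:

```python
def prune_layers(layers, score_fn, target_depth):
    """Greedy leave-one-out pruning: at each step, remove the layer whose
    absence costs the least according to score_fn (higher = better)."""
    layers = list(layers)
    while len(layers) > target_depth:
        best_score, best_idx = float("-inf"), None
        for i in range(len(layers)):
            candidate = layers[:i] + layers[i + 1:]
            s = score_fn(candidate)
            if s > best_score:
                best_score, best_idx = s, i
        layers.pop(best_idx)
    return layers

# Toy score: pretend only the first and last layers matter, and shorter
# stacks are slightly cheaper; middle layers get pruned first.
important = {0, 31}
score = lambda ls: sum(1.0 for l in ls if l in important) - 0.01 * len(ls)
kept = prune_layers(range(32), score, target_depth=16)
print(len(kept), 0 in kept, 31 in kept)  # 16 True True
```

Each outer iteration costs one evaluation per remaining layer, so pruning a 32-layer stack to 16 layers requires a few hundred evaluations; the paper's chrF++-on-a-dev-set metric keeps each evaluation cheap relative to full benchmarking.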
6. Safety, Fairness, and Unlearning in Multilingual Contexts
Aya-Expanse 8B is a focus for research on data and concept unlearning in multilingual settings (Farashah et al., 9 Jan 2026). Unlearning objectives include GradDiff (gradient difference), GradDiff-KL (adding a KL regularizer), and NPO (negative preference optimization) to erase specific knowledge or stereotypes from model responses.
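GradDiff trades gradient ascent on the forget set against descent on the retain set, and GradDiff-KL adds a KL term anchoring the model to its pre-unlearning reference. A sketch of the scalar objectives (the weightings λ and γ are hypothetical hyperparameters, not values from the paper):

```python
def graddiff_loss(forget_nll, retain_nll, lam=1.0):
    """Gradient-difference unlearning: maximize NLL on the forget set
    (gradient ascent) while preserving NLL on the retain set."""
    return -forget_nll + lam * retain_nll

def graddiff_kl_loss(forget_nll, retain_nll, kl_to_reference,
                     lam=1.0, gamma=1.0):
    """GradDiff-KL: adds a KL regularizer toward the pre-unlearning model
    to limit collateral damage to retained knowledge."""
    return -forget_nll + lam * retain_nll + gamma * kl_to_reference

# Lower is better: a rising forget-set NLL (more forgetting) lowers the loss
print(graddiff_loss(forget_nll=2.0, retain_nll=1.0) >
      graddiff_loss(forget_nll=5.0, retain_nll=1.0))  # True
```

NPO replaces the unbounded ascent term with a preference-style bounded loss, which is consistent with the finding below that it forgets with less spillover.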
Experiments demonstrate:
- Unlearning is most effective in the language it is applied to; cross-lingual leakage is limited but asymmetric (e.g., English unlearning leaks more into Russian than the reverse).
- Syntactic similarity is the key predictor for cross-language transfer in forgetting (r≈0.35–0.40).
- NPO best isolates and forgets, with minimal spillover, and retains overall utility (perplexity, truth ratio).
- A single-language unlearning pass does not reliably purge sensitive data or biases from other languages, implying the necessity for joint, multilingual unlearning, especially for GDPR compliance and fairness.
- Japanese exhibits disproportionate sensitivity in concept unlearning spillover, indicated by increased neutral answers and perplexity jumps.
7. Practical Deployment and Limitations
Aya-Expanse 8B supports inference at context lengths up to 8K tokens; the 16-bit weights occupy ~16 GB, with a further ~10–15 GB for activations and KV cache. It is released as open weights for further research and quantization. In real-world deployment, context window size, quantization, and layer-pruning strategies trade off utility against efficiency. LoRA adaptation is recommended for resource-constrained fine-tuning.
Model performance may degrade in languages unsupported during pretraining unless supplemented by high-quality synthetic or silver-standard data. The effectiveness of pruning and adaptation varies by language resource level, domain specificity, and benchmark. Future directions include parameter-efficient unlearning, improved cross-lingual fairness methods, and nuanced language adaptation objectives (Farashah et al., 9 Jan 2026).
Aya-Expanse 8B combines scalable multilingual language modeling with state-of-the-art task performance, adaptive fine-tuning, efficient inference, and nuanced approaches to safety and fairness, providing a reference backbone for multilingual research and deployment across the global AI landscape (Dang et al., 2024, Farashah et al., 9 Jan 2026, Moslem et al., 26 Oct 2025, Kovalchuk et al., 18 Sep 2025).