
AfriqueLLM: Open LLMs for African Languages

Updated 17 January 2026
  • AfriqueLLM is a suite of open LLMs specifically adapted for African languages through continued pre-training on diverse, task-specific data mixtures.
  • The methodology leverages architectural adaptations and an optimized data mix—including native, code, math, and synthetic texts—to significantly enhance task performance.
  • Empirical evaluations on translation, reasoning, and classification benchmarks demonstrate notable performance gains, establishing AfriqueLLM as a reproducible reference for African NLP.

AfriqueLLM is a suite of open LLMs adapted for African languages via continued pre-training (CPT) on curated, task-diverse data mixtures. The approach systematically investigates how data composition and model architecture affect downstream performance for African languages, addressing the persistent underperformance of open multilingual LLMs relative to proprietary systems on non-English, low-resource benchmarks. Recent work operationalizes AfriqueLLM as a reproducible family of models and an accompanying methodology covering 20–24 African and African-relevant high-resource languages, yielding empirical, architectural, and data-centric insights for pan-African language technology (Yu et al., 10 Jan 2026).

1. Architectural Foundations and Model Selection

AfriqueLLM models are built by adapting strong open-weight multilingual backbones via CPT. The suite encompasses:

  • Llama 3.1 (8B): Dense transformer with rotary positional embeddings (RoPE) and pronounced English/code inductive biases.
  • Gemma 3-4B and Gemma 3-12B: Dense transformers from Google, optimized for multilinguality across 60+ languages.
  • Qwen 3-8B and Qwen 3-14B: Dense (8B) and hybrid MoE (14B active) architectures with 119-language coverage and code/math priors.

Key architectural distinctions include dense versus MoE parametrization, variation in pretraining language coverage, differences in positional embedding schemes, and the use of attention kernel optimizations (e.g., FlashAttention, LIGER) (Yu et al., 10 Jan 2026).

Within a fixed family, larger models generally yield higher accuracy, but cross-family performance is dominated more by base-model inductive priors—particularly on code and reasoning—than simple scale increases. For example, AfriqueQwen-8B overtakes Gemma 3 12B in aggregate downstream accuracy at roughly half the parameter count.

2. Data Mixture Design and Pretraining Regimen

AfriqueLLM CPT is characterized by a meticulously engineered data mixture:

D_{\text{total}} = \alpha D_{\text{native}} + \beta D_{\text{math}} + \gamma D_{\text{code}} + \delta D_{\text{synth}}

Where:

  • $D_{\text{native}}$: 22.8B monolingual tokens in the target African languages plus four high-resource continental lingua francas (English, French, Portuguese, Arabic), balanced with UniMax sampling (high-resource languages capped at 1B tokens; low-resource languages upsampled up to 5×).
  • $D_{\text{math}}$ (~1.07B tokens): Educational mathematics from FineMath-4+.
  • $D_{\text{code}}$ (~0.97B tokens): Python code from CornStack-Python.
  • $D_{\text{synth}}$ (~0.32B tokens): Synthetic translated texts (10 web domains plus math reasoning, generated from high-resource sources).
  • Optionally, ~0.456B tokens of parallel NLLB data are added in some mixtures for ablations.
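The UniMax-style balancing described for $D_{\text{native}}$ can be sketched as a per-language token budget. This is a minimal sketch: the 1B cap and 5× upsampling limit come from the text, while the function name and the language codes in the example are illustrative.

```python
def unimax_budget(token_counts, cap=1_000_000_000, max_upsample=5.0):
    """Assign a per-language token budget: downsample high-resource
    languages to `cap` and upsample low-resource ones by at most
    `max_upsample` epochs (never exceeding the cap)."""
    budgets = {}
    for lang, n in token_counts.items():
        if n >= cap:
            budgets[lang] = cap                          # cap HRLs at 1B
        else:
            budgets[lang] = min(int(n * max_upsample), cap)  # upsample LRLs up to 5x
    return budgets

# Hypothetical raw token counts per language:
counts = {"eng": 5_000_000_000, "swa": 400_000_000, "lin": 50_000_000}
print(unimax_budget(counts))
```

Note that a 400M-token language hits the cap before exhausting its 5× upsampling allowance, so the cap also bounds mid-resource languages.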

Empirical optimization recommends coefficients $\alpha \approx 0.91$, $\beta \approx 0.042$, $\gamma \approx 0.038$, $\delta \approx 0.013$ for best generalization across translation and reasoning benchmarks (Yu et al., 10 Jan 2026). Data-mixture ablations (monolingual-only vs. +code+math vs. +synthetic vs. +parallel) consistently reveal that code and math content are crucial for improving reasoning tasks, while synthetic translations further refine performance.
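The reported coefficients are consistent with the per-component token counts listed above; a short script recovers them from the component sizes (numbers taken from the text, rounding approximate):

```python
# Component token counts (in billions) reported for the AfriqueLLM mixture.
components = {"native": 22.8, "math": 1.07, "code": 0.97, "synth": 0.32}

total = sum(components.values())  # ~25.2B, i.e. roughly the ~26B budget
coeffs = {name: size / total for name, size in components.items()}

for name, c in coeffs.items():
    print(f"{name}: {c:.3f}")
```

Running this reproduces the stated mixture weights to within rounding (native ≈ 0.91, math ≈ 0.042, code ≈ 0.038, synth ≈ 0.013).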

3. Training Configuration and Efficiency

CPT is conducted in high-throughput, large-batch synchronous settings:

  • Hardware: 16× NVIDIA H100 GPUs per run, DeepSpeed ZeRO-1/2, with FlashAttention 3 and sequence-packing for memory/performance.
  • Global batch size: 4M tokens/step.
  • Sequence length: 16k tokens (selected for reasoning and long-context utility; 4k too short, 32k offers diminishing returns).
  • Learning rate peak: $5 \times 10^{-5}$; scheduler: cosine decay with warmup.
  • Precision: fp16/bf16 mixed.
  • Epochs: 1–2 over monolingual, one epoch each for code, math, synthetics, and parallel.
  • Total token budget: ≈26B tokens (6–7k steps/run).
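The schedule above can be sketched as a small function. The peak learning rate and the 6–7k-step horizon come from the text; the warmup length and zero floor are illustrative assumptions, since neither is specified.

```python
import math

def lr_at(step, total_steps=6_500, warmup_steps=200, peak_lr=5e-5, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`.
    Warmup length and LR floor are assumptions, not from the paper."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# LR starts at 0, peaks at 5e-5 after warmup, decays toward 0 by step 6500.
print(lr_at(0), lr_at(200), lr_at(6_500))
```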

Best practice is to replay modest amounts of high-resource language (HRL) data (~4B tokens each for English, French, etc.) to prevent catastrophic forgetting of HRLs, which otherwise degrades English-centric performance by 5–10% post-CPT in some architectures (Yu et al., 10 Jan 2026).
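One way to implement this replay heuristic is as a floor on per-language token budgets, applied after the mixture is otherwise balanced. This is a sketch under stated assumptions: the ~4B replay size is from the text, while the helper name and ISO-style language codes are illustrative.

```python
def add_hrl_replay(budgets, replay_langs=("eng", "fra", "por", "arb"),
                   replay_tokens=4_000_000_000):
    """Raise each HRL's token budget to at least `replay_tokens`,
    leaving all other languages untouched."""
    out = dict(budgets)
    for lang in replay_langs:
        out[lang] = max(out.get(lang, 0), replay_tokens)
    return out

mix = {"eng": 1_000_000_000, "swa": 2_000_000_000}
print(add_hrl_replay(mix))
```

The floor semantics (`max`) mean a language already above the replay quota is not reduced; only under-represented HRLs are topped up.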

4. Evaluation Protocols and Benchmarks

AfriqueLLM models are systematically validated on the AfroBench-Lite suite, comprising:

  • AfriMGSM: 64-language 8-shot mathematical reasoning.
  • AfriMMLU: 64-language 5-shot multiple-choice knowledge QA.
  • AfriXNLI: Cross-lingual NLI.
  • Belebele RC: 122-variant reading comprehension.
  • Flores-200 MT: Many-to-many machine translation.
  • Injongo: 16-language intent classification.
  • SIB-200: 205-language topic classification.

Core metrics include accuracy, macro/micro F1, perplexity, chrF++, and SSA-COMET for translation (Yu et al., 10 Jan 2026).
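Among these metrics, macro F1 averages per-class F1 scores with equal weight, which matters for imbalanced label sets such as SIB-200's 205 topics. A minimal pure-Python sketch of the computation:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Each class contributes equally regardless of its support:
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Micro F1, by contrast, pools true/false positives over all classes and so tracks overall accuracy more closely on skewed distributions.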

5. Empirical Performance and Analytical Insights

AfriqueLLM demonstrates that CPT gains are primarily attributable to data mixture composition rather than raw parameter count or prior multilinguality. Key findings:

  • Monolingual-only CPT dramatically improves translation and NLI but fails on reasoning (AfriMGSM).
  • Addition of code and math recovers and enhances reasoning (AfriMGSM up to +78.8% rel. gain).
  • Synthetic translated data further boosts performance, especially in larger models (CMS mixture).
  • Larger models (≥12B) benefit less from noisy parallel data; high-quality synthetics are preferable.
  • CPT produces minimal gains on unseen languages (Afr-NPT)—benefits are largest on seen (Afr-PT) languages.
  • Qwen architectures, owing to stronger innate reasoning and code capacity, outperform comparably sized Gemma or Llama models post-CPT by up to 15–35% on aggregate downstream metrics.

Ablations indicate that base-model proficiency, especially in reasoning and code, is a more reliable predictor of post-CPT quality than surface-level multilingual coverage.
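The relative gains quoted above (e.g. +78.8% on AfriMGSM) follow the usual convention of percentage improvement over the pre-CPT baseline; a tiny helper makes the arithmetic explicit (the example scores are hypothetical, not from the paper):

```python
def relative_gain(before, after):
    """Percentage improvement of `after` over `before`."""
    return 100.0 * (after - before) / before

# A hypothetical jump from 10.4 to 18.6 accuracy points is a +78.8% relative gain:
print(f"{relative_gain(10.4, 18.6):.1f}%")
```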

6. Deployment, Resources, and Reproducibility

AfriqueLLM checkpoints (4B–14B) along with associated data mixtures, code, and ablation studies are publicly released on Hugging Face (Yu et al., 10 Jan 2026). The models support long-context inference (16k tokens) and demonstrate improvements on document-level translation and knowledge-intensive subtasks. Reproducibility is enhanced through open documentation of hardware, hyperparameters, and data-mixture recipes.

7. Significance and Future Directions

AfriqueLLM establishes that CPT with optimized, task-diverse, and upsampled data mixtures can close the open-source/proprietary gap on African-language tasks. This approach achieves robust performance on translation, NLI, reading comprehension, intent/topic classification, and mathematical reasoning, without requiring fundamental model re-architecture or massively increased scale.

Future work should:

  • Extend to more under-represented African languages beyond current coverage.
  • Systematically explore structured synthetic data for further domain adaptation.
  • Combine CPT with parameter-efficient adaptation methods (e.g., adapters, LoRA) to maximize regional relevance at minimal computational cost.
  • Address benchmark coverage by supporting expansion of AfroBench-Lite and further language/task diversity.
  • Investigate coupling with active learning, high-resource language replay, and curriculum learning to mitigate catastrophic forgetting and maximize transfer to low-resource settings.

AfriqueLLM's release constitutes a reproducible, reference implementation of data- and architecture-centric multilingual CPT for African NLP, and provides technical specification and empirical baselines for subsequent research in African-centric language modeling (Yu et al., 10 Jan 2026).

References (1)
