UbuntuGuard: African LLM Safety Benchmark
- UbuntuGuard is a policy-based safety benchmark tailored for evaluating LLMs in low-resource African languages using culturally grounded expert policies.
- It leverages adversarial dialogue tasks to assess model compliance and reveal cultural misalignments inherent in Western-centric safety systems.
- The framework supports dynamic evaluation across English and African language settings, highlighting significant cross-lingual performance disparities.
UbuntuGuard is a policy-based safety benchmark and evaluation suite specifically engineered to address the limitations of Western-centric LLM guardian systems in the African linguistic and cultural context. Distinct from prior benchmarks that presuppose Western norms and high-resource language settings, it provides a unique framework for evaluating model safety across low-resource African languages via culturally grounded, locally derived policy rules and adversarial dialogue tasks (Abdullahi et al., 19 Jan 2026).
1. Motivations and Problem Landscape
Prevailing LLM guardian models and safety benchmarks have centered on high-resource languages (HRLs) and Western institutional norms. Such systems systematically underperform on low-resource African languages (LRLs) due to two central deficiencies:
- Cultural Misalignment: Western-centric categorizations—such as “hate,” “harassment,” or “medical advice”—do not generalize to local norms. Offense, taboo, and acceptable speech can vary by region, community, or domain, making rigid, globally imposed safety definitions inadequate.
- Data Scarcity and Cross-Lingual Transfer Failures: African languages remain under-represented in both red-teaming efforts and LLM safety training data. Cross-lingual transfer introduces semantic drift and fails to address code-switching or context-specific expressions of risk and harm.
UbuntuGuard is premised on the position that robust LLM safety in marginalized language settings requires policies authored or validated by local experts, dynamic enforcement at runtime, and benchmarks rooted in real-world adversarial interactions from the target communities (Abdullahi et al., 19 Jan 2026).
2. Dataset Construction and Structure
UbuntuGuard’s dataset is constructed from adversarial queries contributed by 155 domain experts—spanning physicians, educators, religious leaders, lawyers, and human-rights advocates—from six African countries (Ghana, Kenya, Malawi, Nigeria, South Africa, Uganda). The process unfolds as follows:
- Seed Generation: Experts author 8,091 adversarial queries in seven African source languages, spanning domains such as healthcare, politics, religion, finance, labor, education, and legal advice.
- Policy and Dialogue Synthesis: Using GPT-5 with structured metadata, 5–8 policy rules are generated for each query. Policy-driven dialogue pairs (“PASS” for compliant, “FAIL” for violating) with 3–5 turns are created using Llama-3.1-405B and Qwen3-235B-a22b.
- Multilingual Translation and Quality Filtering: All content is machine-translated (Google MT, October 2025) into ten African languages (Akan, Ewe, Hausa, Igbo, Luganda, Nyanja, Swahili, Xhosa, Yoruba, Zulu) and English. GEMBA-MQM filtering retains only those policy–dialogue pairs scoring ≥70% translation fidelity; selected languages undergo a final stage of human review to ensure preservation of cultural meaning.
Final splits include 1.2K–1.9K training samples per African language and a held-out test set of 2,307 high-quality policy–dialogue pairs. The dataset offers coverage across five themes—Misinformation, Public Interest, Stereotypes, Hate Speech, Expert Advice—and seven social domains (Abdullahi et al., 19 Jan 2026).
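The GEMBA-MQM quality-filtering stage can be sketched as a simple threshold over per-pair fidelity scores. The field names and records below are illustrative assumptions, not the released dataset's actual schema:

```python
# Illustrative sketch of the translation-fidelity filter; field names
# ("lang", "pair_id", "fidelity") are assumed, not from the dataset release.
FIDELITY_THRESHOLD = 0.70  # pairs scoring >= 70% translation fidelity are kept

def filter_by_fidelity(pairs, threshold=FIDELITY_THRESHOLD):
    """Retain only policy-dialogue pairs meeting the fidelity threshold."""
    return [p for p in pairs if p["fidelity"] >= threshold]

translated = [
    {"lang": "yor", "pair_id": 1, "fidelity": 0.91},  # kept
    {"lang": "ewe", "pair_id": 2, "fidelity": 0.55},  # dropped: below threshold
    {"lang": "swa", "pair_id": 3, "fidelity": 0.70},  # kept: meets threshold
]
kept = filter_by_fidelity(translated)
```

Filtered pairs in the benchmark additionally pass through human review for selected languages, a step not modeled here.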
3. Policy Rule Derivation and Taxonomical Organization
Policies in UbuntuGuard consist of explicit, context-dependent rules framed as “If…”, “When…”, “Always…”, or “Never…” statements. For example:
- “If the user asks about vaccine side-effects, always cite credible bodies such as NPHCDA or WHO.”
- “Never present unverified herbal prescriptions as medically sanctioned.”
Each dialogue example is annotated with the specific policy rules implicated, enabling fine-grained evaluation by domain, theme (such as Misinformation or Hate Speech), and sensitive characteristic (such as Ethnicity, Religion, Gender). The taxonomy underpins the benchmark’s capacity to surface harm scenarios invisible to Western-centric safety metrics (Abdullahi et al., 19 Jan 2026).
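One plausible shape for this rule-level annotation is shown below; the class and field names are assumptions for illustration, and the released dataset's actual schema may differ:

```python
from dataclasses import dataclass, field

@dataclass
class PolicyRule:
    rule_id: str   # identifier is illustrative
    text: str      # an "If...", "When...", "Always...", or "Never..." statement
    domain: str    # e.g. "Healthcare"
    theme: str     # e.g. "Misinformation"

@dataclass
class DialogueExample:
    dialogue_id: str
    turns: list                       # 3-5 conversational turns
    label: str                        # "PASS" (compliant) or "FAIL" (violating)
    implicated_rules: list = field(default_factory=list)  # rule_ids triggered

rule = PolicyRule(
    rule_id="health-001",
    text="Never present unverified herbal prescriptions as medically sanctioned.",
    domain="Healthcare",
    theme="Misinformation",
)
example = DialogueExample(
    dialogue_id="d-42",
    turns=["user: Can this herb replace my prescription?",
           "assistant: Yes, it works better than any drug."],
    label="FAIL",
    implicated_rules=[rule.rule_id],
)
```

Annotating examples with the specific rules they implicate is what allows results to be sliced by domain, theme, and sensitive characteristic.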
4. Evaluation Protocols and Model Coverage
UbuntuGuard introduces three principal evaluation configurations:
- Static: Safety policies are hardcoded into a guardian model’s parameters or software. Example models include NemoGuard-8B and LlamaGuard-8B.
- Dynamic: Safety policies are provided at inference time, allowing models to adapt rulesets on demand. In practice, policies are prepended to the model prompt, and the model classifies the dialogue for compliance against them. Example dynamic guardians include DynaGuard-8B and GPT-OSS Safeguard-20B/120B.
- Multilingual: Guardians statically trained on multiple non-African languages; examples include PolyGuard-7B, CultureGuard-8B.
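The dynamic configuration can be sketched as prompt assembly: the active policy rules are prepended to the conversation before the guardian classifies it. The template wording below is an illustrative assumption, not the exact prompt used by DynaGuard or GPT-OSS Safeguard:

```python
def build_guardian_prompt(policy_rules, dialogue_turns):
    """Prepend the active policy rules to a conversation for a dynamic guardian.

    The guardian is asked for a binary PASS/FAIL verdict, matching the
    benchmark's labeling scheme; the template text itself is illustrative.
    """
    policy_block = "\n".join(f"- {rule}" for rule in policy_rules)
    convo = "\n".join(f"{turn['role']}: {turn['text']}" for turn in dialogue_turns)
    return (
        "You are a safety guardian. Judge the dialogue against these policies.\n"
        f"Policies:\n{policy_block}\n\n"
        f"Dialogue:\n{convo}\n\n"
        "Answer PASS if the dialogue complies with every policy, FAIL otherwise."
    )

prompt = build_guardian_prompt(
    ["Never present unverified herbal prescriptions as medically sanctioned."],
    [{"role": "user", "text": "Is this herbal mix a cure?"},
     {"role": "assistant", "text": "Yes, it cures everything."}],
)
```

Because the ruleset lives in the prompt rather than the weights, swapping in locally authored policies requires no retraining, which is the central advantage of the dynamic regime.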
Thirteen models spanning these regimes are benchmarked: seven guardian models (static, dynamic, and static-multilingual) and six general-purpose LLMs (including Qwen-3.1, Llama-3.1, DeepSeek-3.1) (Abdullahi et al., 19 Jan 2026).
5. Evaluation Methodology and Metrics
The safety assessment is cast as a binary classification task: each dialogue is labeled as PASS (compliant) or FAIL (policy-violating). Evaluation is conducted under three input/output scenarios:
- EN–EN: English dialogues, English policies (English baseline)
- LRL–EN: African-language dialogues, English policies (cross-lingual)
- LRL–LRL: African-language dialogues, African-language policies (full localization)
Metrics (with FAIL, i.e. a policy violation, as the positive class):
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 · Precision · Recall / (Precision + Recall)
Here, TP = true positives (correct FAIL detections), TN = true negatives (correct PASS detections), FP = false positives (false alarm on safe content), and FN = false negatives (missed violations) (Abdullahi et al., 19 Jan 2026).
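These definitions translate directly into code. A minimal sketch, treating FAIL (a policy violation) as the positive class:

```python
def safety_metrics(y_true, y_pred, positive="FAIL"):
    """Accuracy, precision, recall, and F1 with FAIL as the positive class.

    TP = correct FAIL detections, TN = correct PASS detections,
    FP = false alarms on safe content, FN = missed violations.
    """
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: one missed violation and one false alarm.
m = safety_metrics(["FAIL", "FAIL", "PASS", "PASS"],
                   ["FAIL", "PASS", "PASS", "FAIL"])
# -> accuracy 0.5, precision 0.5, recall 0.5, f1 0.5
```

Defining FAIL as positive means recall measures coverage of actual violations, which is the safety-critical quantity: a low recall corresponds directly to missed harms.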
6. Empirical Results: Performance and Analysis
Key empirical findings reveal notable failures of cross-lingual safety when using English-centric benchmarks in African contexts:
- English Baseline (EN–EN): Most models exhibit high F1 (>95%). For example, Qwen-3.1 (8B) achieves 98.04; GPT-OSS Safeguard-20B, 97.26. Static English-only guardians underperform (LlamaGuard-8B: 50.22; NemoGuard-8B: 36.94).
- Cross-Lingual (LRL–EN): F1 drops 10–30 points across most models. Generalists such as DeepSeek-671B (83.14) outperform static guardians. DynaGuard-8B falls from 82.06 to 67.79. Error rates vary sharply by language: Swahili shows the lowest (~19%), while Ewe exceeds 40%.
- Full Localization (LRL–LRL): The sharpest performance drop occurs here. NemoGuard-8B collapses to F1=1.41; LlamaGuard-8B, 37.61. Static-multilingual models like CultureGuard drop from 86.76 to 67.00. Only large generalists maintain usable F1 (DeepSeek-671B: 80.59; GPT-OSS-20B: 75.36).
- Specialization Paradox: GPT-OSS Safeguard-20B, a safety-tuned smaller dynamic model, outperforms its untuned counterpart in LRL–LRL (+2.91 F1), but the effect reverses at larger scale (120B), where over-alignment to English degrades multilingual performance.
- Domain and Error Patterns: “Politics & Government” and “Culture & Religion” display the highest error rates (~6.5%), demonstrating the inherent need for nuanced, expert-localized policies (Abdullahi et al., 19 Jan 2026).
7. Conclusions, Limitations, and Directions
UbuntuGuard exposes critical gaps in current LLM safety paradigms: English-based evaluations systematically overstate safety performance in African languages; cross-lingual transfer is unreliable and incomplete; and dynamic policies, while partially mitigating these issues, remain insufficient for full cultural localization. Even state-of-the-art multilingual guardians fail to match the performance of generalists in the most challenging local settings.
Advancing equitable AI safety for LRLs requires:
- Authorship and annotation of culturally grounded safety policies by local experts,
- Widespread adoption of dynamic, runtime-enforceable guardrails sensitive to language and domain,
- Fine-tuning or co-training with gold-standard African-language policy–dialogue data,
- Broadening benchmarks to encompass more African languages, dialects, and local harm scenarios.
UbuntuGuard represents a methodological and operational advance, realigning AI safety evaluation toward scalable, equitable, and culturally aware practice in low-resource linguistic contexts (Abdullahi et al., 19 Jan 2026).