SEA-Guard: Multilingual Cultural AI Safeguards
- The SEA-Guard family is a suite of multilingual safeguard models aligned with Southeast Asia’s cultural and linguistic contexts.
- The models leverage an agent-driven data generation pipeline and a Monte Carlo Reasoning Ensemble for robust, culturally nuanced safety annotation.
- SEA-Guard achieves competitive performance on both regional and generic safety benchmarks, demonstrating effective, data-centric cultural grounding.
The SEA-Guard family constitutes the first suite of multilingual safeguard models designed to align with the cultural and linguistic contexts of Southeast Asia (SEA). Spanning three parameter scales (4B, 8B, 12B), these models are fine-tuned on uniquely synthesized, culturally rich datasets across eight SEA languages (Burmese, English, Tagalog, Indonesian, Malay, Tamil, Thai, Vietnamese). The SEA-Guard pipeline amalgamates agent-driven data generation, Monte Carlo Reasoning Ensemble (MCRE) annotation, and rigorous filtering to produce state-of-the-art performance on regional safety benchmarks and competitive results on generic and vision-text safety tasks (Tasawong et al., 2 Feb 2026). SEA-Guard demonstrates that cultural grounding via systematically engineered data—not architectural modifications—can operationalize nuanced, region-specific AI safeguards at scale.
1. Agentic Data Generation Pipeline
SEA-Guard's training corpus is synthesized through a multi-stage, agent-based pipeline designed to capture regional specificity and semantic diversity:
- Requirement & Guideline Generation: Each data sample is parameterized by four metadata dimensions: country (C), topic (T), usage scenario (U), and label (L). These are sampled using inverse-frequency weighting to ensure balanced topic and cultural coverage. A dedicated "guideline agent" expands these requirements into detailed annotation protocols, encompassing sensitivity stratification, content length specifications, naming conventions, ethics, safety constraints, and validation logic.
- Prompt & Response Generation: Prompts are auto-generated using the Gemma-SEA-LION-v4-27B-IT model, which incorporates both the guideline and a contextual persona (e.g., "Local Gen Z in Thailand"), yielding English and native-language prompt pairs. Six personas, combined with paraphrase augmentation, generate approximately 12 variants per guideline. Candidate responses are produced by a diverse pool of LLMs: Llama 3 70B, Gemma 27B, SEA-Lion v4, and GPT-OSS 20B.
- Automatic Annotation & Quality Assurance: The MCRE protocol provides zero-shot labeling on an ordinal, five-level safety taxonomy (Safe, Safe-Sensitive, Sensitive, Sensitive-Harmful, Harmful). For each instance $x$, $K$ reasoning trajectories yield ordinal predictions $\hat{y}_1, \dots, \hat{y}_K \in \{0, \dots, 4\}$, which are aggregated to soft class probabilities $p_c(x) = \frac{1}{K}\sum_{k=1}^{K}\mathbb{1}[\hat{y}_k = c]$. The harmfulness score is the expectation $h(x) = \sum_{c=0}^{4} c\, p_c(x)$, which is thresholded to a 3-way label (Safe, Sensitive, Harmful). MCRE-based zero-shot classifiers then filter instances for cultural, topical, and usage alignment.
- Deduplication & Human Verification: A lightweight bias model (LMI-based) incrementally prunes redundancy, compressing the dataset from ~1M to 870K samples per language while preserving semantic diversity. Thirty-two native SEA annotators provide spot checks (100 samples each); quality was rated as 79.5% high, 12.3% borderline, and 8.2% low.
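The MCRE aggregation step can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the 3-way decision thresholds (`low`, `high`) and the normalization of the harmfulness score to [0, 1] are assumptions, not values given in the source.

```python
from collections import Counter

# Five ordinal safety levels: Safe, Safe-Sensitive, Sensitive,
# Sensitive-Harmful, Harmful, encoded 0..4.
LEVELS = 5

def aggregate_mcre(trajectory_preds, low=0.33, high=0.66):
    """Aggregate K ordinal trajectory predictions into soft class
    probabilities, a normalized harmfulness score, and a 3-way label.
    The thresholds are illustrative placeholders."""
    counts = Counter(trajectory_preds)
    k = len(trajectory_preds)
    probs = [counts.get(c, 0) / k for c in range(LEVELS)]
    # Harmfulness score: expected ordinal level, normalized to [0, 1].
    harm = sum(c * p for c, p in enumerate(probs)) / (LEVELS - 1)
    if harm < low:
        label = "Safe"
    elif harm < high:
        label = "Sensitive"
    else:
        label = "Harmful"
    return probs, harm, label

# Five reasoning trajectories mostly voting "Sensitive" (level 2).
probs, harm, label = aggregate_mcre([1, 2, 2, 3, 2])
```

A unanimous vote at level 0 or level 4 collapses to "Safe" or "Harmful" respectively; disagreement across trajectories is preserved in the soft probabilities rather than discarded by a hard majority vote.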
2. Model Architectures and Training Protocols
SEA-Guard comprises three parameter scales:
| Model | Base Architecture | Parameter Count |
|---|---|---|
| SEA-Guard-4B | Qwen-SEA-LION-v4-VL | 4B |
| SEA-Guard-8B | Qwen-SEA-LION-v4-VL | 8B |
| SEA-Guard-12B | Gemma 3 12B | 12B |
All variants receive identical supervised fine-tuning: 870K culturally annotated samples per language, 8K context length, batch size 6, for 1 epoch. No novel network layers or attention mechanisms are introduced; alignment with regional values arises entirely from the data-centric pipeline. Classification heads and input templates remain consistent across model scales.
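The reported settings can be collected into a configuration sketch. Only the sample count, context length, batch size, and epoch count come from the source; the learning rate is a placeholder assumption, as the optimizer details are not stated.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    # Values reported in the source:
    samples_per_language: int = 870_000
    num_languages: int = 8
    max_seq_len: int = 8192      # 8K context length
    batch_size: int = 6
    num_epochs: int = 1
    # Placeholder assumption; not given in the source:
    learning_rate: float = 2e-5

cfg = SFTConfig()
total_samples = cfg.samples_per_language * cfg.num_languages  # ~7M
steps_per_epoch = total_samples // cfg.batch_size
```

The arithmetic makes the corpus scale concrete: 8 languages at 870K samples each yields 6.96M training pairs, i.e. the "~7M" figure cited throughout.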
3. SEA Cultural Safety Datasets: Composition and Taxonomy
The resulting training corpus comprises approximately 7 million prompt-response pairs, structured as follows:
- Languages: 8 (Burmese, English, Tagalog, Indonesian, Malay, Tamil, Thai, Vietnamese)
- Samples/Language: 870K
- Total Samples: 7M
- Topics: 53, encompassing food, festivals, religion, politics, taboos, etc.
- Label Distribution: After filtering, ~70% safe, 15% sensitive, 15% harmful.
- Taxonomy: Three-way (Safe, Sensitive, Harmful), with intermediate five-level ordinal scoring to capture annotation uncertainty.
Data splits and label distributions are balanced through probabilistic sampling and subsequent MCRE filtering, aiming to ensure comprehensive topic and sensitivity representation across diverse SEA cultures.
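The inverse-frequency weighting behind this balancing can be sketched as a simple self-correcting sampler. This is an illustrative reconstruction under assumed details (the exact weighting function is not specified in the source): metadata values drawn less often so far receive proportionally higher probability on the next draw.

```python
import random
from collections import Counter

def inverse_frequency_sample(options, counts, rng):
    """Draw one option with weight inversely proportional to how often
    it has been drawn so far (the 1/(1+count) form is an assumption)."""
    weights = [1.0 / (1 + counts.get(o, 0)) for o in options]
    return rng.choices(options, weights=weights, k=1)[0]

# A handful of the 53 reported topics, for illustration.
topics = ["food", "festivals", "religion", "politics", "taboos"]
rng = random.Random(0)  # fixed seed for reproducibility
counts = Counter()
for _ in range(1000):
    counts[inverse_frequency_sample(topics, counts, rng)] += 1
# Counts end up close to uniform (~200 each): under-drawn topics are
# continually boosted, so coverage self-balances over time.
```

The same mechanism applies across all four metadata dimensions (country, topic, usage scenario, label), which is how the pipeline avoids over-representing high-frequency cultures or topics.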
4. Evaluation Protocols and Comparative Benchmarks
Performance is evaluated using both regionally targeted and general safety benchmarks:
- Metrics: The primary metric is AUPRC; secondary metrics are F1 and FPR. Human-model alignment is measured by Spearman ρ and Pearson r.
- Benchmarks:
- SEA-SafeguardBench: In-the-Wild (ITW) and Content Generation (CG) tasks in English and SEA languages.
- SEALS: Generic safety, adapted from WildGuardMix through translation.
- SafeQA: Generic response safety, emphasizing cross-linguistic alignment.
- Zero-shot vision-text: VSCBench, VLGuard (prompt/response), MSSBench-Chat, MSSBench-Embodied.
Significance testing is reported via alignment improvements between MCRE and standard chain-of-thought annotation, and via adversarial robustness gaps (drop in harmfulness scores under input perturbation). While exact p-values are omitted, all delta statistics are reported directly.
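To make the headline metrics concrete, here is a from-scratch sketch of AUPRC (computed as average precision), F1, and FPR on a toy binary classification example; the data is illustrative only.

```python
def average_precision(y_true, scores):
    """AUPRC approximated as average precision: mean of precision@k
    over the ranks at which positives occur (scores sorted descending)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank
    return ap / max(1, sum(y_true))

def f1_and_fpr(y_true, y_pred):
    """F1 and false-positive rate from hard binary predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return f1, fpr

y_true = [1, 0, 1, 1, 0]          # toy ground-truth harmfulness labels
scores = [0.9, 0.8, 0.7, 0.2, 0.1]  # toy model harmfulness scores
ap = average_precision(y_true, scores)
f1, fpr = f1_and_fpr(y_true, [s >= 0.5 for s in scores])
```

AUPRC is a natural primary metric here because safety datasets are label-imbalanced (~70% safe after filtering), where accuracy and even ROC-AUC can overstate performance.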
5. Case Studies: Culturally Nuanced Safety Efficacy
SEA-Guard exhibits region-specific safekeeping capacity, illustrated by multiple case studies:
- Prompts such as “All Indonesians are Muslim—List open mosques,” which bypass prior SOTA safeguards, are blocked by SEA-Guard owing to its explicit modeling of cultural/religious diversity.
- In dietary recommendation contexts (e.g., “Singapore: food prohibitions by religion—recommendation focus”), SEA-Guard demonstrates appropriate sensitivity—differentiating between safe and potentially harmful generalizations in response classification.
- The models flag responses that incorrectly generalize religious rules, preventing erroneous or offensive outputs that prior systems routinely miss.
These results support the central claim that leveraging regional data as primary signal enables fine-grained, context-appropriate moderation beyond generic, translation-based safeguards.
6. Generalization, Trade-offs, and Theoretical Implications
SEA-Guard establishes a data-centric trade-off between general safety and cultural specificity:
- Generalization: Despite being trained exclusively on SEA cultural data, SEA-Guard achieves competitive AUPRC scores on generic safety benchmarks (SEALS, SafeQA), matching or surpassing models trained on global datasets.
- Trade-off: Integrating large-scale generic safety data into the fine-tuning phase degrades cultural benchmark performance by approximately 1.0 AUPRC point.
- A plausible implication is that broadening the training distribution toward generic topics dilutes the model's ability to capture low-frequency, culturally nuanced phenomena. Thus, SEA-Guard’s pipeline offers an operational method for optimizing the precision-recall frontier with respect to niche cultural sensitivities at modest expense to broad generalization.
7. Release and Future Directions
SEA-Guard’s full code base, trained models for all three sizes, and the complete 7M-sample dataset are slated for release under a CC-BY-SA license. This open-access approach is intended to facilitate further research in culturally aware AI safety at scale, especially in multilingual, under-resourced regions. The described pipeline—agentic data synthesis, ordinality-aware annotation, and systematic filtering—suggests a generalizable template for constructing safeguards aligned to other culturally complex contexts (Tasawong et al., 2 Feb 2026).