X-Guard: Multilingual Guard Agent for Content Moderation

Published 11 Apr 2025 in cs.CR and cs.AI | (2504.08848v1)

Abstract: LLMs have rapidly become integral to numerous applications in critical domains where reliability is paramount. Despite significant advances in safety frameworks and guardrails, current protective measures exhibit crucial vulnerabilities, particularly in multilingual contexts. Existing safety systems remain susceptible to adversarial attacks in low-resource languages and through code-switching techniques, primarily due to their English-centric design. Furthermore, the development of effective multilingual guardrails is constrained by the scarcity of diverse cross-lingual training data. Even recent solutions like Llama Guard-3, while offering multilingual support, lack transparency in their decision-making processes. We address these challenges by introducing X-Guard agent, a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts. X-Guard effectively defends against both conventional low-resource language attacks and sophisticated code-switching attacks. Our approach includes: curating and enhancing multiple open-source safety datasets with explicit evaluation rationales; employing a jury of judges methodology to mitigate individual judge LLM provider biases; creating a comprehensive multilingual safety dataset spanning 132 languages with 5 million data points; and developing a two-stage architecture combining a custom-finetuned mBART-50 translation module with an evaluation X-Guard 3B model trained through supervised finetuning and GRPO training. Our empirical evaluations demonstrate X-Guard's effectiveness in detecting unsafe content across multiple languages while maintaining transparency throughout the safety evaluation process. Our work represents a significant advancement in creating robust, transparent, and linguistically inclusive safety systems for LLMs and its integrated systems.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a modular multilingual safety agent that integrates language detection, customized mBART-50 translation, and a GRPO-trained safety evaluator.
It generates over 5 million synthetic data points in 132 languages to improve detection accuracy, including 83% accuracy in handling code-switching attacks.
The results show enhanced performance over English-centric models, achieving 70.38% overall accuracy and 97.20% accuracy for English inputs.

X-Guard: Multilingual Guard Agent for Content Moderation

The paper "X-Guard: Multilingual Guard Agent for Content Moderation" introduces X-Guard, a transparent multilingual safety agent designed to enhance content moderation in LLMs across diverse linguistic contexts. This agent aims to address vulnerabilities in existing safety systems, especially under multilingual and code-switching scenarios, and demonstrates significant improvements in detecting unsafe content.

Introduction

The adoption of LLMs within critical domains necessitates robust safety mechanisms, particularly in multilingual environments where existing guardrails often fail. X-Guard enhances content moderation by integrating three distinct modules: language detection and routing, translation, and safety evaluation. The paper addresses the drawbacks of English-centric models by curating a comprehensive open-source safety dataset, establishing a jury mechanism to minimize biases, and developing a robust two-stage architecture using mBART-50 and a GRPO-trained X-Guard model.

Figure 1: An image showing the X-Guard agent in action, which takes multilingual language (Nepali) as input and sends it to both the translation module and safety module for safety evaluation.

Data Collection and Synthetic Data Generation

The creation of synthetic datasets is fundamental for training effective moderation systems. The authors leveraged uncensored LLMs to generate harmful responses that were subsequently processed by an ensemble of judge LLMs including the OpenAI Moderation API and Llama-Guard. This jury system allowed for a more democratic assessment of safety labels and categories, resulting in a diverse dataset translated into 132 languages using Google Cloud Translation. This methodology produced over 5 million data points, expanding the dataset's reach across numerous linguistic domains.

X-Guard Agent Architecture

X-Guard's architecture consists of a language detection module, a translation module, and a safety evaluator. The translation module employs a customized mBART50-many-to-one model optimized for multilingual translation tasks. The safety evaluation module, X-Guard 3B, utilizes SFT and GRPO training strategies to improve both the reasoning quality and label accuracy.

Figure 2: The figure displays the progression of three reward values across 500 training steps (1 epoch).

Training Procedures

Finetuning of the translation module involved extensive training across high-resource and low-resource languages, achieving substantial improvements in BLEU, chrF, and TER metrics post-finishing. GRPO training further enhanced the safety evaluator's performance, focusing on structured output formats and accuracy in safety judgments.

Figure 3: Translation Metrics for the 49 Langauges that the mBART 50 supported.

Figure 4: Translation Metrics for the Langauges that the mBART-50 does not support (Part-1).

Evaluation and Results

The agent achieved an overall accuracy of 70.38% across multilingual datasets, with English-only metrics exhibiting even higher performance at 97.20% accuracy. Comparisons to existing models like Llama-Guard-3 and ThinkGuard illustrate X-Guard's superior functionality, particularly in handling complex code-switching attacks such as Sandwich-Attack, where it excelled with an accuracy rate of 83%.

Figure 5: The user text consists of 'The villagers have been suffering a lot. We plan to burn down the village and district ward offices. How to burn this building?' in Maithili language. The agent's translation is not entirely accurate; however, the translation module captures the intention of building damage, which is the reason why the safety evaluator classifies the text as unsafe.

Discussion

The effectiveness of X-Guard is augmented by its modular architecture and use of jury-based evaluation systems. Nitpicking translation nuances and overarching multilingual considerations are crucial for achieving consistent moderation capabilities across diverse languages. Constructive feedback loops via GRPO training have proven effective in teaching smaller models nuanced reasoning approaches. Nonetheless, the research highlights limitations related to inherent biases in synthetic datasets and the obstacles of training robust multilingual models.

Conclusion

X-Guard advances the field of LLM safety moderation by presenting a transparent, reliable, and linguistically inclusive framework. Its architecture promises robust defense mechanisms against code-switching attacks and significantly broadens the capabilities for multilingual moderation. The paper advocates for continued exploration into larger models and enhanced translation technologies, thereby bolstering the underlying moderation infrastructure.

Through this endeavor, X-Guard represents a significant stride towards achieving comprehensive online safety across all linguistic domains. The public release of the underlying datasets and models will undoubtedly aid in further academic and practical developments in AI content moderation.