- The paper introduces DuoGuard, a two-player adversarial RL framework where a generator and guardrail model co-evolve to create synthetic multilingual safety data for training LLM guardrails.
- DuoGuard achieves a nearly 10% performance improvement over LlamaGuard3 (8B) on English benchmarks and offers 4.5× faster inference using a much smaller model (0.5B).
- The framework formalizes the interaction as a two-player game, proves convergence to a Nash equilibrium, and connects its DPO objective to PPO for optimization.
This paper introduces a two-player Reinforcement Learning (RL) framework to address the scarcity of multilingual safety data for training LLM guardrails. The approach involves a generator and a guardrail model that co-evolve adversarially to produce synthetic data for multilingual guardrail training. The interaction between the generator and guardrail model is formalized as a two-player game, with a proof of convergence to a Nash equilibrium.
The paper claims the following:
- The proposed model achieves a nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks.
- The proposed model achieves 4.5× faster inference with a significantly smaller model (0.5B).
The methodology involves an iterative two-player framework: a generator and a guardrail classifier. The generator, denoted $\mathcal{G}_{\bphi}$, takes a sample $\xb_i$ from a seed dataset $S$ and a specified language $\ell$ as input, and outputs a text sequence $\tilde{\xb}_i$ in that language that preserves the toxicity label $y_i$ of $\xb_i$. The defensive classifier, denoted $\mathcal{C}_{\btheta}: \cX \rightarrow y$, takes the generated query as input and outputs the probability of toxicity.
The classifier update at iteration t+1 is defined as:
$\btheta_{t+1} = \argmin_{\btheta} L^t_{\mathcal{C}}(\btheta)$, where $L^t_{\mathcal{C}}(\btheta) = \EE_{\tilde\xb\sim p_{\bphi_t}(\tilde\xb|\xb,y)} \big[-\log p_{\btheta}(y|\tilde\xb)\big]$.
- $\btheta$: Model weight of the classifier
- $L^t_{\mathcal{C}}$: Loss function of the classifier at iteration t
- $p_{\bphi_t}$: Conditional probability distribution of the generator at iteration t
- $\tilde{\xb}$: Generated sequence
- $\xb$: Input sequence
- y: Toxicity label
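As a concrete illustration of the classifier update, the following is a minimal NumPy sketch of the batch loss $-\log p_{\btheta}(y \mid \tilde\xb)$ over generated samples. The binary safe/unsafe head, the toy logits, and the function name `classifier_loss` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def classifier_loss(logits, labels):
    """Cross-entropy loss -log p_theta(y | x~) averaged over a batch
    of generated samples, matching the classifier objective L_C^t."""
    # Softmax over the two classes (0 = safe, 1 = unsafe),
    # shifted by the row max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the true toxicity label per sample.
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Toy batch: 3 generated samples with hypothetical classifier logits.
logits = np.array([[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]])
labels = np.array([0, 1, 1])
loss = classifier_loss(logits, labels)
```

Minimizing this loss over $\btheta$ pushes the classifier to recover the correct label on the generator's adversarial samples.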
The generator update is based on the reward signal $r_t\big((\xb,y), \tilde\xb \big) = - \log p_{\btheta_t} (y | \tilde\xb)$, where a higher value indicates greater vulnerability of the classifier to adversarial samples. The generator $\mathcal{G}_{\bphi}$ is updated by minimizing the Direct Preference Optimization (DPO) objective:
$\bphi_{t+1} = \argmin_{\bphi} L^t_{\mathcal{G}}(\bphi, \bphi_{\text{ref}})$, where $L^t_{\mathcal{G}}(\bphi, \bphi_{\text{ref}}) = \EE_{\tilde \xb_w, \tilde\xb_l \sim p_{\bphi_t}(\tilde \xb|\xb,y)} \bigg[\ell\bigg(\beta \log \frac{p_{\bphi}(\tilde{\xb}_w | \xb, y)}{p_{\bphi_{\text{ref}}}(\tilde{\xb}_w | \xb, y)} - \beta \log \frac{p_{\bphi}(\tilde{\xb}_l | \xb, y)}{p_{\bphi_{\text{ref}}}(\tilde{\xb}_l | \xb, y)}\bigg)\bigg]$, with the preference $\tilde \xb_w \succ \tilde\xb_l$ assigned according to $\PP_t(\tilde \xb_w \succ \tilde\xb_l \mid \xb,y)$.
- $\bphi$: Model weight of the generator
- $L^t_{\mathcal{G}}$: Loss function of the generator at iteration t
- $p_{\bphi_t}$: Conditional probability distribution of the generator at iteration t
- $\tilde{\xb}_w$: Preferred generated sample
- $\tilde{\xb}_l$: Dispreferred generated sample
- $\PP_t$: Probability of preferring $\tilde{\xb}_w$ over $\tilde{\xb}_l$ at iteration t
- β: Regularization parameter
- $\bphi_{\text{ref}}$: Reference generator model
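The DPO objective above can be sketched for a single preference pair, with $\ell$ taken as the standard negative log-sigmoid. This is a hedged NumPy illustration; the function name and the toy log-probabilities are invented for the example and are not the paper's training code.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair:
    -log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Case 1: current policy already favors the preferred sample
# relative to the reference model -> smaller loss.
low = dpo_loss(logp_w=-5.0, logp_l=-7.0, logp_ref_w=-6.0, logp_ref_l=-6.0)

# Case 2: current policy favors the dispreferred sample -> larger loss.
high = dpo_loss(logp_w=-7.0, logp_l=-5.0, logp_ref_w=-6.0, logp_ref_l=-6.0)
```

Minimizing this loss shifts probability mass toward $\tilde{\xb}_w$, i.e., toward samples the current classifier finds hardest, while $\beta$ regularizes drift from $\bphi_{\text{ref}}$.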
The paper connects the DPO objective with the Proximal Policy Optimization (PPO) training objective, demonstrating that the algorithm optimizes a minimax game with the objective:
$\min_{p_{\btheta}} \max_{p_{\bphi}} \EE_{\tilde{\xb} \sim p_{\bphi}}\big[- \log p_{\btheta} (y | \tilde \xb) \big] - \beta D_{\text{KL}}(p_{\bphi} \,\|\, p_{\text{ref}})$.
- $p_{\btheta}$: Probability distribution of the classifier
- $p_{\bphi}$: Probability distribution of the generator
- $\tilde{\xb}$: Generated sample
- y: Toxicity label
- $D_{\text{KL}}$: Kullback-Leibler divergence
The paper includes a theorem stating that the minimax game admits a Nash equilibrium, and the iterative updates converge linearly to the Nash equilibrium with an appropriately chosen regularization parameter β.
The method uses two distinct prompts $\bc_{y:y=\pm 1}$ for generating samples, based on whether the input is safe or unsafe: $\tilde{\xb}^{(i)} \sim p_{\bphi_{t-1}}(\tilde{\xb} \mid \xb^{(i)}, \bc_{y^{(i)}})$. The training data $S^{(t)}$ at iteration t is augmented exclusively with misclassified synthetic samples. A multi-label classification setup is adopted, using binary cross-entropy loss for each of the 12 defined harmful classes.
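The hard-example augmentation step can be sketched as a simple filter that keeps only synthetic samples the current classifier mislabels. The function name, the 0.5 decision threshold, and the toy data below are assumptions for illustration; the paper's exact filtering details are not reproduced here.

```python
import numpy as np

def augment_with_misclassified(synthetic, labels, probs_unsafe, threshold=0.5):
    """Keep only synthetic samples the current classifier gets wrong,
    mirroring the augmentation of the training data with hard examples."""
    preds = (probs_unsafe >= threshold).astype(int)
    mask = preds != labels  # misclassified samples only
    return [s for s, keep in zip(synthetic, mask) if keep]

samples = ["q1", "q2", "q3", "q4"]            # generated queries
labels = np.array([1, 0, 1, 0])               # 1 = unsafe, 0 = safe
p_unsafe = np.array([0.9, 0.8, 0.2, 0.1])     # classifier's unsafe scores
hard = augment_with_misclassified(samples, labels, p_unsafe)
```

Here "q2" (safe but scored 0.8) and "q3" (unsafe but scored 0.2) survive the filter, so the next classifier update concentrates on the generator's successful attacks.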
The paper uses Qwen2.5-0.5B and Qwen2.5-1.5B as the base models for the classifier, and dolphin-2.9.4-llama3.1-8b as the base model for the generator. It is compared against LlamaGuard3 (1B), ShieldGemma (2B), LlamaGuard2 (8B), and LlamaGuard3 (8B). The seed dataset combines existing open-source data related to safety and toxicity.
The evaluation is conducted in English, French, German, and Spanish, using six safety datasets: XSTest, ToxicChat, OpenAI Moderation, Beavertails, RTP-LX, and XSafety. The paper also includes a weak-to-strong generalization experiment, using the training data generated by the two-player framework to train Llama-3.2 (1B) and Qwen-2.5 (1.5B). An ablation study is performed to evaluate the impact of multilingual data and synthetic data.