- The paper introduces DuoGuard, a two-player adversarial RL framework where a generator and guardrail model co-evolve to create synthetic multilingual safety data for training LLM guardrails.
- DuoGuard achieves a nearly 10% performance improvement over LlamaGuard3 (8B) on English benchmarks and offers 4.5× faster inference using a much smaller model (0.5B).
- The framework formalizes the interaction as a two-player game, proves convergence to a Nash equilibrium, and connects its DPO objective to PPO for optimization.
This paper introduces a two-player Reinforcement Learning (RL) framework to address the scarcity of multilingual safety data for training LLM guardrails. The approach involves a generator and a guardrail model that co-evolve adversarially to produce synthetic data for multilingual guardrail training. The interaction between the generator and guardrail model is formalized as a two-player game, with a proof of convergence to a Nash equilibrium.
The paper claims the following:
- The proposed model achieves a nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks.
- The proposed model achieves 4.5× faster inference with a significantly smaller model (0.5B).
The methodology involves an iterative two-player framework: a generator and a guardrail classifier. The generator, denoted $\mathcal{G}_{\bphi}$, takes a sample $\xb_i$ from a seed dataset $S$ and a specified language $\ell$ as input, and outputs a text sequence $\tilde{\xb}_i$ in that language that preserves the toxicity label $y_i$ of $\xb_i$. The defensive classifier, denoted $\mathcal{C}_{\btheta}: \cX \rightarrow y$, takes the generated query as input and outputs the probability of toxicity.
The classifier update at iteration t+1 is defined as:
$\btheta_{t+1} = \argmin_{\btheta} L^t_{\mathcal{C}}(\btheta)$, where $L^t_{\mathcal{C}}(\btheta) = \EE_{\tilde\xb\sim p_{\bphi_t}(\tilde\xb|\xb,y)} \big[-\log p_{\btheta}(y|\tilde\xb)\big]$.
- $\btheta$: Model weight of the classifier
- $L^t_{\mathcal{C}}$: Loss function of the classifier at iteration t
- $p_{\bphi_t}$: Conditional probability distribution of the generator at iteration t
- $\tilde{\xb}$: Generated sequence
- $\xb$: Input sequence
- y: Toxicity label
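As a concrete illustration of the classifier update, the following is a minimal NumPy sketch of the batch loss $-\log p_{\btheta}(y \mid \tilde\xb)$ over generated samples. The binary safe/unsafe head, the toy logits, and the function name `classifier_loss` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def classifier_loss(logits, labels):
    """Cross-entropy loss -log p_theta(y | x~) averaged over a batch
    of generated samples, matching the classifier objective L_C^t."""
    # Softmax over the two classes (0 = safe, 1 = unsafe),
    # shifted by the row max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the true toxicity label per sample.
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Toy batch: 3 generated samples with hypothetical classifier logits.
logits = np.array([[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]])
labels = np.array([0, 1, 1])
loss = classifier_loss(logits, labels)
```

Minimizing this loss over $\btheta$ pushes the classifier to recover the correct label on the generator's adversarial samples.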
The generator update is based on the reward signal $r_t\big((\xb,y), \tilde\xb \big) = - \log p_{\btheta_t} (y | \tilde\xb)$, where a higher value indicates greater vulnerability of the classifier to adversarial samples. The generator $\mathcal{G}_{\bphi}$ is updated by minimizing the Direct Preference Optimization (DPO) objective:
$\bphi_{t+1} = \argmin_{\bphi} L^t_{\mathcal{G}}(\bphi, \bphi_{\text{ref}})$, where $L^t_{\mathcal{G}}(\bphi, \bphi_{\text{ref}}) = \EE_{\tilde \xb_w, \tilde\xb_l \sim p_{\bphi_t}(\tilde \xb|\xb,y)} \bigg[\ell\bigg(\beta \log \frac{p_{\bphi}(\tilde{\xb}_w | \xb, y)}{p_{\bphi_{\text{ref}}}(\tilde{\xb}_w | \xb, y)} - \beta \log \frac{p_{\bphi}(\tilde{\xb}_l | \xb, y)}{p_{\bphi_{\text{ref}}}(\tilde{\xb}_l | \xb, y)}\bigg)\bigg]$, with the preference $\tilde \xb_w \succ \tilde\xb_l$ assigned according to $\PP_t(\tilde \xb_w \succ \tilde\xb_l \mid \xb,y)$.
- $\bphi$: Model weight of the generator
- $L^t_{\mathcal{G}}$: Loss function of the generator at iteration t
- $p_{\bphi_t}$: Conditional probability distribution of the generator at iteration t
- $\tilde{\xb}_w$: Preferred generated sample
- $\tilde{\xb}_l$: Dispreferred generated sample
- $\PP_t$: Probability of preferring $\tilde{\xb}_w$ over $\tilde{\xb}_l$ at iteration t
- β: Regularization parameter
- $\bphi_{\text{ref}}$: Reference generator model
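The DPO objective above can be sketched for a single preference pair, with $\ell$ taken as the standard negative log-sigmoid. This is a hedged NumPy illustration; the function name and the toy log-probabilities are invented for the example and are not the paper's training code.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss for one (preferred, dispreferred) pair:
    -log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Case 1: current policy already favors the preferred sample
# relative to the reference model -> smaller loss.
low = dpo_loss(logp_w=-5.0, logp_l=-7.0, logp_ref_w=-6.0, logp_ref_l=-6.0)

# Case 2: current policy favors the dispreferred sample -> larger loss.
high = dpo_loss(logp_w=-7.0, logp_l=-5.0, logp_ref_w=-6.0, logp_ref_l=-6.0)
```

Minimizing this loss shifts probability mass toward $\tilde{\xb}_w$, i.e., toward samples the current classifier finds hardest, while $\beta$ regularizes drift from $\bphi_{\text{ref}}$.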
The paper connects the DPO objective with the Proximal Policy Optimization (PPO) training objective, demonstrating that the algorithm optimizes a minimax game with the objective:
$\min_{p_{\btheta}} \max_{p_{\bphi}} \EE_{\tilde{\xb} \sim p_{\bphi}}\big[- \log p_{\btheta} (y | \tilde \xb) \big] - \beta D_{\text{KL}}(p_{\bphi} \,\|\, p_{\text{ref}})$.
- $p_{\btheta}$: Probability distribution of the classifier
- $p_{\bphi}$: Probability distribution of the generator
- $\tilde{\xb}$: Generated sample
- y: Toxicity label
- $D_{\text{KL}}$: Kullback-Leibler divergence
The paper includes a theorem stating that the minimax game admits a Nash equilibrium, and the iterative updates converge linearly to the Nash equilibrium with an appropriately chosen regularization parameter β.
The method uses two distinct prompts $\bc_{y:y=\pm 1}$ for generating samples, based on whether the input is safe or unsafe: $\tilde{\xb}^{(i)} \sim p_{\bphi_{t-1}}(\tilde{\xb} \mid \xb^{(i)}, \bc_{y^{(i)}})$. The training data $S^{(t)}$ at iteration t is augmented exclusively with misclassified synthetic samples. A multi-label classification setup is adopted, using binary cross-entropy loss for each of the 12 defined harmful classes.
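The hard-example augmentation step can be sketched as a simple filter that keeps only synthetic samples the current classifier mislabels. The function name, the 0.5 decision threshold, and the toy data below are assumptions for illustration; the paper's exact filtering details are not reproduced here.

```python
import numpy as np

def augment_with_misclassified(synthetic, labels, probs_unsafe, threshold=0.5):
    """Keep only synthetic samples the current classifier gets wrong,
    mirroring the augmentation of the training data with hard examples."""
    preds = (probs_unsafe >= threshold).astype(int)
    mask = preds != labels  # misclassified samples only
    return [s for s, keep in zip(synthetic, mask) if keep]

samples = ["q1", "q2", "q3", "q4"]            # generated queries
labels = np.array([1, 0, 1, 0])               # 1 = unsafe, 0 = safe
p_unsafe = np.array([0.9, 0.8, 0.2, 0.1])     # classifier's unsafe scores
hard = augment_with_misclassified(samples, labels, p_unsafe)
```

Here "q2" (safe but scored 0.8) and "q3" (unsafe but scored 0.2) survive the filter, so the next classifier update concentrates on the generator's successful attacks.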
The paper uses Qwen2.5-0.5B and Qwen2.5-1.5B as the base models for the classifier, and dolphin-2.9.4-llama3.1-8b as the base model for the generator. It is compared against LlamaGuard3 (1B), ShieldGemma (2B), LlamaGuard2 (8B), and LlamaGuard3 (8B). The seed dataset combines existing open-source data related to safety and toxicity.
The evaluation is conducted in English, French, German, and Spanish, using six safety datasets: XSTest, ToxicChat, OpenAI Moderation, Beavertails, RTP-LX, and XSafety. The paper also includes a weak-to-strong generalization experiment, using the training data generated by the two-player framework to train Llama-3.2 (1B) and Qwen-2.5 (1.5B). An ablation study is performed to evaluate the impact of multilingual data and synthetic data.