Calibrated Regularized Policy Gradient
- CRPG is an algorithmic framework for constrained policy optimization that enforces correctness, safety, and robustness via calibrated rewards.
- It employs a primal-only, non-dual optimization approach with KL divergence regularization to guarantee monotonic policy improvement and suppress constraint violations.
- CRPG integrates a Mixture of Judges mechanism to enforce multiple constraints across tasks, achieving scalable compliance demonstrated by strong empirical benchmarks.
Calibrated Regularized Policy Gradient (CRPG) is a policy-update formulation for constrained policy optimization in reinforcement learning settings where exact constraint satisfaction and robustness to reward hacking are required. CRPG is the single-task component of the Constrained Generative Policy Optimization (CGPO) framework, enforcing correctness and safety constraints in LLM fine-tuning via Reinforcement Learning from Human Feedback (RLHF), and more generally in policy-optimization settings that demand hard compliance. The CRPG update guarantees monotonic improvement of the calibrated objective and polynomially suppressed constraint-violation probability, and it applies in multi-constraint, multi-objective regimes using a primal-only, non-dual optimization approach.
1. Optimization Problem Formulation
CRPG is defined over a Markov decision process in which each state $s$ typically represents a prompt and each action $a$ a complete response (a token sequence for LLMs). The core objective for unconstrained RLHF is:

$\max_{w}\; J(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\big[R(s,a)\big]$

where $R(s,a)$ is a reward model trained to reflect preference-based evaluation. CRPG introduces constraints implemented as sets of feasible pairs $\Sigma_1,\dots,\Sigma_m$ to enforce correctness, safety, factuality, and other criteria. The intersection $\Sigma = \bigcap_{k=1}^{m}\Sigma_k$ represents the global feasible region. KL divergence to a reference policy $\pi_{\mathrm{ref}}$ is bounded as a regularization:

$\mathbb{E}_{s\sim\mathcal{D}}\big[D_{\mathrm{KL}}\big(\pi_w(\cdot\mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\big)\big] \leq \mathrm{KL}_{\max}$
The full CRPG constrained optimization problem is:
$\begin{aligned} &\max_{w}\; J_c(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\left[R_{\mathrm{calib}}(s,a)\,\mathbb{1}_{\{(s,a)\in\Sigma\}}\right] \\ &\text{subject to} \quad \Pr\big[(s,a)\in\Sigma\big] = 1, \quad \mathbb{E}_s\big[D_{\mathrm{KL}}(\pi_w\|\pi_{\mathrm{ref}})\big] \leq \mathrm{KL}_{\max} \end{aligned}$
Here $R_{\mathrm{calib}}$ is the calibrated reward (see below), and the reward is masked to zero outside of $\Sigma$.
2. Calibrated Reward Design
To ensure comparability across prompts and mitigate reward hacking, CRPG utilizes a calibrated reward:

$R_{\mathrm{calib}}(s,a) = \sigma\big(R(s,a) - R(s,a_{\mathrm{ref}})\big)$

Here, $a_{\mathrm{ref}}$ is a reference response for prompt $s$, and $\sigma$ is a scalar normalization (e.g., the sigmoid). The calibration centers each prompt's reward against its reference, eliminating prompt-by-prompt bias and making the gradient update robust to diverse prompt distributions.
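As a concrete sketch of the calibration step, the following minimal Python function centers a response's score against a per-prompt reference and squashes it with a sigmoid. The names `reward_model`, `prompt`, `response`, and `reference_response` are illustrative assumptions, not the paper's API.

```python
import math

def calibrated_reward(reward_model, prompt, response, reference_response):
    """Center the reward for `response` against a per-prompt reference.

    `reward_model(prompt, response)` is a hypothetical scalar-valued callable.
    The sigmoid maps the centered score into (0, 1), so rewards become
    comparable across prompts with very different raw reward scales.
    """
    delta = reward_model(prompt, response) - reward_model(prompt, reference_response)
    return 1.0 / (1.0 + math.exp(-delta))  # sigmoid normalization
```

By construction, the reference response itself always scores 0.5, so calibrated rewards above 0.5 mean "better than the reference for this prompt" regardless of the raw reward scale.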
3. CRPG Algorithmic Update and Constraint Enforcement
The CRPG update step is conducted in a primal-only fashion, without explicit Lagrange multipliers for constraint satisfaction. For parameter vector $w$, the update is

$w_{t+1} = w_t + \eta\,\widehat{\nabla}_w J_c(\pi_{w_t})$

with the calibrated regularized policy gradient

$\widehat{\nabla}_w J_c(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\Big[\mathbb{1}_{\{(s,a)\in\Sigma\}}\, R_{\mathrm{calib}}(s,a)\, \nabla_w \log \pi_w(a\mid s)\Big]$
Updates are computed over samples drawn from $\mathcal{D}$ and $\pi_w$, masked to feasible $(s,a)\in\Sigma$. Infeasible samples receive zero reward gradient. KL regularization is enforced either by excluding samples exceeding the KL threshold or by applying a hard cutoff post-update.
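The masked gradient step can be sketched for a toy softmax policy over a discrete action set. This is an illustrative sketch of the masking mechanics, not the paper's implementation; `crpg_update` and its signature are assumptions.

```python
import math

def crpg_update(logits, rewards_calib, feasible, actions, lr=0.1):
    """One primal-only CRPG step for a toy softmax policy.

    logits: per-action logits (the policy parameters w).
    For each sampled action a with calibrated reward r and feasibility
    flag f, accumulate f * r * grad_w log pi_w(a), then take one
    gradient-ascent step averaged over the batch.
    """
    n = len(logits)
    # softmax probabilities (max-shifted for numerical stability)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = [0.0] * n
    for a, r, f in zip(actions, rewards_calib, feasible):
        if not f:           # infeasible sample: masked, zero gradient
            continue
        for i in range(n):  # grad_i log pi(a) = 1{i == a} - pi(i)
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i])
    k = len(actions)
    return [l + lr * g / k for l, g in zip(logits, grad)]
```

Note that masked samples still count in the batch-size denominator, so a batch full of infeasible responses simply produces a (near-)zero step rather than an amplified one.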
Empirically, this update is implemented using mini-batches of prompts and responses, judge modules for multi-constraint validation, and pre-training (e.g., DPO warm-up) before RLHF fine-tuning. Optionally, multi-task extensions combine per-task gradients with fixed mixture weights for Pareto compromise.
4. Mixture of Judges Mechanism
CRPG enforces constraints using a "Mixture of Judges" (MoJ), a set of judge modules $J_1,\dots,J_m$. Each judge corresponds to a discrete constraint (correctness, safety, instruction following, etc.). Aggregation is defined as:

$J(s,a) = \prod_{k=1}^{m} J_k(s,a), \qquad J_k(s,a)\in\{0,1\}$

If any judge returns zero, $(s,a)$ is masked out of the gradient update. Judges can be rule-based (regex, code execution, string matching) or LLM-based (model-prompted validations for hallucination/safety/refusal). Stratification partitions the prompt pool for multi-task scenarios, allowing per-task reward models, judge sets, and optimizers.
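The product aggregation admits a short-circuiting implementation: one failing judge is enough to mask the sample. A minimal sketch, with hypothetical rule-based judges for illustration:

```python
import re

def aggregate_judges(judges, prompt, response):
    """Mixture-of-Judges feasibility: feasible (1) only if every judge
    returns 1 (product aggregation). Judges are hypothetical callables
    mapping (prompt, response) to 0 or 1."""
    verdict = 1
    for judge in judges:
        verdict *= judge(prompt, response)
        if verdict == 0:   # short-circuit: one failing judge masks the sample
            return 0
    return verdict

# Illustrative rule-based judges (assumptions, not the paper's judge set):
no_banned_word = lambda s, a: 0 if re.search(r"\bbanned\b", a) else 1
nonempty = lambda s, a: 1 if a.strip() else 0
```

Expensive LLM-based judges benefit from the short-circuit ordering: place cheap rule-based checks first so most infeasible samples are rejected before any model call.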
5. Theoretical Guarantees and Convergence Rates
For the single-task CRPG update, theoretical analysis demonstrates that, under mild smoothness assumptions, after $T$ update steps

$J_c(\pi^*) - \max_{t\le T} J_c(\pi_{w_t}) = O\big(T^{-\alpha}\big), \qquad \Pr\big[(s,a)\notin\Sigma\big] = O\big(T^{-\beta}\big)$

for constants $\alpha,\beta > 0$, where $\pi^*$ is the globally optimal feasible policy. This guarantees polynomially fast convergence to optimality and vanishing constraint-violation probability. In multi-task, MoJ-augmented CRPG, the gradient combination yields a Pareto-stationary compromise with proportional respect for individual task objectives.
A plausible implication is that this construction effectively suppresses reward hacking by hard constraint masking, since infeasible samples are excluded from updates regardless of reward magnitudes.
6. Empirical Evaluation and Benchmark Results
CGPO-CRPG, as evaluated in the RLHF context, demonstrates strong empirical performance across diverse domains (chat, instruction following, math, code, knowledge, safety). Key results (main entries from Table 4 of (Xu et al., 2024)):
| Task / Metric | PPO | DPO | CGPO–CRPG |
|---|---|---|---|
| AlpacaEval-2 (chat) | 24.8 | 16.3 | 25.9 |
| Arena-Hard (STEM) | 24.3 | 18.3 | 31.2 |
| IFEval | 0.81 | 0.79 | 0.83 |
| MATH/GSM8K (math) | 0.46/91 | 0.45/90 | 0.48/93 |
| MBPP/HumanEval (code) | 0.002/0.006 | 0.49/0.59 | 0.63/0.76 |
| SVR / FRR (safety violation rate / false refusal rate) | 0.03/0.12 | 0.02/0.17 | 0.05/0.04 |
CRPG outperforms PPO and DPO on coding/math tasks and maintains low safety violation and false refusal rates. Reward hacking is visible in PPO's collapse on 0-shot coding benchmarks, while CRPG maintains integrity. Ablation studies confirm that removing MoJ causes CRPG to also suffer reward hacking. DPO warm-up prior to CGPO-CRPG fine-tuning yields consistently higher final performance.
7. Practical Considerations and Limitations
CRPG requires constructing judge modules for every constraint of interest. The KL budget $\mathrm{KL}_{\max}$ must be set according to how closely the policy needs to track the reference. The empirical protocol typically involves a short DPO warm-up followed by several hundred RLHF-CRPG updates. The constraint-masking mechanism avoids explicit dual variables and global weight tuning over constraints, which improves scalability to extreme multi-objective regimes.
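The sample-exclusion variant of KL enforcement described earlier can be realized as a simple per-batch filter. A minimal sketch, assuming `logp_policy` and `logp_ref` are per-response log-probabilities under the current and reference policies (hypothetical names):

```python
def kl_filter(samples, logp_policy, logp_ref, kl_max):
    """Drop samples whose per-sample KL estimate exceeds the budget.

    Uses the single-sample estimator KL ~ log pi_w(a|s) - log pi_ref(a|s);
    samples above kl_max are excluded from the CRPG gradient batch.
    """
    return [x for x, lp, lr in zip(samples, logp_policy, logp_ref)
            if lp - lr <= kl_max]
```

This keeps the update primal-only: no dual variable for the KL constraint is maintained, the budget is enforced directly on the data that reaches the gradient.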
Limitations include dependence on reliable judge modules—incorrect constraint modules can suppress valid solutions. Hyperparameter choices (KL maximum, mixture weights in multi-task settings) affect final performance. Extension to dynamic or adversarial judge ensembles, or to policy classes with non-differentiable structure, remains open for future investigation.