Calibrated Regularized Policy Gradient
- CRPG is an algorithmic framework for constrained policy optimization that enforces correctness, safety, and robustness via calibrated rewards.
- It employs a primal-only, non-dual optimization approach with KL divergence regularization to guarantee monotonic policy improvement and suppress constraint violations.
- CRPG integrates a Mixture of Judges mechanism to enforce multiple constraints across tasks, achieving scalable compliance demonstrated by strong empirical benchmarks.
Calibrated Regularized Policy Gradient (CRPG) is a policy-update formulation for constrained policy optimization in reinforcement learning settings where exact constraint satisfaction and robustness to reward hacking are required. CRPG is the single-task component of the Constrained Generative Policy Optimization (CGPO) framework, enforcing correctness and safety constraints in LLM fine-tuning via Reinforcement Learning from Human Feedback (RLHF), and more generally in policy-optimization settings that demand hard compliance. The CRPG update guarantees monotonic improvement of the calibrated objective and polynomially suppressed constraint-violation probability, and it applies in multi-constraint, multi-objective regimes using a primal-only, non-dual optimization approach.
1. Optimization Problem Formulation
CRPG is defined over a Markov decision process in which each state $s$ typically represents a prompt and each action $a$ a complete response (a token sequence for LLMs). The core objective for unconstrained RLHF is:

$\max_{w}\; J(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\big[R(s,a)\big]$

where $R(s,a)$ is a reward model trained to reflect preference-based evaluation. CRPG introduces constraints implemented as sets of feasible pairs $\Sigma_1,\dots,\Sigma_m$ to enforce correctness, safety, factuality, and other criteria. The intersection $\Sigma = \bigcap_{k=1}^{m}\Sigma_k$ represents the global feasible region. KL divergence to a reference policy $\pi_{\mathrm{ref}}$ is bounded as a regularization:

$\mathbb{E}_{s\sim\mathcal{D}}\big[D_{\mathrm{KL}}\big(\pi_w(\cdot\mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\big)\big] \leq \mathrm{KL}_{\max}$
The full CRPG constrained optimization problem is:
$\begin{aligned} &\max_{w}\; J_c(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\left[R_{\mathrm{calib}}(s,a)\,\mathbb{1}_{\{(s,a)\in\Sigma\}}\right] \\ &\text{subject to} \quad \Pr\big[(s,a)\in\Sigma\big] = 1, \quad \mathbb{E}_s\big[D_{\mathrm{KL}}(\pi_w\|\pi_{\mathrm{ref}})\big] \leq \mathrm{KL}_{\max} \end{aligned}$
Here $R_{\mathrm{calib}}$ is the calibrated reward (see below), and the reward is masked to zero outside of $\Sigma$.
2. Calibrated Reward Design
To ensure comparability across prompts and mitigate reward hacking, CRPG utilizes a calibrated reward:

$R_{\mathrm{calib}}(s,a) = \sigma\big(R(s,a) - R(s,a_{\mathrm{ref}})\big)$

Here, $a_{\mathrm{ref}}$ is a reference response for prompt $s$, and $\sigma$ is a scalar normalization (e.g., the sigmoid). The calibration centers each prompt's reward against its reference, eliminating prompt-by-prompt bias and making the gradient update robust to diverse prompt distributions.
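As a concrete sketch of the calibration step, the following minimal Python function centers a response's score against a per-prompt reference and squashes it with a sigmoid. The names `reward_model`, `prompt`, `response`, and `reference_response` are illustrative assumptions, not the paper's API.

```python
import math

def calibrated_reward(reward_model, prompt, response, reference_response):
    """Center the reward for `response` against a per-prompt reference.

    `reward_model(prompt, response)` is a hypothetical scalar-valued callable.
    The sigmoid maps the centered score into (0, 1), so rewards become
    comparable across prompts with very different raw reward scales.
    """
    delta = reward_model(prompt, response) - reward_model(prompt, reference_response)
    return 1.0 / (1.0 + math.exp(-delta))  # sigmoid normalization
```

By construction, the reference response itself always scores 0.5, so calibrated rewards above 0.5 mean "better than the reference for this prompt" regardless of the raw reward scale.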
3. CRPG Algorithmic Update and Constraint Enforcement
The CRPG update step is conducted in a primal-only fashion, without explicit Lagrange multipliers for constraint satisfaction. For parameter vector $w$, the update is

$w_{t+1} = w_t + \eta\,\widehat{\nabla}_w J_c(\pi_{w_t})$

with the calibrated regularized policy gradient

$\widehat{\nabla}_w J_c(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\Big[\mathbb{1}_{\{(s,a)\in\Sigma\}}\, R_{\mathrm{calib}}(s,a)\, \nabla_w \log \pi_w(a\mid s)\Big]$
Updates are computed over samples drawn from $\mathcal{D}$ and $\pi_w$, masked to feasible $(s,a)\in\Sigma$. Infeasible samples receive zero reward gradient. KL regularization is enforced either by excluding samples exceeding the KL threshold or by applying a hard cutoff post-update.
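The masked gradient step can be sketched for a toy softmax policy over a discrete action set. This is an illustrative sketch of the masking mechanics, not the paper's implementation; `crpg_update` and its signature are assumptions.

```python
import math

def crpg_update(logits, rewards_calib, feasible, actions, lr=0.1):
    """One primal-only CRPG step for a toy softmax policy.

    logits: per-action logits (the policy parameters w).
    For each sampled action a with calibrated reward r and feasibility
    flag f, accumulate f * r * grad_w log pi_w(a), then take one
    gradient-ascent step averaged over the batch.
    """
    n = len(logits)
    # softmax probabilities (max-shifted for numerical stability)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = [0.0] * n
    for a, r, f in zip(actions, rewards_calib, feasible):
        if not f:           # infeasible sample: masked, zero gradient
            continue
        for i in range(n):  # grad_i log pi(a) = 1{i == a} - pi(i)
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i])
    k = len(actions)
    return [l + lr * g / k for l, g in zip(logits, grad)]
```

Note that masked samples still count in the batch-size denominator, so a batch full of infeasible responses simply produces a (near-)zero step rather than an amplified one.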
Empirically, this update is implemented using mini-batches of prompts and responses, judge modules for multi-constraint validation, and pre-training (e.g., DPO warm-up) before RLHF fine-tuning. Optionally, multi-task extensions combine per-task gradients with fixed mixture weights for Pareto compromise.
4. Mixture of Judges Mechanism
CRPG enforces constraints using a "Mixture of Judges" (MoJ), a set of judge modules $J_1,\dots,J_m$. Each judge corresponds to a discrete constraint (correctness, safety, instruction following, etc.). Aggregation is defined as:

$J(s,a) = \prod_{k=1}^{m} J_k(s,a), \qquad J_k(s,a)\in\{0,1\}$

If any judge returns zero, $(s,a)$ is masked out of the gradient update. Judges can be rule-based (regex, code execution, string matching) or LLM-based (model-prompted validations for hallucination/safety/refusal). Stratification partitions the prompt pool for multi-task scenarios, allowing per-task reward models, judge sets, and optimizers.
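The product aggregation admits a short-circuiting implementation: one failing judge is enough to mask the sample. A minimal sketch, with hypothetical rule-based judges for illustration:

```python
import re

def aggregate_judges(judges, prompt, response):
    """Mixture-of-Judges feasibility: feasible (1) only if every judge
    returns 1 (product aggregation). Judges are hypothetical callables
    mapping (prompt, response) to 0 or 1."""
    verdict = 1
    for judge in judges:
        verdict *= judge(prompt, response)
        if verdict == 0:   # short-circuit: one failing judge masks the sample
            return 0
    return verdict

# Illustrative rule-based judges (assumptions, not the paper's judge set):
no_banned_word = lambda s, a: 0 if re.search(r"\bbanned\b", a) else 1
nonempty = lambda s, a: 1 if a.strip() else 0
```

Expensive LLM-based judges benefit from the short-circuit ordering: place cheap rule-based checks first so most infeasible samples are rejected before any model call.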
5. Theoretical Guarantees and Convergence Rates
For the single-task CRPG update, theoretical analysis demonstrates that, under mild smoothness assumptions, after $T$ update steps

$J_c(\pi^*) - \max_{t\le T} J_c(\pi_{w_t}) = O\big(T^{-\alpha}\big), \qquad \Pr\big[(s,a)\notin\Sigma\big] = O\big(T^{-\beta}\big)$

for constants $\alpha,\beta > 0$, where $\pi^*$ is the globally optimal feasible policy. This guarantees polynomially fast convergence to optimality and vanishing constraint-violation probability. In multi-task, MoJ-augmented CRPG, the gradient combination yields a Pareto-stationary compromise with proportional respect for individual task objectives.
A plausible implication is that this construction effectively suppresses reward hacking by hard constraint masking, since infeasible samples are excluded from updates regardless of reward magnitudes.
6. Empirical Evaluation and Benchmark Results
CGPO-CRPG, as evaluated in the RLHF context, demonstrates strong empirical performance across diverse domains (chat, instruction following, math, code, knowledge, safety). Key results (main entries from Table 4 of (Xu et al., 2024)):
| Task / Metric | PPO | DPO | CGPO–CRPG |
|---|---|---|---|
| AlpacaEval-2 (chat) | 24.8 | 16.3 | 25.9 |
| Arena-Hard (STEM) | 24.3 | 18.3 | 31.2 |
| IFEval | 0.81 | 0.79 | 0.83 |
| MATH/GSM8K (math) | 0.46/91 | 0.45/90 | 0.48/93 |
| MBPP/HumanEval (code) | 0.002/0.006 | 0.49/0.59 | 0.63/0.76 |
| SVR / FRR (safety violation rate / false refusal rate) | 0.03/0.12 | 0.02/0.17 | 0.05/0.04 |
CRPG outperforms PPO and DPO on coding/math tasks and maintains low safety violation and false refusal rates. Reward hacking is visible in PPO's collapse on 0-shot coding benchmarks, while CRPG maintains integrity. Ablation studies confirm that removing MoJ causes CRPG to also suffer reward hacking. DPO warm-up prior to CGPO-CRPG fine-tuning yields consistently higher final performance.
7. Practical Considerations and Limitations
CRPG requires constructing judge modules for every constraint of interest. The KL budget $\mathrm{KL}_{\max}$ must be set according to how closely the policy needs to track the reference. The empirical protocol typically involves a short DPO warm-up followed by several hundred RLHF-CRPG updates. The constraint-masking mechanism avoids explicit dual variables and global weight tuning over constraints, which improves scalability to extreme multi-objective regimes.
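The sample-exclusion variant of KL enforcement described earlier can be realized as a simple per-batch filter. A minimal sketch, assuming `logp_policy` and `logp_ref` are per-response log-probabilities under the current and reference policies (hypothetical names):

```python
def kl_filter(samples, logp_policy, logp_ref, kl_max):
    """Drop samples whose per-sample KL estimate exceeds the budget.

    Uses the single-sample estimator KL ~ log pi_w(a|s) - log pi_ref(a|s);
    samples above kl_max are excluded from the CRPG gradient batch.
    """
    return [x for x, lp, lr in zip(samples, logp_policy, logp_ref)
            if lp - lr <= kl_max]
```

This keeps the update primal-only: no dual variable for the KL constraint is maintained, the budget is enforced directly on the data that reaches the gradient.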
Limitations include dependence on reliable judge modules—incorrect constraint modules can suppress valid solutions. Hyperparameter choices (KL maximum, mixture weights in multi-task settings) affect final performance. Extension to dynamic or adversarial judge ensembles, or to policy classes with non-differentiable structure, remains open for future investigation.