
Calibrated Regularized Policy Gradient

Updated 31 January 2026
  • CRPG is an algorithmic framework for constrained policy optimization that enforces correctness, safety, and robustness via calibrated rewards.
  • It employs a primal-only, non-dual optimization approach with KL divergence regularization to guarantee monotonic policy improvement and suppress constraint violations.
  • CRPG integrates a Mixture of Judges mechanism to enforce multiple constraints across tasks, achieving scalable constraint compliance, as demonstrated by strong empirical benchmarks.

Calibrated Regularized Policy Gradient (CRPG) is an algorithmic update formulation designed for constrained policy optimization in reinforcement learning settings where exact constraint satisfaction and robustness to reward hacking are required. CRPG constitutes a single-task component within the Constrained Generative Policy Optimization (CGPO) paradigm, enforcing correctness and safety constraints in LLM fine-tuning via Reinforcement Learning from Human Feedback (RLHF), and more generally in policy optimization settings requiring hard compliance. The CRPG update guarantees monotone improvement of the calibrated objective and polynomially suppressed constraint-violation probability, and it applies to multi-constraint, multi-objective regimes via a non-dual, primal-only optimization approach.

1. Optimization Problem Formulation

CRPG is defined over a Markov decision process where each state $s$ typically represents a prompt and each action $a$ represents a complete response (a token sequence for LLMs). The core objective for unconstrained RLHF is:

$\max_{w}\; J(\pi_w) = \mathbb{E}_{s\sim \mathcal{D},\, a\sim\pi_w(\cdot|s)}\big[r_\phi(s,a)\big]$

where $r_\phi(s,a)$ is a reward model trained to reflect preference-based evaluation. CRPG introduces $M$ constraints $\{C_k\}$, implemented as sets $\Sigma_k$ of feasible $(s,a)$ pairs, to enforce correctness, safety, factuality, and other criteria. The intersection $\Sigma = \bigcap_k \Sigma_k$ represents the global feasible region. KL divergence to a reference policy $\pi_{\rm ref}$ is bounded as a regularization:

$\mathbb{E}_{s\sim\mathcal{D}}\, D_{\mathrm{KL}}\big(\pi_w(\cdot|s)\,\|\,\pi_{\rm ref}(\cdot|s)\big) \le \mathrm{KL}_{\max}$

The full CRPG constrained optimization problem is:

$\begin{aligned} &\max_{w}\; J_c(\pi_w) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_w}\left[R_{\mathrm{calib}}(s,a)\,\mathbb{1}_{\{(s,a)\in\Sigma\}}\right] \\ &\text{subject to} \quad \Pr\big[(s,a)\in\Sigma\big] = 1, \quad \mathbb{E}_s\big[D_{\mathrm{KL}}(\pi_w\,\|\,\pi_{\rm ref})\big] \leq \mathrm{KL}_{\max} \end{aligned}$

$R_{\mathrm{calib}}(s,a)$ is the calibrated reward (see below), and the reward is masked to zero outside of $\Sigma$.
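A minimal Monte Carlo sketch of estimating this masked objective, assuming hypothetical callables `sample_response` (draws $a\sim\pi_w(\cdot|s)$), `calibrated_reward`, and a list of binary `judges`; none of these names come from the source:

```python
def estimate_objective(prompts, sample_response, calibrated_reward, judges):
    """Monte Carlo estimate of J_c(pi_w) = E[R_calib(s, a) * 1{(s, a) in Sigma}]."""
    total = 0.0
    for s in prompts:
        a = sample_response(s)  # a ~ pi_w(.|s)
        in_sigma = all(judge(s, a) == 1 for judge in judges)  # indicator for Sigma
        total += calibrated_reward(s, a) if in_sigma else 0.0  # reward masked to zero outside Sigma
    return total / len(prompts)
```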

2. Calibrated Reward Design

To ensure comparability and mitigate reward hacking, CRPG utilizes a calibrated reward:

$R_{\mathrm{calib}}(s,a) = \sigma\big(r_\phi(s,a) - r_\phi(s,\bar a)\big), \qquad 0 \le R_{\mathrm{calib}} \le 1$

Here, $\bar a \sim \pi_{\rm ref}(\cdot|s)$ is a reference response, and $\sigma$ is a scalar normalization (e.g., the sigmoid). The calibration centers each prompt's reward against its reference, eliminating prompt-by-prompt bias and making the gradient update robust to diverse prompt distributions.
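As a concrete sketch of the calibration step, assuming a hypothetical `reward_model` callable that returns a scalar score for a prompt-response pair (the names are illustrative):

```python
import math

def calibrated_reward(reward_model, prompt, response, ref_response):
    """R_calib(s, a) = sigmoid(r_phi(s, a) - r_phi(s, a_bar)), bounded in (0, 1)."""
    delta = reward_model(prompt, response) - reward_model(prompt, ref_response)
    return 1.0 / (1.0 + math.exp(-delta))  # sigmoid as the normalization sigma
```

Because only the difference against the per-prompt reference enters the sigmoid, a response scoring 0.5 above its reference yields $R_{\mathrm{calib}} \approx 0.62$ regardless of the prompt's absolute reward scale.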

3. CRPG Algorithmic Update and Constraint Enforcement

The CRPG update step is conducted in a primal-only fashion, without explicit Lagrange multipliers for constraint satisfaction. For the parameter vector $w$, the update is

$w_{t+1} = w_t + \alpha_t\, g_c(\pi_{w_t})$

with the calibrated regularized policy gradient

$g_c(\pi_w) = \mathbb{E}\big[\nabla_w \log \pi_w(a\mid s)\, R_{\mathrm{calib}}(s,a)\big]$

Updates are computed over samples $(s,a)$ drawn from $\mathcal{D}$ and $\pi_{w_t}$, masked to feasible $(s,a)\in\Sigma$. Infeasible samples receive zero reward gradient. KL regularization is enforced either by excluding samples exceeding the KL threshold or by applying a hard cutoff post-update.

Empirically, this update is implemented using mini-batches of prompts and responses, judge modules for multi-constraint validation, and pre-training (e.g., DPO warm-up) before RLHF fine-tuning. Optionally, multi-task extensions combine per-task gradients with fixed mixture weights for Pareto compromise.
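A minimal PyTorch sketch of one such mini-batch update; `policy.sample`, `policy.logprob`, `reward_model`, and the judge interface are illustrative assumptions, not the paper's released implementation:

```python
import torch

def crpg_step(policy, ref_policy, reward_model, judges, prompts, optimizer, kl_max=10.0):
    """One primal-only CRPG update: masked, calibrated REINFORCE with a hard KL cutoff."""
    optimizer.zero_grad()
    losses = []
    for s in prompts:
        a = policy.sample(s)                   # a ~ pi_w(.|s)  (assumed API)
        a_bar = ref_policy.sample(s)           # reference response a_bar ~ pi_ref(.|s)
        if any(judge(s, a) == 0 for judge in judges):
            continue                           # infeasible sample: zero reward gradient
        logp = policy.logprob(s, a)            # differentiable log pi_w(a|s)
        kl_est = (logp - ref_policy.logprob(s, a)).detach()  # single-sample KL estimate
        if kl_est.item() > kl_max:
            continue                           # hard KL cutoff
        r = torch.sigmoid(reward_model(s, a) - reward_model(s, a_bar)).detach()
        losses.append(-logp * r)               # surrogate loss: grad = -R_calib * grad log pi
    if losses:
        torch.stack(losses).mean().backward()  # g_c estimate over the feasible mini-batch
        optimizer.step()
```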

4. Mixture of Judges Mechanism

CRPG enforces constraints using a "Mixture of Judges" (MoJ): a set of $M$ judge modules $J_h(s,a)\in\{0,1\}$. Each judge corresponds to a discrete constraint (correctness, safety, instruction following, etc.). Aggregation is defined as:

$\mathbb{1}_{\{(s,a)\in\Sigma\}} = \prod_{h=1}^{M} J_h(s,a)$

If any judge returns zero, $(s,a)$ is masked out of the gradient update. Judges can be rule-based (regex, code execution, string matching) or LLM-based (model-prompted validations for hallucination, safety, or refusal). Stratification partitions the prompt pool in multi-task scenarios, allowing per-task reward models, judge sets, and optimizers.
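A minimal sketch of the aggregation with a toy rule-based judge; the helper names here are hypothetical:

```python
import re

def regex_judge(pattern):
    """Rule-based judge: J_h(s, a) = 1 iff the response matches `pattern`."""
    return lambda prompt, response: int(re.search(pattern, response) is not None)

def in_sigma(judges, prompt, response):
    """Indicator 1{(s, a) in Sigma} as the product of all judge outputs."""
    mask = 1
    for judge in judges:
        mask *= judge(prompt, response)
        if mask == 0:  # short-circuit: a single failing judge masks the sample
            break
    return mask
```

For example, a code task might combine a regex judge for output formatting with an execution-based judge for unit-test correctness; a single failure zeroes the sample's contribution to the gradient.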

5. Theoretical Guarantees and Convergence Rates

For the single-task CRPG update, theoretical analysis demonstrates that, under mild smoothness assumptions:

$J_c(\pi^*_c) - J_c(\pi_{w_t}) = O\big(1/\mathrm{poly}(t)\big), \qquad \Pr_{\pi_{w_t}}\big[(s,a)\notin\Sigma\big] = O\big(1/\mathrm{poly}(t)\big)$

where $\pi^*_c$ is the globally optimal feasible policy. This guarantees polynomially fast convergence to optimality and a vanishing constraint-violation probability. In multi-task MoJ-augmented CRPG, the gradient combination yields a Pareto-stationary compromise that respects individual task objectives in proportion to their mixture weights.

A plausible implication is that this construction effectively suppresses reward hacking by hard constraint masking, since infeasible samples are excluded from updates regardless of reward magnitudes.

6. Empirical Evaluation and Benchmark Results

CGPO-CRPG, as evaluated in the RLHF context, demonstrates strong empirical performance across diverse domains (chat, instruction following, math, code, knowledge, safety). Key results (main entries from Table 4 of Xu et al. (2024)):

| Task / Metric | PPO | DPO | CGPO–CRPG |
|---|---|---|---|
| AlpacaEval-2 (chat) | 24.8 | 16.3 | 25.9 |
| Arena-Hard (STEM) | 24.3 | 18.3 | 31.2 |
| IFEval | 0.81 | 0.79 | 0.83 |
| MATH / GSM8K (math) | 0.46 / 91 | 0.45 / 90 | 0.48 / 93 |
| MBPP / HumanEval (code) | 0.002 / 0.006 | 0.49 / 0.59 | 0.63 / 0.76 |
| SVR / FRR (safety violation rate / false refusal rate) | 0.03 / 0.12 | 0.02 / 0.17 | 0.05 / 0.04 |

CRPG outperforms PPO and DPO on coding and math tasks while maintaining low safety violation and false refusal rates. Reward hacking is visible in PPO's collapse on the 0-shot coding benchmarks, while CRPG maintains integrity. Ablation studies confirm that removing MoJ causes CRPG to suffer reward hacking as well. A DPO warm-up prior to CGPO-CRPG fine-tuning yields consistently higher final performance.

7. Practical Considerations and Limitations

CRPG requires constructing judge modules for every constraint of interest. The KL budget $\mathrm{KL}_{\max}$ must be set according to how closely the policy should track the reference. The empirical protocol typically involves a short DPO warm-up followed by several hundred RLHF-CRPG updates. The constraint-masking mechanism avoids the need for explicit dual variables or global weight tuning over constraints, enhancing scalability to extreme multi-objective regimes.

Limitations include dependence on reliable judge modules—incorrect constraint modules can suppress valid solutions. Hyperparameter choices (KL maximum, mixture weights in multi-task settings) affect final performance. Extension to dynamic or adversarial judge ensembles, or to policy classes with non-differentiable structure, remains open for future investigation.

References

  • Xu, T., et al. (2024). The Perfect Blend: Redefining RLHF with Mixture of Judges. arXiv:2409.20370.
