
Penalized PPO (P3O) Algorithms

Updated 9 February 2026
  • P3O is a family of policy optimization algorithms that extend PPO by incorporating explicit penalty terms (e.g., KL divergences) to regulate policy updates.
  • Its design improves stability and performance across applications like safe RL, fairness-aware multi-agent systems, and RLHF by controlling update sizes and enforcing constraints.
  • Adaptive tuning of penalty coefficients in P3O facilitates effective trade-offs between exploration and exploitation while ensuring convergence to constraint-satisfying solutions.

Penalized Proximal Policy Optimization (P3O) encompasses a family of policy optimization algorithms that augment the original Proximal Policy Optimization (PPO) framework with explicit penalty terms. These penalties—most commonly based on Kullback–Leibler (KL) divergences or application-specific objectives—control the size of policy updates, enforce constraints, or regularize learning dynamics. P3O variants arise in distinct domains, including classic control, reinforcement learning from human feedback (RLHF), fairness-aware multi-agent systems, and safe/constrained RL. The central methodology replaces or extends the PPO surrogate with additional penalty components, yielding unconstrained yet robust optimization in both theoretical and practical contexts.

1. Core Penalized Surrogate Objectives

Penalized PPO modifies the classic PPO surrogate, replacing or supplementing the clipped-importance-ratio term with explicit penalties. The canonical penalized objective introduced by Schulman et al. is

L^{\mathrm{KLPEN}}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\rm old}}(a_t \mid s_t)} \,\hat{A}_t - \beta \, D_{\mathrm{KL}}\!\left[ \pi_{\theta_{\rm old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right] \right]

where β > 0 is the penalty coefficient and Â_t is an (optionally GAE-computed) advantage estimate. The KL term acts as a "soft" trust region by penalizing large deviations in policy space, in contrast to TRPO's hard KL constraint. This "soft constraint" framework underlies all classical penalized PPO variants (Schulman et al., 2017, Hsu et al., 2020).
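As a concrete sketch of this objective for a discrete-action policy, the surrogate can be computed from batched action distributions (a minimal NumPy illustration; function and variable names are mine, not from any of the cited papers):

```python
import numpy as np

def kl_penalized_surrogate(probs_old, probs_new, actions, advantages, beta=0.01):
    """L^KLPEN for discrete actions (a quantity to be maximized).

    probs_old, probs_new: (T, n_actions) action distributions under the
    behavior and current policies; actions: (T,) actions taken;
    advantages: (T,) advantage estimates; beta: KL penalty coefficient.
    """
    idx = np.arange(len(actions))
    # Importance ratio rho_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = probs_new[idx, actions] / probs_old[idx, actions]
    # Per-state KL(pi_old || pi_new), matching the direction in L^KLPEN
    kl = np.sum(probs_old * (np.log(probs_old) - np.log(probs_new)), axis=1)
    return np.mean(ratio * advantages - beta * kl)
```

When the two policies coincide, the ratio is 1 and the KL term vanishes, so the surrogate reduces to the mean advantage, which makes the behavior easy to sanity-check.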

Subsequent P3O extensions embed penalized surrogates in a variety of settings, including safe/constrained RL, fairness-aware multi-agent systems, and RLHF, as detailed in the sections below.

2. Algorithmic Design and Pseudocode

A prototypical P3O update cycle proceeds as follows (Schulman et al., 2017, Hsu et al., 2020):

  1. Collect rollouts with the current policy, treated as the "old" policy π_θ_old for this update.
  2. Compute advantage estimates (e.g., by GAE).
  3. Optimize the penalized surrogate for several epochs/minibatches:
    • Compute the standard policy-gradient term and the penalty (KL or task-specific).
    • Optionally tune or adapt β to maintain a target divergence.
    • Apply policy- and value-network parameter updates with SGD.
  4. Update the reference policy: θ_old ← θ.

For KL-based P3O, a typical pseudocode (after Schulman et al.) is:

for iteration in range(...):
    # 1. Collect trajectory data under π_θ_old
    # 2. Compute advantages
    for epoch in range(K):
        for minibatch in batches:
            L_pi = mean[ ρ_t(θ) * Â_t − β * KL(π_θ_old(·|s_t) || π_θ(·|s_t)) ]
            L_v  = mean[ (V_φ(s_t) − R_t)^2 ]
            total_loss = −L_pi + c1 * L_v − c2 * L_ent
            θ ← θ − α_π ∇_θ total_loss
            φ ← φ − α_φ ∇_φ L_v
    # 3. Adapt β if avg_KL is outside the target range
    θ_old ← θ; φ_old ← φ

P3O implementations in safe RL, fairness-aware RL, and RLHF inject additional penalty computations and sometimes incorporate trajectory-level or pairwise sampling schemes (Wu et al., 2023, Hazra et al., 11 Sep 2025).

3. Penalty Tuning, Adaptation, and Theoretical Guarantees

Penalty coefficients critically modulate the trade-off between policy improvement and update regularization:

  • For KL penalties, adaptive schedules monitor the empirical average KL divergence per iteration and multiplicatively adjust β to maintain a target value (e.g., increase β if KL > 1.5 × target, decrease it if KL < 0.5 × target) (Schulman et al., 2017).
  • In multi-constraint/safe RL, penalty factors (e.g., κ for cost constraints) are set as bounds on dual multipliers, rendering the penalty formulation exact—i.e., with suitable κ, the penalized unconstrained problem admits the same minimizer as the original constrained objective (Zhang et al., 2022).
  • Sigmoid-based "soft clipping" introduces a temperature/scale parameter τ and a KL penalty weight β; tuning these is essential for controlling the exploration–conservatism trade-off (Chen et al., 2022).
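The multiplicative KL schedule described above can be sketched as follows (thresholds and the factor of 2 follow the rule stated in the text; the function name is mine):

```python
def adapt_beta(beta, avg_kl, target_kl, factor=2.0):
    """Multiplicative beta adaptation: raise the penalty when the observed
    average KL overshoots the target, lower it when updates are too
    conservative, and otherwise leave beta unchanged."""
    if avg_kl > 1.5 * target_kl:
        beta *= factor
    elif avg_kl < 0.5 * target_kl:
        beta /= factor
    return beta
```

Called once per iteration after measuring the empirical KL, this keeps updates near the target divergence without hand-tuning β.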

Theoretical analyses of P3O variants establish exactness of the penalty formulation—with sufficiently large penalty weights, the unconstrained penalized problem shares its solution with the original constrained one—and suboptimality bounds expressed in terms of KL divergence, penalty weight, and barrier parameterization (Zhang et al., 2022, Hazra et al., 11 Sep 2025).

4. Connections to PPO, TRPO, and Alternative Surrogates

P3O generalizes PPO and connects to TRPO and other regularized policy gradient schemes:

  • Clipped PPO utilizes a min operation to restrict the importance-sampling ratio ρ_t to [1 − ε, 1 + ε], yielding implicit step-size regularization. However, clipped PPO features zones of zero gradient for out-of-band ratios and can fail in discontinuous reward settings or high-dimensional action spaces (Hsu et al., 2020, Chen et al., 2022).
  • KL-penalized PPO (P3O) applies a smooth, differentiable penalty across the full parameter space, providing a "restorative" force that pulls updates back toward the behavior policy. This prevents catastrophic updates in challenging regimes (Schulman et al., 2017, Hsu et al., 2020).
  • Soft-clipping P3O replaces the hard min{} in the surrogate with a sigmoid in the ratio domain, creating a continuous gradient everywhere and greatly expanding the accessible policy region. Under the DEON off-policyness metric, soft-P3O demonstrates broader exploration beyond PPO’s restricted update region (Chen et al., 2022).
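The contrast between the hard clip and a sigmoid soft clip can be illustrated numerically. The sigmoid parameterization below is an illustrative choice with temperature τ, not the exact form from Chen et al.:

```python
import math

def hard_clip(rho, eps=0.2):
    """PPO's hard clip: the ratio is pinned to [1-eps, 1+eps], so the
    gradient is exactly zero once rho leaves that band."""
    return max(1.0 - eps, min(1.0 + eps, rho))

def soft_clip(rho, eps=0.2, tau=0.1):
    """Sigmoid-based soft clip (illustrative form): smoothly squashes rho
    into (1-eps, 1+eps), leaving a nonzero gradient everywhere and
    mapping rho = 1 to exactly 1."""
    s = 1.0 / (1.0 + math.exp(-(rho - 1.0) / tau))
    return 1.0 + 2.0 * eps * (s - 0.5)
```

Both functions agree near ρ = 1, but only the soft version continues to pass gradient information when the ratio is far outside the clip band, which is the mechanism behind the broader exploration claim.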

5. Empirical Results and Comparative Performance

Empirical studies across canonical and specialized domains have established key performance characteristics:

  • On MuJoCo and Atari benchmarks, penalized PPO (KL-penalty) matches or exceeds TRPO in sample efficiency and stability, with clipped PPO slightly more robust to hyperparameters but P3O matching or exceeding performance when tuned (Schulman et al., 2017, Hsu et al., 2020).
  • Soft-clipping P3O (sigmoid surrogate) achieves higher CPI objective values, greater exploratory coverage (DEON), and higher final returns than standard PPO, especially in continuous control and large discrete action settings (Chen et al., 2022).
  • In fairness-aware RL, P3O reduces demographic parity and conditional statistical parity disparities by up to ∼60%, albeit with a commensurate reduction in mean reward ("price of fairness"); both attribute groups experience similar losses, indicating systemic policy adjustment (Malfa et al., 6 Feb 2025).
  • Safe/constrained RL settings show P3O and incremental-penalized variants (e.g., IP3O with CELU barriers) outperforming state-of-the-art safe RL baselines for joint cost satisfaction and sustained reward, with formal suboptimality bounds as a function of KL divergence, penalty weight, and barrier parameterization (Zhang et al., 2022, Hazra et al., 11 Sep 2025).
  • In RLHF, pairwise-P3O and pessimistic-P3O exhibit invariance to reward shifts (avoiding PPO’s sensitivity), superior KL–reward trade-off, and robust mitigation of reward hacking, validated by human and GPT-4 proxy win-rates (Wu et al., 2023, Gupta et al., 10 Mar 2025).

Variant | Core Penalty | Application | Empirical Notes
KL-P3O | KL divergence | Classic RL, TRPO regime | Matches/exceeds TRPO; robust in discontinuous settings
Soft-clip P3O | Sigmoid + KL | High-dim, off-policy | Greater exploration, higher CPI, lower variance
Fairness P3O | Outcome/value disparity | Multi-agent, fairness | Up to 60% disparity reduction; trade-off with total reward
Safe P3O/IP3O | Cost barrier/penalty | CMDP, safe RL | Satisfies multiple constraints, stable gradients, few violations
Pairwise/Pessimistic P3O | Reward difference or LCB | RLHF, LLM alignment | Reward-shift invariant, strong KL–reward trade-off, mitigates hacking
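The exact-penalty construction behind the safe-RL rows above can be illustrated for a single cost constraint. This is a minimal sketch: the ReLU form follows the penalty described for Zhang et al.'s variant, and κ stands in for an assumed upper bound on the optimal dual multiplier:

```python
def penalized_objective(reward_surrogate, expected_cost, cost_limit, kappa):
    """Exact-penalty objective for a single-constraint CMDP: the reward
    surrogate minus kappa * ReLU(J_c - d). With kappa above the optimal
    dual multiplier, maximizing this unconstrained quantity recovers the
    constrained solution."""
    violation = max(0.0, expected_cost - cost_limit)  # ReLU cost overshoot
    return reward_surrogate - kappa * violation
```

Inside the feasible region the penalty contributes nothing, so the objective coincides with ordinary PPO; outside it, a large κ makes any violation dominate the reward term.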

6. Extensions: Fairness, Safety, RLHF, and Off-Policy Integration

P3O supports diverse extensions:

  • Fairness-aware regularization: Penalizes return/value disparities under multiple metrics—demographic, counterfactual, conditional statistical parity—augmented into the PPO surrogate (Malfa et al., 6 Feb 2025).
  • Safe RL / CMDP: P3O and IP3O facilitate multi-constraint enforcement via exact penalty (ReLU, barrier, CELU), supporting both decentralized multi-agent and centralized critics (Zhang et al., 2022, Hazra et al., 11 Sep 2025).
  • RLHF / preference optimization: P3O supports trajectory-wise, reward-shift-invariant surrogates (pairwise or pessimistic lower-confidence bounds), eliminating value-network complexity and achieving monotonic improvement under weak reward identification (Wu et al., 2023, Gupta et al., 10 Mar 2025).
  • Off-policy efficiency: Policy-on Policy-off Policy Optimization (P3O) interleaves on-policy and off-policy batches with adaptive clipping and KL penalty, controlled by the effective sample size between policies. This bridges high sample efficiency of off-policy data with PPO’s stability (Fakoor et al., 2019).
  • Multi-agent scalability: P3O naturally extends to Dec-POMDPs, allowing each agent to optimize a penalized surrogate using local observations and global constraints (Zhang et al., 2022).
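The effective-sample-size statistic used by policy-on/policy-off P3O to weigh off-policy data can be computed from importance ratios. The normalized ESS formula below is the standard one; its precise role inside Fakoor et al.'s update rule is only gestured at here:

```python
import numpy as np

def effective_sample_size(ratios):
    """Normalized ESS of importance weights rho_t = pi_new / pi_old.
    Returns a value in (0, 1]: near 1 when the two policies agree
    (weights roughly uniform), near 0 when a few samples dominate."""
    w = np.asarray(ratios, dtype=float)
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())
```

A low ESS signals that the behavior and current policies have drifted apart, which is exactly when clipping and KL penalties should be tightened.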

7. Practical Implementation Guidelines

Robust P3O deployment involves:

  • Choosing batch size, epoch count, and learning rates as in PPO (e.g., 2048 steps per iteration, batch size 64, 10 epochs, 3×10⁻⁴ policy LR).
  • Using GAE (λ ≈ 0.95) to reduce variance.
  • Gradient norm clipping (e.g., 0.5) to stabilize updates.
  • Monitoring and adaptively tuning penalty weights (e.g., β for KL), especially in dynamic or non-stationary regimes.
  • For fairness or constrained tasks, sweeping the penalty hyperparameters (λ_ret, λ_pros, κ, η) to obtain desired Pareto trade-offs between performance and constraint satisfaction.
  • For RLHF, leveraging batchwise trajectory sampling, maintaining equivalence in data volume to compare with PPO, and using low KL penalties for rapid improvement without significant distributional shift.
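The GAE computation recommended above can be sketched as a standard backward recursion over a single rollout (variable names are mine; the value after the final step is bootstrapped as zero):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    rewards, values, dones are same-length sequences; adv[t] accumulates
    discounted TD residuals delta_t, resetting at episode boundaries."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0  # bootstrap 0 at rollout end
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_v * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv
```

With γ = λ = 1 the recursion reduces to the undiscounted return-minus-value baseline, which gives a quick correctness check.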

References

  • "Proximal Policy Optimization Algorithms" (Schulman et al., 2017)
  • "Revisiting Design Choices in Proximal Policy Optimization" (Hsu et al., 2020)
  • "Penalized Proximal Policy Optimization for Safe Reinforcement Learning" (Zhang et al., 2022)
  • "The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure" (Chen et al., 2022)
  • "P3O: Policy-on Policy-off Policy Optimization" (Fakoor et al., 2019)
  • "Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment" (Wu et al., 2023)
  • "Mitigating Preference Hacking in Policy Optimization with Pessimism" (Gupta et al., 10 Mar 2025)
  • "Fairness Aware Reinforcement Learning via Proximal Policy Optimization" (Malfa et al., 6 Feb 2025)
  • "Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning" (Hazra et al., 11 Sep 2025)
