Policy Gradient-Based Update (PGU)
- Policy Gradient-Based Update (PGU) is a reinforcement learning update rule that calculates derivatives of the expected return to iteratively improve policy parameters.
- PGUs encompass variants like REINFORCE, NPG, PPO, and PGQ, effectively addressing both discrete and continuous control challenges under varying conditions.
- PGUs extend to off-policy, hybrid, and multi-agent scenarios while offering convergence guarantees and improved sample efficiency through tailored scaling and variance reduction.
A policy gradient-based update (PGU) is a generic term for update rules in reinforcement learning (RL) that use derivatives of expected return with respect to policy parameters to incrementally improve a policy. PGUs form the methodological backbone of a wide range of on-policy, off-policy, actor–critic, and hybrid RL algorithms, supporting both discrete and continuous control, as well as centralized and decentralized (multi-agent) scenarios. This article delineates the mathematical structure, algorithmic varieties, theoretical properties, and practical usage of PGUs, drawing on contemporary research across the RL literature.
1. Mathematical Formulation and Frameworks
PGUs are formally grounded in the policy gradient theorem, which asserts that for a parameterized policy $\pi_\theta$, the gradient of the expected discounted return $J(\theta)$ is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right],$$

where $d^{\pi_\theta}$ is the discounted state visitation distribution and $Q^{\pi_\theta}$ is the action-value function. Stochastic approximation instantiates this in sampled environments using gradient estimators of the form

$$\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \sum_{t} \nabla_\theta \log \pi_\theta\!\big(a_t^{(i)} \mid s_t^{(i)}\big)\, \hat{A}_t^{(i)},$$
where $\hat{A}_t$ is a generalized advantage estimator. Core PGU variants adapt this template to address the challenges of function approximation, off-policy evaluation, and stability constraints (Laroche et al., 2022, Lehnert et al., 2015, Liu et al., 2022).
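As a concrete illustration, the score-function estimator can be instantiated for a softmax policy on a toy three-armed bandit. The following NumPy sketch (all names illustrative, not from any cited work) uses the softmax identity $\nabla_\theta \log \pi_\theta(a) = e_a - \pi_\theta$ together with a running-mean baseline for variance reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_gradient(theta, arm_rewards, n_samples=100):
    """Score-function (REINFORCE) estimate of the gradient of E[R] for a softmax policy."""
    grad = np.zeros_like(theta)
    baseline = 0.0                       # running-mean baseline for variance reduction
    for i in range(n_samples):
        p = softmax(theta)
        a = rng.choice(len(theta), p=p)
        r = arm_rewards[a]
        score = -p.copy()
        score[a] += 1.0                  # grad log pi(a) = one_hot(a) - p for softmax
        grad += score * (r - baseline)
        baseline += (r - baseline) / (i + 1)
    return grad / n_samples

theta = np.zeros(3)
arm_rewards = np.array([1.0, 0.0, 0.5])  # arm 0 is optimal
for _ in range(200):
    theta += 0.5 * reinforce_gradient(theta, arm_rewards)
```

After a few hundred ascent steps the policy concentrates on the highest-reward arm; the baseline does not change the estimator's expectation, only its variance.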
In off-policy settings, especially those with linear function approximation, the stationary distribution and the Bellman operator both depend on the policy parameters $\theta$, requiring differentiation through the distribution. For instance, the PGQ (policy-gradient Q-learning) mechanism computes the gradient of the mean-squared projected Bellman error (MSPBE), incorporating correction terms to account for the policy-dependent distribution drift (Lehnert et al., 2015).
2. Structural Variants and Analytical Properties
PGUs exhibit broad structural variety, characterized by two principal axes:
- Gradient Form (Base Direction): Choices include the classic log-derivative score function, value-centric gradients, natural gradient preconditioning, and their combination with Q-learning objectives.
- Scaling Function: Multiplicative factors adjusting the gradient magnitude, such as raw advantage, exponential likelihood ratios (for importance sampling or proximal methods), or cross-entropy gradients.
A parametric framework for PGUs covers classical on-policy policy gradient, natural gradient, PPO (proximal policy optimization), self-imitation learning, maximum-likelihood-inspired variants, and others. The form and scaling axes can be formally represented as:
$$\Delta\theta \propto h(\rho, \delta)\, g_\theta,$$

with $h$ a scaling function of on/off-policy (likelihood-ratio $\rho$) and reward-error ($\delta$) arguments, and $g_\theta$ a gradient base (Gummadi et al., 2022).
A representative table synthesizes these structural differences:
| Variant | Base Direction | Scaling Function |
|---|---|---|
| REINFORCE | Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ | Advantage $\hat{A}$ |
| NPG | Preconditioned score $F^{-1}\nabla_\theta \log \pi_\theta(a \mid s)$ | Advantage $\hat{A}$ |
| PPO | Score function | Clipped likelihood ratio $\times$ advantage |
| PGQ (PGU) | TD-corrected base, see (Lehnert et al., 2015) | Bellman error components |
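The scaling-axis distinctions in the table can be made concrete with small pluggable functions. The sketch below (function names hypothetical) contrasts the raw-advantage scaling of vanilla on-policy PG with PPO's clipped-ratio scaling:

```python
import numpy as np

def pg_scale(ratio, adv):
    # vanilla on-policy PG: scale the score direction by the raw advantage
    return adv

def ppo_scale(ratio, adv, eps=0.2):
    # proximal scaling: pessimistic minimum of the importance-weighted
    # and clipped-importance-weighted advantage
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def update_direction(score, ratio, adv, scale_fn):
    # Delta-theta is proportional to h(rho, A) * grad log pi(a|s)
    return scale_fn(ratio, adv) * score

score = np.array([0.7, -0.3])            # illustrative grad log pi
step_pg = update_direction(score, 1.0, 2.0, pg_scale)
step_ppo = update_direction(score, 2.0, 1.0, ppo_scale)
```

The clipped minimum caps the incentive to move the likelihood ratio beyond $1 \pm \epsilon$ for positive advantages, while remaining a pessimistic lower bound for negative ones.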
Modified PGUs employing cross-entropy or direct parametrization can achieve markedly faster "unlearning" of past suboptimal actions than standard PG, while retaining robust monotonic improvement guarantees (Laroche et al., 2022).
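The unlearning gap can be observed on a two-armed bandit with exact gradients: a softmax policy initialized to strongly prefer the worse arm escapes slowly under the vanilla policy gradient, whose signal is proportional to the vanishing probability of the better arm, but quickly under a cross-entropy-style update toward the greedy action. A minimal sketch, with illustrative step sizes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([1.0, 0.0])                  # true action values; arm 0 is optimal

def pg_update(theta):
    p = softmax(theta)
    return p * (q - p @ q)                # exact softmax policy gradient of E[q]

def ce_update(theta):
    p = softmax(theta)
    target = np.eye(2)[np.argmax(q)]
    return target - p                     # gradient of log-likelihood of the best arm

def steps_until_best_preferred(update, lr=0.1, max_steps=100000):
    theta = np.array([-4.0, 4.0])         # heavily committed to the suboptimal arm
    for t in range(1, max_steps + 1):
        theta = theta + lr * update(theta)
        if softmax(theta)[0] > 0.5:
            return t
    return max_steps

pg_steps = steps_until_best_preferred(pg_update)
ce_steps = steps_until_best_preferred(ce_update)
```

The PG escape time scales inversely with the initial probability of the good arm (the "gravity well" effect), while the cross-entropy update moves at a roughly constant rate.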
3. Off-policy, Hybrid, and Multi-agent Extensions
Early off-policy PGUs such as GTD or TDC methods require fixed behavior policies. The PGQ class extends them with fully on-line, off-policy control, using a two-time-scale recursion for critic and actor parameters and correcting for the dependence of the stationary distribution on the policy (Lehnert et al., 2015). The update, in expectation, includes standard TD and additional "distribution drift" terms:

$$\theta_{t+1} = \theta_t + \alpha_t \left[ \delta_t \phi_t - \gamma\, \phi_{t+1} \big(\phi_t^\top w_t\big) + (\text{distribution-drift correction terms}) \right],$$

where $\delta_t$ is the TD error and $w_t$ is the secondary weight vector.
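A minimal sketch of the two-time-scale recursion, using a TDC-style update on a two-state Markov reward process with one-hot features (the full PGQ update additionally carries the policy-dependent drift corrections; step sizes here are illustrative):

```python
import numpy as np

# Two-state cyclic MRP: s0 -> s1 (reward 1), s1 -> s0 (reward 0), gamma = 0.9.
# With one-hot features the TD fixed point is V(s0) = 1/(1 - gamma^2),
# V(s1) = gamma * V(s0).
gamma = 0.9
phi = np.eye(2)
transitions = [(0, 1, 1.0), (1, 0, 0.0)]   # (state, next_state, reward)

theta = np.zeros(2)        # primary critic weights (slow time scale)
w = np.zeros(2)            # secondary correction weights (faster time scale)
alpha, beta = 0.05, 0.1    # illustrative two-time-scale step sizes

for step in range(20000):
    s, s_next, r = transitions[step % 2]
    f, f_next = phi[s], phi[s_next]
    delta = r + gamma * f_next @ theta - f @ theta      # TD error
    # TDC-style primary update: TD term plus gradient-correction term;
    # full PGQ adds distribution-drift correction terms on top of this.
    theta += alpha * (delta * f - gamma * f_next * (f @ w))
    w += beta * (delta - f @ w) * f
```

At the fixed point the TD error vanishes on every transition and the secondary weights decay to zero, so the correction term disappears.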
Hybrid approaches such as PGQL interpolate between regularized policy gradient and Q-learning, with empirical efficacy in leveraging both on- and off-policy updates (notably in large-scale benchmarks such as Atari) (O'Donoghue et al., 2016).
In decentralized and multi-agent environments, networked PGU schemes employ consensus protocols to align local policy parameters across time-varying communication graphs, using unbiased two-episode stochastic gradients. Under a Markov potential game structure, such schemes provably converge almost surely to stationary points of the global potential, with finite iteration-complexity guarantees for reaching $\epsilon$-stationarity (Aydin et al., 2024).
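The consensus mechanism can be sketched in isolation: agents mix their parameter vectors through a doubly stochastic matrix over a ring graph, then take local gradient steps (set to zero below so that only the alignment effect is visible; all constants are illustrative):

```python
import numpy as np

# Ring of 4 agents with a doubly stochastic mixing matrix:
# each agent keeps weight 0.5 on itself and 0.25 on each ring neighbor.
n_agents, dim = 4, 3
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

def networked_pg_step(params, local_grads, lr=0.05):
    # Mix parameters with neighbors, then ascend each agent's local PG estimate.
    return W @ params + lr * local_grads

rng = np.random.default_rng(0)
params = rng.normal(size=(n_agents, dim))    # disagreeing initial policies
for _ in range(200):
    params = networked_pg_step(params, np.zeros((n_agents, dim)))
spread = params.std(axis=0).max()            # disagreement across agents
```

Because $W$ is doubly stochastic with spectral gap bounded away from zero, the disagreement contracts geometrically; local gradients then steer the common iterate.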
4. Theoretical Guarantees: Convergence and Limitations
Convergence of PGUs is governed by the interplay between function approximation, the update structure, and discounting. For linear approximation and smooth (twice-differentiable) policies, PGQ and its relatives converge almost surely to stationary points of the projected Bellman error under standard two-time-scale stochastic approximation conditions (Lehnert et al., 2015). Similarly, for parametric PGUs, global monotonic improvement can be ensured under appropriately corrected cross-entropy scaling and sufficient exploration (Laroche et al., 2022).
Variance reduction (e.g., SRVR-PG, SRVR-NPG) demonstrably improves sample efficiency, with sample complexity advancing from $\mathcal{O}(\epsilon^{-4})$ for naive stochastic PG to $\mathcal{O}(\epsilon^{-3})$ or better, contingent on the effective horizon and problem conditioning (Liu et al., 2022).
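The recursive estimator underlying SRVR-style methods can be sketched on a simple quadratic objective. In actual SRVR-PG the per-step gradient differences additionally carry importance weights to correct for the changing trajectory distribution; this supervised analogue shows only the recursion itself:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, size=50)    # minimizer of the average objective: data.mean()

def grad_i(theta, i):
    # per-sample gradient of f_i(theta) = 0.5 * (theta - data[i])**2
    return theta - data[i]

def srvr_style(epochs=15, inner=10, lr=0.3):
    # Recursive variance-reduced estimator (SARAH/SRVR recursion): a full
    # gradient at each checkpoint, then cheap per-sample differences between.
    theta = 0.0
    for _ in range(epochs):
        g = float(np.mean(theta - data))           # full-gradient checkpoint
        theta_prev, theta = theta, theta - lr * g
        for _ in range(inner):
            i = rng.integers(len(data))
            g = grad_i(theta, i) - grad_i(theta_prev, i) + g
            theta_prev, theta = theta, theta - lr * g
    return theta

theta_hat = srvr_style()
```

The difference term corrects the running estimate using only one fresh sample per step, which is what drives the improved sample-complexity rates.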
However, if the state-distribution weighting is inconsistent with the objective (i.e., outer weights are omitted), the resultant update direction is not the gradient of any function (its Jacobian violates the Clairaut–Schwarz symmetry condition) and can induce convergence to globally pessimal policies (Nota et al., 2019).
5. PGUs for Control and Continuous-Time Domains
PGUs for the Linear Quadratic Regulator (LQR) and continuous-time systems possess explicit analytic formulas. The LQR cost admits closed-form policy gradients and their natural-gradient or quasi-Newton (Kleinman–Newton) refinements. State feedback gain updates can be conducted using either model-based (indirect) or sample-based (direct, e.g., covariance-parametrized) approaches (Zhao et al., 6 May 2025, Bu et al., 2020). Key properties include:
- Exponential or linear convergence rates under gradient-dominated cost structure.
- Regularization improves robustness to noise-induced uncertainty (Zhao et al., 6 May 2025).
- Natural or Gauss-Newton PGUs can interpolate or even recover classical policy improvement steps.
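The closed-form LQR expressions can be sketched numerically: the policy gradient is $\nabla C(K) = 2\big[(R + B^\top P_K B)K - B^\top P_K A\big]\Sigma_K$, where the value matrix $P_K$ and the closed-loop state covariance $\Sigma_K$ solve discrete Lyapunov equations. A minimal model-based sketch on an illustrative system (not taken from any cited work):

```python
import numpy as np

def dlyap(M, S, iters=200):
    # Fixed-point solve of X = M X M^T + S (assumes spectral radius of M < 1).
    X = S.copy()
    for _ in range(iters):
        X = M @ X @ M.T + S
    return X

def lqr_cost_and_grad(K, A, B, Q, R, Sigma0):
    Acl = A - B @ K
    P = dlyap(Acl.T, Q + K.T @ R @ K)      # value matrix P_K
    Sigma = dlyap(Acl, Sigma0)             # closed-loop state covariance Sigma_K
    cost = float(np.trace(P @ Sigma0))
    grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    return cost, grad

# Illustrative double-integrator-like system.
A = np.array([[1.0, 0.2], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)

K = np.array([[0.5, 0.5]])                 # initial stabilizing gain
c0, _ = lqr_cost_and_grad(K, A, B, Q, R, Sigma0)
for _ in range(400):
    _, g = lqr_cost_and_grad(K, A, B, Q, R, Sigma0)
    K = K - 0.002 * g                      # plain policy gradient descent on K
c_final, _ = lqr_cost_and_grad(K, A, B, Q, R, Sigma0)
```

Because the LQR cost is gradient-dominated over the set of stabilizing gains, small-step gradient descent decreases the cost monotonically and keeps the closed loop stable.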
In continuous-time RL, martingale representations relate the policy gradient to a value function PDE, with actor–critic methods alternately updating value function and policy via either batch-offline or online (martingale-orthogonal) updates (Jia et al., 2021).
6. Practical and Empirical Considerations
Empirical studies benchmark diverse PGUs across RL domains:
- On Baird's counterexample, PGQ and Greedy-GQ converge (Q-learning diverges) (Lehnert et al., 2015).
- Cross-entropy-based and direct-parametrization PGUs demonstrate acceleration in exploration-constrained deterministic chains and immunity to "gravity wells" on cliff-like MDPs (Laroche et al., 2022).
- In large-scale deep RL (Atari), hybrid PGUs (PGQL) surpass pure PG and Q-learning in both mean and median normalized scores, with reduced variance and increased data efficiency (O'Donoghue et al., 2016).
- Black-box optimization via PGUs (PBO) bridges the gap to evolutionary strategies, with neural network parameterizations subsuming traditional ES and CMA-ES on analytic test functions and chaotic system control tasks (Viquerat et al., 2021).
For stability and speed, scaling hyperparameters must be tuned according to form/scale choice; entropy regularization, batch normalization, and off-policy correction (e.g., importance weighting, explicit correction) strongly affect empirical behavior (Gummadi et al., 2022).
7. Open Problems and Future Directions
Despite the maturity of PGU theory and empirics, several open challenges persist:
- The ubiquity of "biased" PGU estimators lacking the correct discount weighting (as shown in deep RL baselines) raises foundational questions about their dynamical performance in practice (Nota et al., 2019).
- The trade-off between monotonicity and sample complexity in hybrid and cross-entropy PGUs remains active; sharp bounds for deep nonlinear function approximation are still being developed (Liu et al., 2022, Laroche et al., 2022).
- Distributed and decentralized PGUs, especially under partial observability and adversarial communication, present open questions in stability and scalability (Aydin et al., 2024).
- Extensions of continuous-time formulations to partially observed, input-constrained, or stochastic optimal control remain under exploration (Jia et al., 2021, Bu et al., 2020, Zhao et al., 6 May 2025).
A plausible implication is that modular, parameterized frameworks for PGUs will continue to unify method design across RL, supporting efficient, robust, and scalable autonomous learning. Advances in the analytical understanding of PGU convergence, especially in off-policy, high-dimensional, and nonconvex settings, remain central to further progress in RL.
References:
- "Policy Gradient Methods for Off-policy Control" (Lehnert et al., 2015)
- "Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms" (Laroche et al., 2022)
- "Is the Policy Gradient a Gradient?" (Nota et al., 2019)
- "An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods" (Liu et al., 2022)
- "Policy Gradient-based Algorithms for Continuous-time Linear Quadratic Control" (Bu et al., 2020)
- "Policy Gradient and Actor-Critic Learning in Continuous Time and Space: Theory and Algorithms" (Jia et al., 2021)
- "Policy-based optimization: single-step policy gradient method seen as an evolution strategy" (Viquerat et al., 2021)
- "Almost Sure Convergence of Networked Policy Gradient over Time-Varying Networks in Markov Potential Games" (Aydin et al., 2024)
- "A Parametric Class of Approximate Gradient Updates for Policy Optimization" (Gummadi et al., 2022)
- "Policy Gradient Adaptive Control for the LQR: Indirect and Direct Approaches" (Zhao et al., 6 May 2025)
- "Combining policy gradient and Q-learning" (O'Donoghue et al., 2016)