
PG-VRER: Variance-Reduced Policy Gradients

Updated 7 February 2026
  • The paper introduces PG-VRER, a variance reduction technique that selectively reuses past transitions to improve gradient estimation and sample efficiency.
  • It employs importance sampling and likelihood ratio clipping to maintain a bias–variance trade-off, ensuring only relevant historical samples are reused.
  • Empirical results demonstrate faster convergence and enhanced stability on benchmarks like CartPole and continuous control tasks.

Policy Gradient with Variance Reduction Experience Replay (PG-VRER) refers to a class of policy optimization algorithms in reinforcement learning that selectively reuses historical experience to reduce the variance of gradient estimation. Instead of treating all past state-action transitions equally, as in standard experience replay (ER), the VRER scheme prioritizes the most relevant samples using principled selection and importance reweighting mechanisms, yielding significantly improved sample efficiency and stable optimization (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022).

1. Conceptual Foundations

Policy gradient (PG) methods update policy parameters in the direction of an estimated gradient of the expected cumulative reward. A persistent challenge for PG algorithms is estimator variance, which slows convergence and impairs stability, especially in high-dimensional or sparse-reward settings. Experience replay traditionally addresses sample inefficiency by aggregating historical data, but uniform sampling from a replay buffer introduces bias or uncontrolled variance when the data-generating policy diverges from the current one.

PG-VRER introduces a selection mechanism for historical samples: only transitions likely to improve the current policy gradient estimate are reused. Importance sampling (IS) corrects for distribution mismatch, and the selection is governed by a bias-variance trade-off, leveraging the fact that variance from off-policy data can be reduced below the on-policy baseline if samples are sufficiently "close" in policy space (Zheng et al., 2021, Zheng et al., 5 Feb 2026, Zheng et al., 2022).

2. Mathematical Formulation

Let $\pi_\theta(a \mid s)$ be a stochastic policy parameterized by $\theta$. The standard on-policy policy gradient is

$$\nabla J(\theta) = \mathbb{E}_{(s,a)\sim d^{\pi_\theta}} \big[A^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big],$$

where $d^{\pi_\theta}$ is the state-action occupancy distribution and $A^{\pi_\theta}(s,a)$ is the advantage function.
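As a concrete illustration, here is a minimal NumPy sketch of the on-policy Monte Carlo estimator for a linear-softmax policy; all function names and the batch layout are illustrative, not from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, s_feat, a):
    """Gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta has shape (n_actions, n_features); logits = theta @ s_feat.
    d/dtheta log pi(a|s) = e_a s^T - pi(.|s) s^T.
    """
    probs = softmax(theta @ s_feat)
    grad = -np.outer(probs, s_feat)   # -pi(a'|s) * s for every action a'
    grad[a] += s_feat                 # +s for the action actually taken
    return grad

def on_policy_pg(theta, batch):
    """Monte Carlo estimate of grad J: average of A(s,a) * grad log pi(a|s)."""
    g = np.zeros_like(theta)
    for s_feat, a, adv in batch:      # batch of (state features, action, advantage)
        g += adv * grad_log_softmax(theta, s_feat, a)
    return g / len(batch)
```

With a uniform policy and equal advantages across all actions, the score terms cancel and the estimated gradient is zero, as expected for a score-function estimator.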

In PG-VRER, data collected under previous behavior policies $\{\pi_{\theta_i}\}_{i < k}$ are selectively reused to estimate $\nabla J(\theta_k)$. For each stored transition $(s^{(i,j)}, a^{(i,j)})$ generated under $\pi_{\theta_i}$, define the likelihood ratio

$$f_{i,k}(s,a) := \frac{\pi_{\theta_k}(a \mid s)}{\pi_{\theta_i}(a \mid s)}.$$

The individual likelihood-ratio (LR) estimator for each member of the reuse set is

$$\widehat{\nabla} J^{\mathrm{LR}}_{i,k} = \frac{1}{n} \sum_{j=1}^{n} f_{i,k}\big(s^{(i,j)}, a^{(i,j)}\big)\, g\big(s^{(i,j)}, a^{(i,j)} \mid \theta_k\big),$$

where $g(s,a \mid \theta_k) = A^{\pi_{\theta_k}}(s,a)\, \nabla_{\theta_k} \log \pi_{\theta_k}(a \mid s)$.
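The reweighting above reduces to a few lines of array arithmetic; the sketch below assumes the per-sample terms $g(s,a \mid \theta_k)$ and the stored log-probabilities are already available (the function names are illustrative).

```python
import numpy as np

def likelihood_ratio(logp_current, logp_behavior):
    """f_{i,k}(s,a) = pi_{theta_k}(a|s) / pi_{theta_i}(a|s), from log-probs."""
    return np.exp(logp_current - logp_behavior)

def lr_estimator(f_ratio, g_samples):
    """Individual LR estimate: (1/n) sum_j f_{i,k}(s_j, a_j) g(s_j, a_j | theta_k).

    f_ratio:   (n,)   likelihood ratios for the n stored transitions
    g_samples: (n, d) per-sample terms A(s,a) * grad log pi_k(a|s)
    """
    return (f_ratio[:, None] * g_samples).mean(axis=0)
```

When the behavior policy coincides with the current one, all ratios equal 1 and the estimator falls back to the plain on-policy sample mean.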

The final VRER estimator for iteration $k$ averages over the reuse set:

$$\widehat{\nabla} J^{R}_{k} = \frac{1}{|U_k|} \sum_{\theta_i \in U_k} \widehat{\nabla} J^{R}_{i,k}, \qquad R \in \{\mathrm{LR}, \mathrm{CLR}\},$$

where $U_k$ is the selected "reuse set" of past policies. The clipped-LR (CLR) variant truncates $f_{i,k}$ at a threshold $U_f > 1$ to control variance at the expense of additional bias (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022).
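Aggregation over the reuse set, with optional clipping, might look as follows; this is a sketch under the assumption that per-policy ratios and gradient samples have been precomputed.

```python
import numpy as np

def vrer_estimator(per_policy_f, per_policy_g, u_f=None):
    """Average the individual LR (or clipped-LR) estimates over the reuse set U_k.

    per_policy_f: list of (n_i,) ratio arrays, one per selected past policy
    per_policy_g: list of (n_i, d) arrays of g(s,a|theta_k) samples
    u_f: optional clipping threshold U_f > 1; if given, this is the CLR variant
    """
    grads = []
    for f, g in zip(per_policy_f, per_policy_g):
        if u_f is not None:
            f = np.minimum(f, u_f)      # truncate large ratios: less variance, more bias
        grads.append((f[:, None] * g).mean(axis=0))
    return np.mean(grads, axis=0)       # 1/|U_k| average over the reuse set
```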

3. Selective Experience Reuse and Selection Criteria

The core of VRER is the adaptive construction of $U_k$. At each update, a past policy $\pi_{\theta_i}$ is admitted only if the variance of its individual LR estimator does not exceed a pre-specified multiple $c > 1$ of the on-policy estimator variance:

$$\operatorname{Var}\big(\widehat{\nabla} J^{R}_{i,k}\big) \leq c\, \operatorname{Var}\big(\widehat{\nabla} J^{\mathrm{PG}}_{k}\big).$$

An efficient surrogate based on the policy KL divergence can be used instead:

$$\mathbb{E}_s \Big[\mathrm{KL}\big(\pi_{\theta_k}(\cdot \mid s)\, \big\|\, \pi_{\theta_i}(\cdot \mid s)\big)\Big] \leq \log\big(1 + (c-1)\, \zeta_k / (\zeta_k + 1)\big),$$

where $\zeta_k$ is the estimated relative variance of the current PG estimator, computable from optimizer moments (e.g., via Adam) (Zheng et al., 5 Feb 2026, Zheng et al., 2021). This ensures that only "close" historical samples, those likely to yield variance reduction, are reused.
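The KL surrogate is cheap to evaluate once per-policy mean KLs are estimated; a minimal sketch of the selection step (function names are illustrative, and the KL estimates are assumed given):

```python
import numpy as np

def kl_threshold(c, zeta_k):
    """RHS of the KL surrogate rule: log(1 + (c-1) * zeta_k / (zeta_k + 1))."""
    return np.log(1.0 + (c - 1.0) * zeta_k / (zeta_k + 1.0))

def build_reuse_set(mean_kls, c, zeta_k):
    """Return indices of past policies whose estimated E_s[KL(pi_k || pi_i)]
    falls within the surrogate bound, i.e. those admitted into U_k."""
    bound = kl_threshold(c, zeta_k)
    return [i for i, kl in enumerate(mean_kls) if kl <= bound]
```

Note that at $c = 1$ the threshold collapses to zero, so only policies indistinguishable from the current one would be reused; larger $c$ admits more history.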

4. Algorithmic Implementation

A canonical PG-VRER loop is structured as follows (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022):

  1. Data Collection: Generate $n$ transitions under $\pi_{\theta_k}$ and append them to the buffer, maintaining buffer capacity $B$.
  2. Reuse Set Construction: For each historical $\theta_i$ in the buffer, check the variance-based selection criterion and construct $U_k$ accordingly.
  3. Form Training Batch: Subsample $n_0 \ll n$ transitions from each eligible policy's transition set $T_i$ to form the aggregated training set $\widetilde{D}_k$.
  4. Offline Policy Update: For a fixed number of epochs, perform mini-batch updates using the VRER estimator computed over $\widetilde{D}_k$:

$$\theta_{k+1} \leftarrow \theta_k + \eta_k \widehat{\nabla} J^R_k$$

  5. History Management: Add $\theta_{k+1}$ to the buffer, discarding the oldest entry if capacity is exceeded.

These steps form the backbone for augmenting standard actor-critic, TRPO, and PPO methods with VRER; the enhancements are largely orthogonal to the details of the base optimization method. The mixture-likelihood-ratio estimator and per-transition importance weighting can also be adapted to partial-trajectory or per-step granularity (Zheng et al., 2022).

5. Bias–Variance Analysis and Convergence Guarantees

VRER makes a fundamental bias–variance trade-off explicit:

  • The variance of the VRER estimator is theoretically reduced by a factor of $|U_k|$ (the size of the reuse set), up to a factor $c$ determined by the selection rule and inter-policy correlations.
  • Bias is introduced by policy drift (reuse of outdated data), per-step Markovian dependence, and likelihood-ratio clipping; these effects are tightly controlled by the selection constants, the buffer size $B$, and the reuse threshold $c$:

$$\big\| \mathbb{E}\big[\widehat{\nabla} J^R_k\big] - \nabla J(\theta_k) \big\| \leq C_3$$

for a constant $C_3$ determined by these quantities.

A finite-time convergence theorem shows that, under uniform ergodicity and smoothness assumptions, with learning rate $\eta_k = \eta_1 k^{-r}$, one achieves

$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\|\nabla J(\theta_k)\|^2 = \mathcal{O}\big(K^{-(1-r)}\big) + \mathcal{O}\big(K^{-r}\big) + \mathcal{O}\!\left(\frac{B_K + \log K}{K^{r}}\right)$$

where $B_K$ is the buffer size and $r \in (0,1)$. The buffer size trades bias for variance reduction, and the optimal $c$ is typically close to $1$ ($1.02$–$1.06$ in empirical studies) (Zheng et al., 5 Feb 2026, Zheng et al., 2021).

6. Empirical Properties and Applications

PG-VRER has been implemented in discrete and continuous control benchmarks (CartPole-v1, InvertedPendulum, Hopper, LunarLander), typically as an augmentation to actor-critic, PPO, or TRPO. Key observations include:

  • VRER variants achieve $1.5\times$–$2\times$ faster convergence compared to baselines.
  • Median gradient-variance reductions are in the $20\%$–$30\%$ range, with run-to-run performance improvements robust to the buffer size and to $c$.
  • On CartPole-v1, final returns improve by up to $+100\%$ (A2C-VRER), $+40\%$ (PPO-VRER), and $+3\%$ (TRPO-VRER); comparable improvements are observed on continuous tasks.

Partial-trajectory (per-step) VRER also shows marked acceleration in more complex settings and is robust to variations in $c$ across $[1.2, 2.0]$ (Zheng et al., 2022). Actor-critic and PPO architectures with VRER demonstrate stable value estimation and accelerated learning, confirmed through detailed ablations (Zheng et al., 5 Feb 2026, Zheng et al., 2022, Zheng et al., 2021).

7. Theoretical Connections and Extensions

PG-VRER is strictly more general than classical experience replay, as it adaptively filters based on principled statistical tests. The framework aligns with doubly robust off-policy evaluation (Huang et al., 2019), mixture importance sampling (Zheng et al., 2022), and parameter-based exploration with variance-minimizing baselines (Zhao et al., 2013). VRER can be tuned to enforce sample-efficient trust regions (see gradient-truncation variants (Zhang et al., 2021)), can be combined with Monte Carlo critics or model-based rollouts, and extends seamlessly to overparameterized policy classes, exploiting hidden convexity for optimal sample complexity.

The approach has motivated rigorous analyses of sample dependence in off-policy RL algorithms, yielding a new understanding of the interplay between Markovian dynamics, policy drift, and estimator variance (Zheng et al., 5 Feb 2026, Zheng et al., 2021).


References:

  • (Zheng et al., 5 Feb 2026) "Variance Reduction Based Experience Replay for Policy Optimization"
  • (Zheng et al., 2021) "Variance Reduction based Experience Replay for Policy Optimization"
  • (Zheng et al., 2022) "Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization"
  • (Huang et al., 2019) "From Importance Sampling to Doubly Robust Policy Gradient"
  • (Zhao et al., 2013) "Efficient Sample Reuse in Policy Gradients with Parameter-based Exploration"
  • (Zhang et al., 2021) "On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method"
