
PG-VRER: Variance-Reduced Policy Gradients

Updated 7 February 2026
  • The paper introduces PG-VRER, a variance reduction technique that selectively reuses past transitions to improve gradient estimation and sample efficiency.
  • It employs importance sampling and likelihood ratio clipping to maintain a bias–variance trade-off, ensuring only relevant historical samples are reused.
  • Empirical results demonstrate faster convergence and enhanced stability on benchmarks like CartPole and continuous control tasks.

Policy Gradient with Variance Reduction Experience Replay (PG-VRER) refers to a class of policy optimization algorithms in reinforcement learning that selectively reuses historical experience to reduce the variance of gradient estimation. Instead of treating all past state-action transitions equally, as in standard experience replay (ER), the VRER scheme prioritizes the most relevant samples using principled selection and importance reweighting mechanisms, yielding significantly improved sample efficiency and stable optimization (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022).

1. Conceptual Foundations

Policy gradient (PG) methods update policy parameters in the direction of an estimated gradient of the expected cumulative reward. A persistent challenge for PG algorithms is estimator variance, which slows convergence and impairs stability, especially in high-dimensional or sparse-reward settings. Experience replay traditionally addresses sample inefficiency by aggregating historical data, but uniform sampling from a replay buffer introduces bias or uncontrolled variance when the data-generating policy diverges from the current one.

PG-VRER introduces a selection mechanism for historical samples: only transitions likely to improve the current policy gradient estimate are reused. Importance sampling (IS) corrects for distribution mismatch, and the selection is governed by a bias-variance trade-off, leveraging the fact that variance from off-policy data can be reduced below the on-policy baseline if samples are sufficiently "close" in policy space (Zheng et al., 2021, Zheng et al., 5 Feb 2026, Zheng et al., 2022).

2. Mathematical Formulation

Let $\pi_\theta(a \mid s)$ be a stochastic policy parameterized by $\theta$. The standard on-policy policy gradient is

$$\nabla J(\theta) = \mathbb{E}_{(s,a)\sim d^{\pi_\theta}} \big[A^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big],$$

where $d^{\pi_\theta}$ is the state-action occupancy distribution and $A^{\pi_\theta}(s,a)$ is the advantage function.
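As a concrete illustration, here is a minimal NumPy sketch of the on-policy Monte Carlo estimator for a linear-softmax policy; all function names and the batch layout are illustrative, not from the paper.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, s_feat, a):
    """Gradient of log pi_theta(a|s) for a linear-softmax policy.

    theta has shape (n_actions, n_features); logits = theta @ s_feat.
    d/dtheta log pi(a|s) = e_a s^T - pi(.|s) s^T.
    """
    probs = softmax(theta @ s_feat)
    grad = -np.outer(probs, s_feat)   # -pi(a'|s) * s for every action a'
    grad[a] += s_feat                 # +s for the action actually taken
    return grad

def on_policy_pg(theta, batch):
    """Monte Carlo estimate of grad J: average of A(s,a) * grad log pi(a|s)."""
    g = np.zeros_like(theta)
    for s_feat, a, adv in batch:      # batch of (state features, action, advantage)
        g += adv * grad_log_softmax(theta, s_feat, a)
    return g / len(batch)
```

With a uniform policy and equal advantages across all actions, the score terms cancel and the estimated gradient is zero, as expected for a score-function estimator.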

In PG-VRER, data collected under previous behavior policies $\{\pi_{\theta_i}\}_{i < k}$ are selectively reused to estimate $\nabla J(\theta_k)$. For each stored transition $(s^{(i,j)}, a^{(i,j)})$ generated under $\pi_{\theta_i}$, define the likelihood ratio

$$f_{i,k}(s,a) := \frac{\pi_{\theta_k}(a \mid s)}{\pi_{\theta_i}(a \mid s)}.$$

The individual likelihood-ratio (LR) estimator for each member of the reuse set is

$$\widehat{\nabla} J^{\mathrm{LR}}_{i,k} = \frac{1}{n} \sum_{j=1}^{n} f_{i,k}\big(s^{(i,j)}, a^{(i,j)}\big)\, g\big(s^{(i,j)}, a^{(i,j)} \mid \theta_k\big),$$

where $g(s,a \mid \theta_k) = A^{\pi_{\theta_k}}(s,a)\, \nabla_{\theta_k} \log \pi_{\theta_k}(a \mid s)$.
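The reweighting above reduces to a few lines of array arithmetic; the sketch below assumes the per-sample terms $g(s,a \mid \theta_k)$ and the stored log-probabilities are already available (the function names are illustrative).

```python
import numpy as np

def likelihood_ratio(logp_current, logp_behavior):
    """f_{i,k}(s,a) = pi_{theta_k}(a|s) / pi_{theta_i}(a|s), from log-probs."""
    return np.exp(logp_current - logp_behavior)

def lr_estimator(f_ratio, g_samples):
    """Individual LR estimate: (1/n) sum_j f_{i,k}(s_j, a_j) g(s_j, a_j | theta_k).

    f_ratio:   (n,)   likelihood ratios for the n stored transitions
    g_samples: (n, d) per-sample terms A(s,a) * grad log pi_k(a|s)
    """
    return (f_ratio[:, None] * g_samples).mean(axis=0)
```

When the behavior policy coincides with the current one, all ratios equal 1 and the estimator falls back to the plain on-policy sample mean.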

The final VRER estimator for iteration $k$ averages over the reuse set:

$$\widehat{\nabla} J^{R}_{k} = \frac{1}{|U_k|} \sum_{\theta_i \in U_k} \widehat{\nabla} J^{R}_{i,k}, \qquad R \in \{\mathrm{LR}, \mathrm{CLR}\},$$

where $U_k$ is the selected "reuse set" of past policies. The clipped-LR (CLR) variant truncates $f_{i,k}$ at a threshold $U_f > 1$ to control variance at the expense of additional bias (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022).
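Aggregation over the reuse set, with optional clipping, might look as follows; this is a sketch under the assumption that per-policy ratios and gradient samples have been precomputed.

```python
import numpy as np

def vrer_estimator(per_policy_f, per_policy_g, u_f=None):
    """Average the individual LR (or clipped-LR) estimates over the reuse set U_k.

    per_policy_f: list of (n_i,) ratio arrays, one per selected past policy
    per_policy_g: list of (n_i, d) arrays of g(s,a|theta_k) samples
    u_f: optional clipping threshold U_f > 1; if given, this is the CLR variant
    """
    grads = []
    for f, g in zip(per_policy_f, per_policy_g):
        if u_f is not None:
            f = np.minimum(f, u_f)      # truncate large ratios: less variance, more bias
        grads.append((f[:, None] * g).mean(axis=0))
    return np.mean(grads, axis=0)       # 1/|U_k| average over the reuse set
```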

3. Selective Experience Reuse and Selection Criteria

The core of VRER is the adaptive construction of $U_k$. At each update, a past policy $\pi_{\theta_i}$ is admitted only if the variance of its individual LR estimator does not exceed a pre-specified multiple $c > 1$ of the on-policy estimator variance:

$$\operatorname{Var}\big(\widehat{\nabla} J^{R}_{i,k}\big) \leq c\, \operatorname{Var}\big(\widehat{\nabla} J^{\mathrm{PG}}_{k}\big).$$

An efficient surrogate based on the policy KL divergence can be used instead:

$$\mathbb{E}_s \Big[\mathrm{KL}\big(\pi_{\theta_k}(\cdot \mid s)\, \big\|\, \pi_{\theta_i}(\cdot \mid s)\big)\Big] \leq \log\big(1 + (c-1)\, \zeta_k / (\zeta_k + 1)\big),$$

where $\zeta_k$ is the estimated relative variance of the current PG estimator, computable from optimizer moments (e.g., via Adam) (Zheng et al., 5 Feb 2026, Zheng et al., 2021). This ensures that only "close" historical samples, those likely to yield variance reduction, are reused.
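The KL surrogate is cheap to evaluate once per-policy mean KLs are estimated; a minimal sketch of the selection step (function names are illustrative, and the KL estimates are assumed given):

```python
import numpy as np

def kl_threshold(c, zeta_k):
    """RHS of the KL surrogate rule: log(1 + (c-1) * zeta_k / (zeta_k + 1))."""
    return np.log(1.0 + (c - 1.0) * zeta_k / (zeta_k + 1.0))

def build_reuse_set(mean_kls, c, zeta_k):
    """Return indices of past policies whose estimated E_s[KL(pi_k || pi_i)]
    falls within the surrogate bound, i.e. those admitted into U_k."""
    bound = kl_threshold(c, zeta_k)
    return [i for i, kl in enumerate(mean_kls) if kl <= bound]
```

Note that at $c = 1$ the threshold collapses to zero, so only policies indistinguishable from the current one would be reused; larger $c$ admits more history.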

4. Algorithmic Implementation

A canonical PG-VRER loop is structured as follows (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022):

  1. Data Collection: Generate $n$ transitions under $\pi_{\theta_k}$ and append them to the buffer, maintaining buffer capacity $B$.
  2. Reuse Set Construction: For each historical $\theta_i$ in the buffer, check the variance-based selection criterion and construct $U_k$ accordingly.
  3. Form Training Batch: Subsample $n_0 \ll n$ transitions from each eligible policy's transition set $T_i$ to form the aggregated training set $\widetilde{D}_k$.
  4. Offline Policy Update: For a fixed number of epochs, perform mini-batch updates using the VRER estimator computed over $\widetilde{D}_k$:

$$\theta_{k+1} \leftarrow \theta_k + \eta_k \widehat{\nabla} J^R_k$$

  5. History Management: Add $\theta_{k+1}$ to the buffer, discarding the oldest entry if capacity is exceeded.

These steps form the backbone for augmenting standard actor-critic, TRPO, and PPO methods with VRER; the enhancements are largely orthogonal to the details of the base optimization method. The mixture-likelihood-ratio estimator and per-transition importance weighting can also be adapted to partial-trajectory or per-step granularity (Zheng et al., 2022).

5. Bias–Variance Analysis and Convergence Guarantees

VRER makes a fundamental bias–variance trade-off explicit:

  • The variance of the VRER estimator is theoretically reduced by a factor of $|U_k|$ (the size of the reuse set), up to a factor $c$ determined by the selection rule and inter-policy correlations.
  • Bias is introduced by policy drift (reuse of outdated data), per-step Markovian dependence, and likelihood-ratio clipping; these effects are tightly controlled by the selection constants, the buffer size $B$, and the reuse threshold $c$:

$$\big\| \mathbb{E}\big[\widehat{\nabla} J^R_k\big] - \nabla J(\theta_k) \big\| \leq C_3$$

for a constant $C_3$ determined by these quantities.

A finite-time convergence theorem shows that, under uniform ergodicity and smoothness assumptions, with learning rate $\eta_k = \eta_1 k^{-r}$, one achieves

$$\frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\|\nabla J(\theta_k)\|^2 = \mathcal{O}\big(K^{-(1-r)}\big) + \mathcal{O}\big(K^{-r}\big) + \mathcal{O}\!\left(\frac{B_K + \log K}{K^{r}}\right)$$

where $B_K$ is the buffer size and $r \in (0,1)$. The buffer size trades bias for variance reduction, and the optimal $c$ is typically close to $1$ ($1.02$–$1.06$ in empirical studies) (Zheng et al., 5 Feb 2026, Zheng et al., 2021).

6. Empirical Properties and Applications

PG-VRER has been implemented in discrete and continuous control benchmarks (CartPole-v1, InvertedPendulum, Hopper, LunarLander), typically as an augmentation to actor-critic, PPO, or TRPO. Key observations include:

  • VRER variants achieve $1.5\times$–$2\times$ faster convergence compared to baselines.
  • Median gradient-variance reductions are in the $20\%$–$30\%$ range, with run-to-run performance improvements robust to the buffer size and to $c$.
  • On CartPole-v1, final returns improve by up to $+100\%$ (A2C-VRER), $+40\%$ (PPO-VRER), and $+3\%$ (TRPO-VRER); comparable improvements are observed on continuous tasks.

Partial-trajectory (per-step) VRER also shows marked acceleration in more complex settings and is robust to variations in $c$ across $[1.2, 2.0]$ (Zheng et al., 2022). Actor-critic and PPO architectures with VRER demonstrate stable value estimation and accelerated learning, confirmed through detailed ablations (Zheng et al., 5 Feb 2026, Zheng et al., 2022, Zheng et al., 2021).

7. Theoretical Connections and Extensions

PG-VRER is strictly more general than classical experience replay, as it adaptively filters based on principled statistical tests. The framework aligns with doubly robust off-policy evaluation (Huang et al., 2019), mixture importance sampling (Zheng et al., 2022), and parameter-based exploration with variance-minimizing baselines (Zhao et al., 2013). VRER can be tuned to enforce sample-efficient trust regions (see gradient-truncation variants (Zhang et al., 2021)), can be combined with Monte Carlo critics or model-based rollouts, and extends seamlessly to overparameterized policy classes, exploiting hidden convexity for optimal sample complexity.

The approach has motivated rigorous analyses of sample dependence in off-policy RL algorithms, yielding a new understanding of the interplay between Markovian dynamics, policy drift, and estimator variance (Zheng et al., 5 Feb 2026, Zheng et al., 2021).


References:

  • (Zheng et al., 5 Feb 2026) "Variance Reduction Based Experience Replay for Policy Optimization"
  • (Zheng et al., 2021) "Variance Reduction based Experience Replay for Policy Optimization"
  • (Zheng et al., 2022) "Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization"
  • (Huang et al., 2019) "From Importance Sampling to Doubly Robust Policy Gradient"
  • (Zhao et al., 2013) "Efficient Sample Reuse in Policy Gradients with Parameter-based Exploration"
  • (Zhang et al., 2021) "On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method"
