PG-VRER: Variance-Reduced Policy Gradients
- The paper introduces PG-VRER, a variance reduction technique that selectively reuses past transitions to improve gradient estimation and sample efficiency.
- It employs importance sampling and likelihood ratio clipping to maintain a bias–variance trade-off, ensuring only relevant historical samples are reused.
- Empirical results demonstrate faster convergence and enhanced stability on benchmarks like CartPole and continuous control tasks.
Policy Gradient with Variance Reduction Experience Replay (PG-VRER) refers to a class of policy optimization algorithms in reinforcement learning that selectively reuses historical experience to reduce the variance of gradient estimation. Instead of treating all past state-action transitions equally, as in standard experience replay (ER), the VRER scheme prioritizes the most relevant samples using principled selection and importance reweighting mechanisms, yielding significantly improved sample efficiency and stable optimization (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022).
1. Conceptual Foundations
Policy gradient (PG) methods update policy parameters in the direction of an estimated gradient of the expected cumulative reward. A persistent challenge for PG algorithms is estimator variance, which slows convergence and impairs stability, especially in high-dimensional or sparse-reward settings. Experience replay traditionally addresses sample inefficiency by aggregating historical data, but uniform sampling from a replay buffer introduces bias or uncontrolled variance when the data-generating policy diverges from the current one.
PG-VRER introduces a selection mechanism for historical samples: only transitions likely to improve the current policy gradient estimate are reused. Importance sampling (IS) corrects for distribution mismatch, and the selection is governed by a bias-variance trade-off, leveraging the fact that variance from off-policy data can be reduced below the on-policy baseline if samples are sufficiently "close" in policy space (Zheng et al., 2021, Zheng et al., 5 Feb 2026, Zheng et al., 2022).
2. Mathematical Formulation
Let $\pi_\theta$ be a stochastic policy parameterized by $\theta$. The standard on-policy policy-gradient objective is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{(s,a)\sim\rho^{\pi_\theta}}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a)\big],$$

where $\rho^{\pi_\theta}$ is the state-action occupancy distribution and $A^{\pi_\theta}$ is the advantage function.
In PG-VRER, data collected under previous behavior policies $\pi_{\theta_1},\dots,\pi_{\theta_{k-1}}$ are selectively reused to estimate $\nabla_\theta J(\theta_k)$. For each stored transition $(s,a)$ from policy $\pi_{\theta_i}$, define the likelihood ratio:

$$w_i(s,a) = \frac{\pi_{\theta_k}(a \mid s)}{\pi_{\theta_i}(a \mid s)}.$$

The individual likelihood-ratio (LR) estimator for each reuse set member $i$ is:

$$g_i(\theta_k) = \frac{1}{n}\sum_{(s,a)\in\mathcal{B}_i} w_i(s,a)\,\nabla_\theta \log \pi_{\theta_k}(a \mid s)\,\hat{A}(s,a),$$

where $\mathcal{B}_i$ is the batch of $n$ transitions collected under $\pi_{\theta_i}$ and $\hat{A}$ is an advantage estimate.
To obtain the final VRER estimator for iteration $k$:

$$\widehat{\nabla J}(\theta_k) = \frac{1}{|\mathcal{U}_k|}\sum_{i\in\mathcal{U}_k} g_i(\theta_k),$$

where $\mathcal{U}_k$ is the selected "reuse set" of past policies. The Clipped-LR (CLR) variant truncates the likelihood ratios at a clipping threshold to control variance at the expense of an additional bias (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022).
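To make the estimator concrete, the following NumPy sketch computes a clipped likelihood-ratio gradient estimate for one behavior batch and averages it over a reuse set. The function names, the clipping to $[1/c,\, c]$, and the synthetic inputs are illustrative assumptions, not the papers' reference implementation.

```python
import numpy as np

def clr_gradient_estimate(grad_logp, logp_current, logp_behavior, advantages, clip=1.05):
    """Clipped likelihood-ratio (CLR) policy-gradient estimate for one batch
    collected under a past behavior policy (illustrative sketch).

    grad_logp     : (n, d) score vectors grad_theta log pi_theta_k(a|s)
    logp_current  : (n,)   log pi_theta_k(a|s) under the current policy
    logp_behavior : (n,)   log pi_theta_i(a|s) under the behavior policy
    advantages    : (n,)   advantage estimates
    clip          : truncation threshold c >= 1 (bias-variance knob)
    """
    # Likelihood ratios w_i(s, a) = pi_theta_k / pi_theta_i, truncated to [1/c, c]
    w = np.exp(logp_current - logp_behavior)
    w = np.clip(w, 1.0 / clip, clip)
    # g_i(theta_k) = (1/n) * sum of w * grad log pi * A
    return (w[:, None] * grad_logp * advantages[:, None]).mean(axis=0)

def vrer_estimate(batches, clip=1.05):
    """Average the per-batch LR estimators over the reuse set U_k."""
    grads = [clr_gradient_estimate(*b, clip=clip) for b in batches]
    return np.mean(grads, axis=0)
```

In practice `grad_logp` would come from automatic differentiation of the policy's log-probability; here it is passed in directly to keep the estimator logic visible.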
3. Selective Experience Reuse and Selection Criteria
The core of VRER is the adaptive construction of $\mathcal{U}_k$. At each update, only those past policies whose associated individual LR estimator variance does not exceed a pre-specified multiple $c$ of the on-policy estimator variance are permitted:

$$\mathcal{U}_k = \big\{\, i < k : \mathrm{Var}[g_i(\theta_k)] \le c\,\mathrm{Var}[g_k(\theta_k)] \,\big\}.$$

An efficient surrogate based on the policy KL divergence can be used: include $\pi_{\theta_i}$ whenever $D_{\mathrm{KL}}(\pi_{\theta_k}\,\|\,\pi_{\theta_i})$ falls below a threshold derived from $c$ and the estimated relative variance of the current PG estimator, computable from optimizer moments (e.g., via Adam) (Zheng et al., 5 Feb 2026, Zheng et al., 2021). This ensures that only "close" historical samples, those likely to yield variance reduction, are reused.
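For simple policy classes the KL surrogate is available in closed form. The sketch below assumes diagonal-Gaussian policies and a fixed `kl_budget` standing in for the paper's variance-derived threshold; the helper names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_k, sigma_k, mu_i, sigma_i):
    """KL(pi_theta_k || pi_theta_i) between diagonal Gaussian policies."""
    var_k, var_i = sigma_k ** 2, sigma_i ** 2
    return np.sum(np.log(sigma_i / sigma_k)
                  + (var_k + (mu_k - mu_i) ** 2) / (2 * var_i) - 0.5)

def select_reuse_set(current, history, kl_budget):
    """Return indices of past policies whose KL from the current policy is
    within the budget (closed-form surrogate for the variance-ratio test)."""
    mu_k, sigma_k = current
    return [i for i, (mu_i, sigma_i) in enumerate(history)
            if gaussian_kl(mu_k, sigma_k, mu_i, sigma_i) <= kl_budget]
```

For neural policies the same check is typically run with a Monte Carlo KL estimate over a batch of states rather than a closed-form expression.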
4. Algorithmic Implementation
A canonical PG-VRER loop is structured as follows (Zheng et al., 5 Feb 2026, Zheng et al., 2021, Zheng et al., 2022):
- Data Collection: Generate transitions under the current policy $\pi_{\theta_k}$ and append the batch $\mathcal{B}_k$ to the buffer, maintaining buffer capacity $B$.
- Reuse Set Construction: For each historical policy $\pi_{\theta_i}$ in the buffer, check the variance-based selection criterion and construct $\mathcal{U}_k$ accordingly.
- Form Training Batch: Subsample transitions from each eligible $\pi_{\theta_i} \in \mathcal{U}_k$ and form the aggregated training set $\mathcal{D}_k$.
- Offline Policy Update: For a fixed number of epochs, perform mini-batch updates $\theta \leftarrow \theta + \eta\,\widehat{\nabla J}(\theta)$ using the VRER estimator computed over $\mathcal{D}_k$.
- History Management: Add $\theta_k$ to the policy history, discarding the oldest entry if necessary.
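The loop above can be sketched end to end on a toy problem. Everything here is a stand-in: a 1-D Gaussian policy on a bandit-style task, a centered-reward advantage proxy instead of a critic, and a closed-form KL check in place of the full variance-based selection test.

```python
import numpy as np
from collections import deque

def pg_vrer_loop(env_step, init_theta, iters=50, batch=64,
                 capacity=10, kl_budget=0.05, clip=1.05, lr=0.05):
    """Skeletal PG-VRER loop for a policy a ~ N(theta, 1) on a bandit-style
    task where env_step(actions) returns per-action rewards (toy sketch)."""
    theta = float(init_theta)
    buffer = deque(maxlen=capacity)  # bounded history of (theta_i, actions, advantages)
    for _ in range(iters):
        # 1. Data collection under the current policy
        a = np.random.normal(theta, 1.0, size=batch)
        r = env_step(a)
        buffer.append((theta, a, r - r.mean()))  # centered reward as advantage proxy
        # 2. Reuse-set construction: KL(N(theta,1) || N(t,1)) = (theta - t)^2 / 2
        reuse = [(t, ai, adv) for (t, ai, adv) in buffer
                 if 0.5 * (theta - t) ** 2 <= kl_budget]
        # 3-4. VRER gradient: average clipped-LR estimators over the reuse set
        grads = []
        for t, ai, adv in reuse:
            w = np.exp(-0.5 * ((ai - theta) ** 2 - (ai - t) ** 2))  # N(theta,1)/N(t,1)
            w = np.clip(w, 1 / clip, clip)
            grads.append(np.mean(w * (ai - theta) * adv))  # Gaussian score: (a - theta)
        theta += lr * float(np.mean(grads))
        # 5. History management is implicit: deque(maxlen=capacity) drops the oldest
    return theta
```

Swapping in a neural policy, a learned critic, and the full variance-ratio test recovers the actor-critic/PPO/TRPO augmentations described above.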
These steps form the backbone for augmenting standard actor-critic, TRPO, and PPO methods with VRER; the VRER additions are largely orthogonal to the details of the base optimizer. The mixture-likelihood-ratio estimator and per-transition importance weighting can be adapted to partial-trajectory or per-step granularity (Zheng et al., 2022).
5. Bias–Variance Analysis and Convergence Guarantees
VRER makes explicit a fundamental bias–variance trade-off:
- The variance of the VRER estimator is reduced by a factor of roughly $|\mathcal{U}_k|$ (the size of the reuse set), modulated by the selection rule and inter-policy correlations.
- Bias is introduced via policy drift (reuse of outdated data), per-step Markovian dependency, and likelihood-ratio clipping; these effects are tightly controlled by the selection constant, the buffer size $B$, and the reuse threshold $c$.
A finite-time convergence theorem shows that, under uniform ergodicity and smoothness assumptions and with a suitably decaying learning rate, one achieves

$$\min_{k \le K} \mathbb{E}\big[\|\nabla J(\theta_k)\|^2\big] = O\!\big(1/\sqrt{K}\big) + \varepsilon_{\mathrm{bias}}(B, c),$$

where $B$ is the buffer size and $\varepsilon_{\mathrm{bias}}(B, c)$ collects the bias induced by policy drift and clipping. Buffer size trades bias for variance reduction, and the optimal reuse threshold $c$ is typically close to $1$ ($1.02$–$1.06$ in empirical studies) (Zheng et al., 5 Feb 2026, Zheng et al., 2021).
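The trade-off can be illustrated with a small synthetic importance-sampling experiment (not from the papers): when estimating $\mathbb{E}_p[x]$ from samples drawn under a nearby distribution $q$, truncating the likelihood ratios lowers the per-sample variance but shifts the mean of the estimator.

```python
import numpy as np

# Estimate E_p[x] = 0.3 for p = N(0.3, 1) using samples from q = N(0, 1).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200_000)
w = np.exp(-0.5 * ((x - 0.3) ** 2 - x ** 2))  # likelihood ratio p(x) / q(x)

unclipped = w * x                    # unbiased LR estimator terms
clipped = np.clip(w, 1 / 1.5, 1.5) * x  # CLR-style truncation at c = 1.5
```

Clipping contracts the weights toward 1, which suppresses the high-variance tail terms but leaves a systematic gap between the clipped estimate and the true value 0.3; the selection rule exists precisely to keep this gap small by reusing only nearby policies.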
6. Empirical Properties and Applications
PG-VRER has been implemented in discrete and continuous control benchmarks (CartPole-v1, InvertedPendulum, Hopper, LunarLander), typically as an augmentation to actor-critic, PPO, or TRPO. Key observations include:
- VRER variants converge consistently faster than their on-policy baselines.
- Median gradient-variance reductions are substantial, with run-to-run performance improvements robust to buffer size and the reuse threshold $c$.
- On CartPole-v1, final returns improve for A2C-VRER, PPO-VRER, and TRPO-VRER alike; comparable improvements are observed on continuous tasks.
Partial-trajectory (per-step) VRER also shows marked acceleration in more complex settings and is robust to variations in the reuse threshold across tasks (Zheng et al., 2022). Actor-critic and PPO architectures with VRER demonstrate stable value estimation and accelerated learning, confirmed through detailed ablations (Zheng et al., 5 Feb 2026, Zheng et al., 2022, Zheng et al., 2021).
7. Theoretical Connections and Extensions
PG-VRER is strictly more general than classical experience replay, as it adaptively filters based on principled statistical tests. The framework aligns with doubly robust off-policy evaluation (Huang et al., 2019), mixture importance sampling (Zheng et al., 2022), and parameter-based exploration with variance-minimizing baselines (Zhao et al., 2013). VRER can be tuned to enforce sample-efficient trust regions (see gradient-truncation variants (Zhang et al., 2021)), can be combined with Monte Carlo critics or model-based rollouts, and extends seamlessly to overparameterized policy classes, exploiting hidden convexity for optimal sample complexity.
The approach has motivated rigorous analyses of sample dependence in off-policy RL algorithms, yielding a new understanding of the interplay between Markovian dynamics, policy drift, and estimator variance (Zheng et al., 5 Feb 2026, Zheng et al., 2021).
References:
- (Zheng et al., 5 Feb 2026) "Variance Reduction Based Experience Replay for Policy Optimization"
- (Zheng et al., 2021) "Variance Reduction based Experience Replay for Policy Optimization"
- (Zheng et al., 2022) "Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization"
- (Huang et al., 2019) "From Importance Sampling to Doubly Robust Policy Gradient"
- (Zhao et al., 2013) "Efficient Sample Reuse in Policy Gradients with Parameter-based Exploration"
- (Zhang et al., 2021) "On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method"