Variance Reduction Experience Replay (VRER)
- VRER is a reinforcement learning framework that selectively replays past transitions using variance-based criteria to improve policy gradient estimates.
- It employs importance sampling with clipping and a tunable variance threshold to balance bias and variance, ensuring provable convergence.
- Empirical evaluations with PPO, TRPO, and actor–critic show faster learning, reduced gradient variance, and improved sample efficiency.
Variance Reduction Experience Replay (VRER) is a principled framework for reinforcement learning (RL) that addresses the inefficiencies of classical experience replay (ER) by actively selecting historical samples based on their contribution to variance reduction in policy gradient estimation. Unlike uniform replay, which reuses all past transitions indiscriminately, VRER introduces a theoretically supported mechanism for sample selection and importance weighting that yields lower gradient variance, improved sample efficiency, and provable convergence guarantees in both finite-sample and infinite-horizon Markovian settings. VRER has been implemented in various policy gradient algorithms (including PPO, TRPO, and actor–critic), demonstrating empirical acceleration and stability improvements across a range of standard RL benchmarks (Zheng et al., 2021, Zheng et al., 5 Feb 2026, Zheng et al., 2022).
1. Motivation and Conceptual Foundations
Classical experience replay buffers past transitions and replays them uniformly during optimization. This uniform approach fails to account for the mismatch between the behavior policy (used at the time of data collection) and the target policy (current policy), leading to inflated variance when using importance sampling (IS) corrections or significant bias when corrections are omitted. Furthermore, on-policy policy gradient methods such as REINFORCE, A2C, PPO, and TRPO forgo most past samples, leading to extremely low sample efficiency, especially in complex stochastic environments (Zheng et al., 5 Feb 2026).
VRER overcomes these limitations by introducing a variance-based selection rule that filters historical trajectories based on the estimated variance inflation they would introduce if reused for the current policy update. Only those past samples whose importance-weighted estimator variance does not exceed a specified threshold relative to the variance of a fresh on-policy batch are retained. This selective mechanism enables significant variance reduction while keeping estimation bias within controlled bounds (Zheng et al., 2021, Zheng et al., 5 Feb 2026).
2. Formal Problem Setting and Core Methodology
VRER operates within the standard framework of infinite-horizon Markov Decision Processes (MDPs) defined by state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P(s' \mid s, a)$, a reward function $r(s, a)$, and discount factor $\gamma \in (0, 1)$. The optimization objective is to maximize the expected discounted reward

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$

where $\pi_\theta$ is the policy parameterized by $\theta$ and $d^{\pi_\theta}$ is the stationary state distribution (Zheng et al., 2022, Zheng et al., 5 Feb 2026).
Policy gradients are estimated as

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\big],$$

where $A^{\pi_\theta}$ denotes the advantage function. VRER generalizes experience replay by incorporating importance sampling corrections and variance-based sample selection. For each buffer policy $\pi_{\theta_i}$, the importance-weighted gradient estimator is

$$\hat{g}_i(\theta) = \frac{1}{n} \sum_{j=1}^{n} w_{ij}\, \psi_\theta(s_j, a_j),$$

where the per-transition weight is $w_{ij} = \pi_\theta(a_j \mid s_j) / \pi_{\theta_i}(a_j \mid s_j)$ and $\psi_\theta(s, a) = \nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)$ is the advantage-weighted score (Zheng et al., 2021, Zheng et al., 2022).
To limit extreme variance, clipped likelihood ratios can be applied, $\bar{w}_{ij} = \min(w_{ij}, c_{\max})$, where $c_{\max}$ is a user-controlled upper bound.
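A minimal NumPy sketch of this clipped importance-weighted estimator (the function and argument names are illustrative, not from the papers): the weights are exponentiated log-probability ratios, clipped at an assumed upper bound `c_max`, and applied to advantage-weighted score vectors.

```python
import numpy as np

def clipped_is_gradient(scores, logp_target, logp_behavior, advantages, c_max=10.0):
    """Importance-weighted policy gradient estimate with clipped likelihood ratios.

    scores:        (n, d) per-transition score vectors grad log pi_theta(a|s)
    logp_target:   (n,) log-probabilities under the current (target) policy
    logp_behavior: (n,) log-probabilities under the behavior policy that
                   collected the transitions
    advantages:    (n,) advantage estimates
    c_max:         user-controlled upper bound on the likelihood ratio
    """
    w = np.exp(logp_target - logp_behavior)   # per-transition importance weight
    w = np.minimum(w, c_max)                  # clip to limit variance inflation
    # average the weighted, advantage-scaled scores over the batch
    return np.mean((w * advantages)[:, None] * scores, axis=0)
```

When the behavior and target log-probabilities coincide, every weight is 1 and the estimator reduces to the ordinary on-policy gradient average.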
3. VRER Selection Rules and Variance Reduction Guarantees
VRER employs explicit variance-based criteria to select which past policies are admissible for reuse in the current optimization step. At each iteration $k$, a set of candidate past policies is screened, and only those $\pi_{\theta_i}$ for which

$$\operatorname{Var}\!\big[\hat{g}_i(\theta_k)\big] \le c \cdot \operatorname{Var}\!\big[\hat{g}_k(\theta_k)\big]$$

are included, with $c > 1$ a tunable threshold and $\operatorname{Var}[\hat{g}_k(\theta_k)]$ denoting the variance of the on-policy estimator (Zheng et al., 5 Feb 2026, Zheng et al., 2021). An approximate criterion replaces the direct variance test with a state-averaged KL divergence bound, exploiting the exponential relationship between $D_{\mathrm{KL}}(\pi_{\theta_k} \,\|\, \pi_{\theta_i})$ and IS variance inflation.
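The direct variance test can be sketched as an empirical check on per-sample gradient contributions (the function name and default threshold below are illustrative assumptions, not the papers' implementation):

```python
import numpy as np

def passes_variance_test(psi_reuse, psi_onpolicy, c=1.05):
    """Sketch of VRER's direct screening rule: admit a buffer batch only if the
    empirical variance of its importance-weighted per-sample gradient terms does
    not exceed c times that of the fresh on-policy batch.

    psi_reuse:    (n, d) importance-weighted per-sample gradient contributions
                  from a candidate buffer policy
    psi_onpolicy: (m, d) per-sample gradient contributions from the current batch
    c:            tunable threshold, c > 1
    """
    v_reuse = psi_reuse.var(axis=0).sum()        # trace of the sample covariance
    v_onpolicy = psi_onpolicy.var(axis=0).sum()
    return bool(v_reuse <= c * v_onpolicy)
```

Summing per-coordinate variances (the covariance trace) is one simple scalarization; any monotone summary of the gradient covariance could serve as the screening statistic.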
This reuse set $\mathcal{U}_k$ enables the formation of an averaged estimator $\hat{g}(\theta_k) = \frac{1}{|\mathcal{U}_k|} \sum_{i \in \mathcal{U}_k} \hat{g}_i(\theta_k)$ that achieves

$$\operatorname{Var}\!\big[\hat{g}(\theta_k)\big] \le \frac{c}{|\mathcal{U}_k|}\, \operatorname{Var}\!\big[\hat{g}_k(\theta_k)\big]$$

up to correlation effects. The mixture likelihood ratio (MLR) estimator further improves variance properties via the Multiple Importance Sampling (MIS) approach, yielding unbiased gradient estimation and provable variance reduction versus naive IS (Zheng et al., 2022).
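A hedged sketch of the MLR weight computation (names are illustrative): the single behavior-policy density in the ordinary IS weight is replaced by the average density over the $K$ buffer policies, computed stably in log space.

```python
import numpy as np

def mixture_likelihood_ratio(logp_target, logp_behaviors):
    """Mixture likelihood ratio (MLR) weights for Multiple Importance Sampling.

    logp_target:    (n,) log pi_theta(a|s) under the current policy
    logp_behaviors: (K, n) log-probabilities of the same transitions under each
                    of the K buffer policies
    Returns (n,) weights w = pi_theta / ((1/K) * sum_i pi_{theta_i}).
    """
    K = logp_behaviors.shape[0]
    # log of the mixture density via a numerically stable log-sum-exp
    m = logp_behaviors.max(axis=0)
    log_mix = m + np.log(np.exp(logp_behaviors - m).sum(axis=0)) - np.log(K)
    return np.exp(logp_target - log_mix)
```

Because the denominator is a mixture, a transition that is unlikely under one stale policy but likely under another no longer produces an extreme weight, which is the source of MLR's variance advantage over per-policy IS.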
4. Algorithmic Structure and Integration with Policy Optimization
VRER is instantiated in the Policy Gradient with Variance Reduction Experience Replay (PG-VRER) algorithm, which augments standard policy gradient routines as follows:
- At each iteration, collect new transitions under the current policy.
- Construct a buffer of bounded maximum size holding past policies and their sample batches.
- Evaluate the variance selection criterion for each buffer policy. Build the reuse set from policies passing the test and, for each, downsample a fixed number of transitions to form the training set.
- Within an inner loop, perform offline updates using mini-batch stochastic optimization. The per-update gradient is computed as a weighted average over the selected transitions from all buffer policies, using IS or clipped IS weights.
- Update the replay buffer in FIFO order.
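Putting the steps above together, a toy instantiation on a one-dimensional Gaussian-policy bandit (the environment, reward, thresholds, and all names are illustrative, not the papers' experimental setup): actions are drawn from $N(\theta, 1)$, the reward peaks at action 2, screening uses the closed-form Gaussian KL, and reused batches are reweighted with clipped IS.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

def logp(theta, a):                      # Gaussian policy N(theta, 1), up to a constant
    return -0.5 * (a - theta) ** 2

theta, lr, n = 0.0, 0.05, 64
buffer = deque(maxlen=8)                 # FIFO buffer of (policy, actions, rewards)

for k in range(200):
    a = rng.normal(theta, 1.0, n)        # fresh on-policy batch
    r = -(a - 2.0) ** 2                  # toy reward, maximized at action a = 2
    buffer.append((theta, a, r))

    grad, m = 0.0, 0
    for theta_i, a_i, r_i in buffer:
        # approximate screening: KL(N(theta,1) || N(theta_i,1)) = (theta - theta_i)^2 / 2
        if 0.5 * (theta - theta_i) ** 2 > 0.05:
            continue                     # too stale: reuse would inflate IS variance
        w = np.minimum(np.exp(logp(theta, a_i) - logp(theta_i, a_i)), 10.0)  # clipped IS
        adv = r_i - r_i.mean()           # baseline-subtracted reward as a crude advantage
        grad += np.mean(w * adv * (a_i - theta))   # score of N(theta,1) is (a - theta)
        m += 1
    theta += lr * grad / m               # average over the reuse set, then ascend

print(round(theta, 2))                   # converges near the optimum at 2.0
```

The current policy always passes its own screening test (zero KL), so the reuse set is never empty; stale batches rejoin the gradient average only while the policy has not drifted too far from them.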
No modification to network architectures or the main optimization objectives is required; VRER operates as a modular sample-selection and reweighting wrapper (Zheng et al., 5 Feb 2026, Zheng et al., 2021). The following table summarizes key hyperparameters and their roles:
| Hyperparameter | Typical Range |
|---|---|
| Variance-selection threshold | 1.02–1.10 |
| Buffer size (number of past policies) | up to $400$ (can scale to 1,000) |
| Downsamples per selected batch | 3–5 |
| Offline update epochs | 5–10 |
5. Theoretical Properties and Bias-Variance Trade-off
VRER is supported by finite-time convergence guarantees and an explicit analysis of the bias–variance trade-off in the Markovian RL setting. Reusing transitions from increasingly old policies introduces bias, which is analytically characterized as growing with the policy lag $\tau$ and the MDP's mixing time $\tau_{\mathrm{mix}}$. The expected norm of the policy gradient over $T$ steps satisfies a bound of the form

$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big\|\nabla J(\theta_t)\big\|^{2} = \mathcal{O}\!\left(\frac{1}{T^{1-\alpha}}\right) + \mathcal{O}\!\left(\frac{\tau\, \tau_{\mathrm{mix}}}{T^{\alpha}}\right) + \mathcal{O}\!\left(\frac{1 + \rho\,(B - 1)}{B\, T^{\alpha}}\right),$$

where $\alpha$ is the stepsize decay parameter, $\tau$ reflects the lag of reused samples, $B$ is the buffer size, and $\rho$ is an average reuse correlation. Variance shrinks proportionally to $(1 + \rho\,(B-1))/B$, controlled by $B$ and $\rho$, but at the expense of bias from replaying stale samples. The user must select $B$ and the tolerated lag $\tau$ to balance this trade-off given the environment's mixing properties (Zheng et al., 2021, Zheng et al., 5 Feb 2026).
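The role of reuse correlation can be checked numerically under a simple equicorrelated model (an illustrative assumption, not the papers' exact analysis): averaging $B$ estimators with common variance $\sigma^2$ and pairwise correlation $\rho$ gives variance $\sigma^2 (1 + \rho(B-1))/B$, which floors at $\rho\sigma^2$ rather than decaying to zero as $B$ grows.

```python
import numpy as np

# Monte-Carlo check of Var[mean of B correlated estimators] = sigma^2 (1 + rho(B-1)) / B.
# B plays the role of the reuse-set size, rho of the average reuse correlation.
rng = np.random.default_rng(1)
sigma2, rho, B, trials = 1.0, 0.2, 10, 200_000

# build B estimators with pairwise correlation rho via a shared noise component
shared = rng.normal(0.0, np.sqrt(rho * sigma2), trials)          # common to all B
own = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), (B, trials))  # independent parts
avg = (shared + own).mean(axis=0)                                # average of B estimators

empirical = avg.var()
predicted = sigma2 * (1 + rho * (B - 1)) / B
print(round(empirical, 3), round(predicted, 3))   # the two should agree closely
```

With $\rho = 0.2$ and $B = 10$ the variance drops to $0.28\sigma^2$ rather than $0.1\sigma^2$, illustrating why correlation among reused batches limits how much a larger buffer can help.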
6. Empirical Evaluation and Practical Impact
VRER and PG-VRER have been evaluated on benchmarks including CartPole, Hopper, Inverted Pendulum, LunarLander (OpenAI Gym/PyBullet/MuJoCo environments), and domains such as discrete control and industry-relevant continuous tasks (Zheng et al., 2022, Zheng et al., 2021, Zheng et al., 5 Feb 2026). Across PPO, TRPO, and A2C, VRER consistently yields:
- Faster learning curves (earlier attainment of near-optimal returns).
- Lower final gradient variance (20–30% reduction for PPO).
- Improved sample efficiency (e.g., CartPole: PPO-VRER achieves a higher final reward versus vanilla PPO after 1M steps).
- Robustness to changes in the variance-selection threshold, buffer size, and number of downsamples, with stability against the divergence seen in challenging environments.
Empirical variance diagnostics confirm that the reduction in estimator variance translates into more stable and reliably monotonic performance improvement. Statistical significance is affirmed via confidence intervals over multiple random seeds (Zheng et al., 2021, Zheng et al., 2022).
7. Extensions, Limitations, and Theoretical Connections
VRER is algorithm-agnostic, applicable in any policy gradient setting where a replay buffer and IS correction are available. The framework is extensible:
- Mixture models and Multiple Importance Sampling (MIS) provide further robustness to extreme weights.
- Use of DICE-style density ratio estimators can eliminate residual bias due to the stationary distribution approximation.
- The core variance-screening idea generalizes to Natural Policy Gradient, Soft Actor-Critic (SAC), and kernel methods (Zheng et al., 2022, Han et al., 1 Feb 2025).
Notable limitations include the computational cost of screening all past buffer entries when the buffer is large, for which pruning or heuristics may be introduced. Current practice often drops the stationary state-distribution ratio in IS calculations for tractability, introducing a controlled approximation.
Connections to U-statistics and resampling theory provide a rigorous underpinning: interpreting replayed minibatch updates as averages over resampled tuples, VRER inherits the provable variance-reduction properties of U- and V-statistics, with the added benefit of reduced computational complexity in kernelized methods (Han et al., 1 Feb 2025).
Major References:
- "Variance Reduction Based Experience Replay for Policy Optimization" (Zheng et al., 5 Feb 2026, Zheng et al., 2021)
- "Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization" (Zheng et al., 2022)
- "Variance Reduction via Resampling and Experience Replay" (Han et al., 1 Feb 2025)