DR-PG: Doubly Robust Policy Gradient

Updated 16 January 2026

DR-PG is a family of methods that combines model-based reward predictions with importance sampling corrections to yield unbiased, low-variance policy gradients.
It extends traditional contextual bandit techniques to both stochastic and deterministic reinforcement learning settings using advanced kernelization and marginalization strategies.
Empirical studies show DR-PG achieves substantial reductions in gradient variance and improved sample efficiency, outperforming standard IPS and direct methods in various domains.

Doubly Robust Policy Gradient (DR-PG) is a family of policy optimization methodologies that integrate doubly robust (DR) estimation with policy gradient algorithms in both contextual bandits and reinforcement learning. DR estimation leverages both model-based reward predictions and importance-weighted corrections to yield unbiased, low-variance estimates of policy value and its gradient, even under substantial model misspecification. DR-PG constitutes a major advance in off-policy evaluation and optimization, consistently outperforming prior approaches based solely on direct methods or inverse propensity/importance sampling. Recent developments extend DR-PG from finite-action bandits to stochastic and deterministic continuous-control settings, using kernelization and marginalization to overcome intrinsic density-ratio obstacles.

1. Principles of Doubly Robust Estimation

Doubly robust (DR) estimators combine two sources of information: (1) a direct model for the expected reward (regression or Q-function approximation), and (2) an importance sampling (IS) correction using the ratio between target and logging policies. In the contextual bandit setting, given data $S = \{(x_i, h_i, a_i, r_{i,a_i})\}_i$ and a deterministic policy $\pi$ , the DR estimator for value is

$\hat V_{DR}(\pi) = \frac{1}{|S|} \sum_{(x,h,a,r_a) \in S} \left[ (r_a - \hat h_a(x)) I\{\pi(x) = a\} / b(a|x,h) + \hat h_{\pi(x)}(x) \right]$

where $\hat h_a(x)$ is the fitted expected reward for context-action pair, and $b(a|x,h)$ is the estimated logging policy probability. The bias of this estimator is proportional to the product of the errors in the regression and propensity estimates (Theorem 3.1 of (Dudik et al., 2011)), vanishing if either is accurate. Its variance interpolates between high-variance IS and low-bias DM (Dudik et al., 2011, Dudík et al., 2015).

When policy is parameterized and differentiable, the DR estimator can be differentiated to yield a policy gradient with improved statistical properties (Dudík et al., 2015, Huang et al., 2019).

2. DR-PG in Contextual Bandits and Off-Policy Learning

Doubly robust techniques were initially developed for contextual bandits with finite actions (Dudik et al., 2011, Dudík et al., 2015). The parameterized DR estimator for policy $\pi_\theta(a|x)$ is

$\hat V_{DR}(\pi_\theta) = \frac{1}{n} \sum_{k=1}^n \left[ \sum_{a} \pi_\theta(a|x_k) \hat Q(x_k, a) + \frac{\pi_\theta(a_k|x_k)}{b_k}(r_k - \hat Q(x_k, a_k)) \right]$

and its gradient w.r.t. $\theta$ yields the DR policy gradient: $\nabla_\theta \hat V_{DR}(\pi_\theta) = \frac{1}{n} \sum_{k=1}^n \sum_{a} \pi_\theta(a|x_k) \hat Q(x_k,a) \nabla_\theta \log \pi_\theta(a|x_k) + \frac{\pi_\theta(a_k|x_k)}{b_k}(r_k - \hat Q(x_k,a_k)) \nabla_\theta \log \pi_\theta(a_k|x_k)$ This approach enables off-policy learning from logged bandit data using stochastic gradient ascent (Dudík et al., 2015).

Empirically, DR-based policy optimization achieves substantially lower variance and improved performance compared to IPS and DM, consistently outperforming these baselines in multiclass classification, covariate shift, and web-content ranking (Dudik et al., 2011, Dudík et al., 2015).

3. DR-PG in Reinforcement Learning: Stochastic and Deterministic Policies

DR-PG is generalized to full reinforcement learning by combining trajectory-based DR estimators and policy gradient methods (Huang et al., 2019, Islam et al., 2019, Kallus et al., 2020). In episodic MDPs, the doubly robust policy gradient is derived by finite-differencing the DR value estimator: $\hat g = \sum_{t=0}^T \left\{ \nabla_\theta \log \pi_\theta(a_t|s_t) \left[ \sum_{t'=t}^T \gamma^{t'} r_{t'} + \sum_{t'=t+1}^T \gamma^{t'} (V_t'(s_{t'}) - Q_t'(s_{t'}, a_{t'})) \right] + \gamma^t \left( \nabla_\theta V_t'(s_t) - \nabla_\theta Q_t'(s_t, a_t) - Q_t'(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t|s_t) \right) \right\}$ (Huang et al., 2019). The first term generalizes REINFORCE with a telescoping DR correction; the second term captures the gradient of the critic.

For deterministic policies in continuous action spaces, classical IS is ill-defined due to absence of density overlap. Kernelization replaces the Dirac policy with a smoothed kernel: $\pi^{e,K}_{\theta,t}(a_t|s_t) = K_h(a_t - \tau_{\theta,t}(s_t))$ and the kernelized doubly robust policy gradient (K-DRPG) uses derivatives of the kernel density to enable off-policy gradient estimation while attaining optimal statistical rates, e.g., MSE $O(n^{-4/7})$ independent of horizon under proper marginalization (Kallus et al., 2020).

4. Actor-Critic DR-PG Algorithms and Implementation

DR-PG principles are applied in actor-critic frameworks by constructing doubly robust TD targets for critic evaluation: $DR(s, a, s') = \hat Q(s, \pi(s); \psi_{QV}) + \rho(s, a)\left[ r(s, a) + \gamma \hat Q(s', \pi(s'); \psi_{QV}) - \hat Q(s, a; \psi_{QV}) \right]$ where $\rho(s, a)$ is the importance ratio, and $\hat Q, \hat V$ are regression-based estimates. The policy is then updated via standard deterministic or stochastic gradient rules, replacing Q values by the DR critic (Islam et al., 2019). In practical terms, training proceeds via periodic minibatch gradient steps over replay buffer, sequentially updating reward model, critic, and policy according to respective DR targets.

Pseudocode for DR-PG in contextual bandits (Dudík et al., 2015) and actor-critic RL (Islam et al., 2019) follows standard minibatch SGD templates, with DR estimator or DR target as the gradient signal. For deterministic RL, kernel-based DR-PG requires additional nuisance function estimation (reward models, density ratios) and kernel bandwidth optimization (Kallus et al., 2020).

5. Theoretical Guarantees: Bias, Variance, and Sample Complexity

DR estimators preserve the doubly robust property: unbiased if either the reward or the propensity model is correct. Theoretical bounds in (Dudik et al., 2011) and (Dudík et al., 2015) show

$|\mathbb{E}[\hat V_{DR}] - V^\pi| \leq \mathbb{E}_x[|\Delta(\pi(x),x)| \cdot |\delta(\pi(x),x)|]$

and variance scales as a mixture of DM and IS variance components.

For DR-PG, variance analysis in (Huang et al., 2019) yields the covariance matrix explicitly, and shows that perfect knowledge of Q and its gradient achieves the Cramér–Rao lower bound in tree-structured MDPs. Kernelized DR-PG and marginalized variants show that sample complexity is optimal up to the order $O(n^{-4/7})$ , independent of RL horizon, effectively breaking the curse of horizon (Kallus et al., 2020).

6. Empirical Performance, Robustness, and Extensions

Empirical studies on classification, web search, and continuous control demonstrate that DR-PG consistently reduces gradient variance by 60–80% relative to policy gradient baselines, yields lower RMSE in evaluation, and improves sample efficiency (Dudik et al., 2011, Dudík et al., 2015, Huang et al., 2019, Islam et al., 2019). In MuJoCo RL tasks, DR-actor-critic algorithms reach target reward thresholds faster and with greater stability than standard methods. Kernelized DR-PG outperforms kernelized ISPG in both bias and variance (Kallus et al., 2020).

DR-PG exhibits superior robustness under corrupted reward signals: the regression (model-based) component filters stochastic noise, maintaining stable learning and outperforming baselines under strong perturbations (Islam et al., 2019). Extensions include higher-order DR, trajectory DR control variates, and integration of model-based critic gradients for further variance reduction (Huang et al., 2019). Recent work identifies finite-difference connections between OPE estimators and PG algorithms, enabling systematic construction of new control variates.

7. Practical Considerations and Recommendations

Implementation of DR-PG demands careful nuisance modeling (Q-function, reward, and density ratio estimation), kernel/bandwidth selection in deterministic continuous-control cases, and mitigation against large importance weights (weight clipping). Cross-fitting strategies are recommended to avoid overfitting in nuisance estimates (Kallus et al., 2020). Computational cost scales with number of samples and horizon, with marginalization and kernel integration costing $O(nH)$ per update. In practice, DR-PG is especially advantageous in offline, partial-feedback, or reward-corrupted regimes where classical IS or direct methods are ineffective.

Relevant works for further study include Dudík et al. (2011, 2014), Huang & Jiang (2020), Kallus & Uehara (2020), and Zhang et al. (2019) (Dudik et al., 2011, Dudík et al., 2015, Huang et al., 2019, Islam et al., 2019, Kallus et al., 2020).