Doubly Robust Estimator in Causal Inference
- The doubly robust estimator is a semiparametric method that combines an outcome-regression model and a propensity-score model, yielding consistent estimation when either model is correctly specified.
- It applies advanced techniques like sample-splitting and martingale central limit theory to address challenges in adaptive experiments and dependent data.
- Empirical results show that adaptive DR estimators improve finite-sample stability and efficiency through regularization and cross-fitting methods.
A doubly robust (DR) estimator is a semiparametric technique that achieves consistency for a target parameter when either of two working models ("nuisance models") is correctly specified, but not necessarily both. In modern causal inference, DR estimators are central tools for unbiased policy evaluation, semiparametric estimation, and inference in settings where the outcome and treatment assignment mechanisms may be complex or adaptively learned. The construction and analysis of DR estimators are deeply entwined with efficient influence-function theory, sample-splitting, and orthogonality of estimating equations. Current research explores their application to adaptive experiments, high-dimensional models, off-policy evaluation in bandits, and complex observational designs.
1. Classical Doubly Robust Construction and Properties
The canonical DR estimator is formulated in the context of estimating a policy-value functional of the form

$$V(\pi^e) = \mathbb{E}\Big[\sum_{a} \pi^e(a \mid X)\,\mu(a, X)\Big],$$

where $\pi^e$ is an evaluation policy, $\pi^b$ is a logging (behavior) policy, and $\mu(a, x) = \mathbb{E}[Y \mid A = a, X = x]$ is an outcome regression.
The classical DR estimator is

$$\hat{V}_{\mathrm{DR}} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\pi^e(A_i \mid X_i)}{\hat{\pi}^b(A_i \mid X_i)}\big(Y_i - \hat{\mu}(A_i, X_i)\big) + \sum_{a}\pi^e(a \mid X_i)\,\hat{\mu}(a, X_i)\right].$$
It is termed "doubly robust" because consistency for $V(\pi^e)$ holds if either the outcome regression $\hat{\mu}$ or the propensity model $\hat{\pi}^b$ is correctly specified; if both are, asymptotic efficiency is attained. The estimator is unbiased under either scenario:
- If $\hat{\mu}$ is correct, the weighted residual term vanishes in expectation because $\hat{\mu}(a, x)$ matches the true conditional mean $\mathbb{E}[Y \mid A = a, X = x]$.
- If $\hat{\pi}^b$ is correct, the importance-weighted residual exactly cancels the bias of a misspecified $\hat{\mu}$, ensuring unbiasedness for $V(\pi^e)$.
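In code, the classical DR estimator above can be sketched as follows (a minimal NumPy version; the function name and array layout are illustrative, and the nuisance estimates are assumed to be supplied by the user's own learners):

```python
import numpy as np

def dr_policy_value(A, Y, pi_e, pi_b_hat, mu_hat):
    """Classical doubly robust (AIPW) estimate of the policy value V(pi_e).

    A        : (n,) observed actions (integer indices)
    Y        : (n,) observed outcomes
    pi_e     : (n, K) evaluation-policy probabilities pi_e(a | X_i)
    pi_b_hat : (n, K) estimated logging-policy probabilities
    mu_hat   : (n, K) outcome-regression predictions mu_hat(a, X_i)
    """
    idx = np.arange(len(A))
    # Importance-weighted residual: corrects the bias of the regression term.
    residual = pi_e[idx, A] / pi_b_hat[idx, A] * (Y - mu_hat[idx, A])
    # Direct-method term: regression predictions averaged under pi_e.
    direct = (pi_e * mu_hat).sum(axis=1)
    return float(np.mean(residual + direct))
```

If `mu_hat` is exactly correct, the residual term has mean zero and the estimate reduces to the direct method; if `pi_b_hat` is exact, the weighted residual removes the bias of a wrong `mu_hat`, mirroring the two bullets above.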
2. Adaptive Experiments and Dependent Data: The ADR Estimator
In adaptive experiments, the logging policy evolves with time, based on past data. Here, observations are dependent, and classical i.i.d.-based DR theory does not directly apply. The Adaptive Doubly Robust (ADR) estimator addresses these challenges by adopting a time-indexed sample-splitting approach:
- At time $t$, fit nuisance functions $\hat{\pi}_{t-1}$ (logging policy) and $\hat{\mu}_{t-1}$ (outcome regression) using only the history $\Omega_{t-1} = \{(X_s, A_s, Y_s)\}_{s < t}$.
- Compute the pseudo-outcome
$$\hat{\Gamma}_t = \frac{\pi^e(A_t \mid X_t)}{\hat{\pi}_{t-1}(A_t \mid X_t)}\big(Y_t - \hat{\mu}_{t-1}(A_t, X_t)\big) + \sum_{a}\pi^e(a \mid X_t)\,\hat{\mu}_{t-1}(a, X_t).$$
- Aggregate to form
$$\hat{V}_{\mathrm{ADR}} = \frac{1}{T}\sum_{t=1}^{T}\hat{\Gamma}_t.$$
Martingale central limit theory and semiparametric arguments show that, under uniform overlap, consistent nuisance estimation rates, and regularity, the estimator is asymptotically normal:
$$\sqrt{T}\big(\hat{V}_{\mathrm{ADR}} - V(\pi^e)\big) \xrightarrow{d} \mathcal{N}(0, \sigma^2),$$
where $\sigma^2$ is the influence-function variance. Sample-splitting over time mitigates the violation of Donsker conditions by modern learners, establishing valid inference even with dependent, adaptively generated samples (Kato et al., 2020).
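The time-indexed recipe above can be sketched in code (a simplified version with no covariates, so the hypothetical nuisance fits reduce to running per-action means and frequencies; the pseudo-count initialization and clipping are illustrative regularization choices, not prescribed by the source):

```python
import numpy as np

def adr_estimate(A, Y, pi_e, clip=0.05):
    """ADR estimate of V(pi_e) from an adaptively collected stream.

    At each t, the logging-policy estimate pi_hat and the outcome
    regression mu_hat are fit on the history Omega_{t-1} only
    (time-indexed sample-splitting), then the pseudo-outcome Gamma_t
    is formed. Covariate-free sketch: nuisances are per-action stats.
    """
    T, K = pi_e.shape
    counts = np.ones(K)   # pseudo-counts keep early estimates bounded
    sums = np.zeros(K)
    gammas = []
    for t in range(T):
        # Nuisances from history only; clipping enforces overlap.
        pi_hat = np.clip(counts / counts.sum(), clip, 1.0)
        mu_hat = sums / counts
        a, y = A[t], Y[t]
        gamma = pi_e[t, a] / pi_hat[a] * (y - mu_hat[a]) + pi_e[t] @ mu_hat
        gammas.append(gamma)
        counts[a] += 1.0  # update history *after* forming Gamma_t
        sums[a] += y
    return float(np.mean(gammas)), np.array(gammas)
```

Updating `counts` and `sums` only after $\hat{\Gamma}_t$ is formed is the essential point: the nuisances at time $t$ never see the observation they are evaluated on.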
3. Efficiency, Rate Conditions, and Martingale Theory
DR estimators achieve semiparametric efficiency provided both the outcome regression and propensity components are consistently estimated. In the ADR construction, a bias decomposition shows that nuisance estimation errors interact multiplicatively; under bounded outcomes and uniform overlap, the product of errors in $\hat{\pi}_{t-1}$ and $\hat{\mu}_{t-1}$ must satisfy
$$\big\|\hat{\pi}_{t-1} - \pi_t\big\| \cdot \big\|\hat{\mu}_{t-1} - \mu\big\| = o_P\big(t^{-1/2}\big),$$
so that the bias vanishes asymptotically. This "product-rate" principle holds for i.i.d. and dependent samples alike and licenses the use of machine-learning algorithms with slower-than-parametric rates, provided cross-fitting or sample-splitting is employed.
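The multiplicative interaction can be made explicit. A standard calculation under the notation above (assuming $\hat{\pi}_{t-1}(a \mid x) \ge \varepsilon > 0$) gives the conditional bias of a single pseudo-outcome as

$$\mathbb{E}\big[\hat{\Gamma}_t - V(\pi^e) \,\big|\, \Omega_{t-1}\big] = \mathbb{E}_X\Big[\sum_{a} \pi^e(a \mid X)\,\frac{\pi_t(a \mid X) - \hat{\pi}_{t-1}(a \mid X)}{\hat{\pi}_{t-1}(a \mid X)}\big(\mu(a, X) - \hat{\mu}_{t-1}(a, X)\big)\Big],$$

so each summand is bounded by $\varepsilon^{-1}$ times the product of the two estimation errors, and vanishes entirely if either nuisance is exact.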
The influence function for the ADR estimator is a martingale increment:
$$\mathbb{E}\big[\hat{\Gamma}_t - V(\pi^e) \,\big|\, \Omega_{t-1}\big] = 0,$$
which permits standard variance computation and confidence-interval construction.
4. Finite-Sample Phenomena and the Logging-Policy Paradox
Simulation studies demonstrate that the ADR estimator using an estimated logging policy $\hat{\pi}_{t-1}$ often exhibits substantially lower mean squared error (MSE) than "AIPW" estimators employing the true logging policy $\pi_t$. This arises because, early in adaptive experiments, the true $\pi_t(a \mid x)$ can be near 0 or 1 for certain actions, causing instability in IPW weights. Regularized or truncated estimates $\hat{\pi}_{t-1}$ remain bounded, reducing variance. Asymptotic efficiency theory does not resolve this paradox: the limiting variances of ADR and AIPW are identical, but ADR is more stable in finite samples. This phenomenon mirrors earlier findings for propensity-score estimation in i.i.d. designs but is distinctively acute for dependent, adaptively collected data (Kato et al., 2020).
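The stabilizing effect of bounded propensity estimates can be seen in a toy calculation (illustrative numbers, not from the source): truncating the propensity away from zero caps the largest importance weights, and with it the variance of the IPW term.

```python
import numpy as np

# Toy logging probabilities early in an adaptive experiment: one action
# is nearly deterministic, so its inverse-propensity weight explodes.
pi_true = np.array([0.01, 0.50, 0.99, 0.50])
pi_trunc = np.clip(pi_true, 0.05, 0.95)  # regularized / truncated estimate

w_true = 1.0 / pi_true    # largest weight is 1 / 0.01 = 100
w_trunc = 1.0 / pi_trunc  # capped at 1 / 0.05 = 20

# Truncation shrinks the spread of the weights.
print(w_true.max(), w_trunc.max(), w_true.var() > w_trunc.var())
```

This is the finite-sample mechanism behind the paradox: the truncated weights are mildly biased but far less variable, and early-round weight explosions dominate the MSE of the true-propensity AIPW estimator.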
5. Implementation and Practical Recommendations
DR estimation requires careful selection of nuisance learners with known error rates. For logging-policy models, logistic regression or gradient-boosted trees are commonly used, with regularization or truncation to maintain overlap. For outcome regressions, random forests, neural networks, or kernel methods are suitable. Interval estimation employs the plug-in influence-function variance
$$\hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\big(\hat{\Gamma}_t - \hat{V}_{\mathrm{ADR}}\big)^2,$$
yielding intervals $\hat{V}_{\mathrm{ADR}} \pm z_{1-\alpha/2}\,\hat{\sigma}/\sqrt{T}$.
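The plug-in interval translates directly to code (a minimal sketch operating on the pseudo-outcomes $\hat{\Gamma}_t$; the function name is illustrative):

```python
import numpy as np
from statistics import NormalDist

def adr_confidence_interval(gammas, alpha=0.05):
    """Plug-in influence-function interval for the ADR estimate.

    gammas : sequence of pseudo-outcomes Gamma_t, t = 1..T.
    Returns (estimate, lower, upper) for a (1 - alpha) interval.
    """
    gammas = np.asarray(gammas, dtype=float)
    T = len(gammas)
    v_hat = gammas.mean()
    # sigma^2_hat = (1/T) * sum_t (Gamma_t - V_hat)^2
    sigma2_hat = ((gammas - v_hat) ** 2).mean()
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    half = z * np.sqrt(sigma2_hat / T)
    return v_hat, v_hat - half, v_hat + half
```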
Key conditions:
- Logging probabilities must stabilize (i.e., $\pi_t(a \mid x)$ converges and remains bounded away from 0 and 1, preserving overlap).
- The product of nuisance estimation rates must satisfy $\|\hat{\pi}_{t-1} - \pi_t\| \cdot \|\hat{\mu}_{t-1} - \mu\| = o_P(t^{-1/2})$.
Sample-splitting, cross-fitting, and regularization are central to robust practical performance.
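For i.i.d. data, the cross-fitting referred to above can be sketched as follows (a minimal fold-based version; the `fit_nuisances` callback interface is a hypothetical convention, standing in for whatever learners the user chooses):

```python
import numpy as np

def crossfit_dr(A, Y, pi_e, fit_nuisances, n_folds=2, seed=0):
    """Cross-fitted DR estimate of V(pi_e).

    Nuisances for each fold are fit on the remaining folds, so a
    learner is never evaluated on its own training data.

    fit_nuisances(train_idx) -> (pi_b_hat, mu_hat), each shaped (n, K):
    predictions for *all* units from models fit only on train_idx.
    """
    n = len(A)
    perm = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(perm, n_folds)
    scores = np.empty(n)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])
        pi_b_hat, mu_hat = fit_nuisances(train_idx)
        a = A[test_idx]
        resid = (pi_e[test_idx, a] / pi_b_hat[test_idx, a]
                 * (Y[test_idx] - mu_hat[test_idx, a]))
        direct = (pi_e[test_idx] * mu_hat[test_idx]).sum(axis=1)
        scores[test_idx] = resid + direct
    return float(scores.mean())
```

The fold structure is the i.i.d. analogue of the ADR's time-indexed splitting: both ensure the nuisance fit is independent of the point at which it is evaluated.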
6. Extensions and Connections to Modern Theory
The doubly robust paradigm recurs throughout modern semiparametric theory, connecting to orthogonal estimating equations, efficient score construction, and modern sample-splitting/cross-fitting analysis [Chernozhukov et al., 2018]. ADR estimators have close analogues in double/debiased machine learning, adaptive causal inference, and off-policy evaluation. The techniques extend naturally to high-dimensional covariate settings, adaptive bandit models, policy evaluation under dependence, and model selection environments.
7. Summary Table: Classical DR vs. ADR in Adaptive Experiments
| Estimator | Setting | Asymptotic Normality | Finite Sample Stability | Learners for Nuisances |
|---|---|---|---|---|
| Classical DR | i.i.d. | standard CLT | unstable if $\pi^b(a \mid x)$ near 0 or 1 | logistic regression, trees, forests |
| ADR | adaptive/dependent | martingale CLT with sample-splitting | stable via truncated $\hat{\pi}_{t-1}$, regularization | same, requires adaptive fitting |
ADR estimators unify the theory of DR inference in adaptive and dependent data settings and empirically outperform classical AIPW approaches under time-varying policies, demonstrating robustness, efficiency, and finite-sample advantages (Kato et al., 2020).