
Proximal Off-Policy Evaluation

Updated 15 December 2025
  • Proximal off-policy evaluation is a set of techniques that estimate the value of target policies in reinforcement learning by leveraging proxy variables to address unobserved confounding.
  • It employs recursively defined bridge functions and nonparametric identification methods to resolve integral equations and mitigate biases in partially observed systems.
  • Recursive estimation algorithms using NPIV regression and regularization yield finite-sample error bounds and achieve semiparametric efficiency under practical constraints.

The proximal off-policy evaluation (OPE) approach constitutes a class of techniques for policy value estimation in reinforcement learning (RL) when data are collected under a behavior policy potentially confounded by unobserved state variables. Traditional off-policy evaluation methods assume full observability, yet real-world domains, particularly in healthcare, education, and decision support, frequently present unmeasured confounding and partially observable Markov decision processes (POMDPs). Proximal OPE leverages proxy variables and nonparametric identification theory from proximal causal inference to provide robust, theoretically justified estimation procedures in POMDPs and MDPs with unobserved confounders.

1. Formal Problem Setup and Motivation

Proximal OPE seeks to estimate the value of a target policy $\pi$ using trajectory data generated from a possibly confounded behavior policy $\mu$, where the underlying environment is a POMDP or an MDP with latent confounders. At each time $t$, the process is described by:

  • Latent state $U_t \in \mathcal{U}$
  • Observed proxy or partial state $S_t \in \mathcal{S}$
  • Action $A_t \in \mathcal{A}$
  • Reward $R_t \in [-1,1]$ in the episodic POMDP setting ($R_t \in \mathbb{R}$ in general)

Data are sampled as $(S_t, A_t, R_t)$ tuples or, more generally, as full trajectories, without observations of $U_t$. The goal is to evaluate

$$V^\pi = \mathbb{E}_{\pi}\Bigl[\sum_{t=1}^T R_t\Bigr],$$

the expected cumulative reward when actions are drawn from $\pi$. Standard OPE approaches are biased in this setting because $U_t$ confounds the action–reward relationship. The proximal approach introduces time-dependent proxy variables, typically an action-inducing proxy $Z_t$ and a reward-inducing proxy $W_t$, constructed to "break" confounding through specific conditional independence relations (Miao et al., 2022, Bennett et al., 2021, Bennett et al., 2020).
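The bias that latent confounding induces, and the role the proxies play, can be seen in a minimal synthetic one-step example (all distributions below are invented for illustration, not taken from the cited papers): the latent $U$ drives both the behavior action and the reward, while $Z$ and $W$ are noisy views of $U$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Latent confounder U (never observed by the estimator).
U = rng.integers(0, 2, n)
# Action-inducing and reward-inducing proxies: noisy views of U.
Z = np.where(rng.random(n) < 0.8, U, 1 - U)
W = np.where(rng.random(n) < 0.8, U, 1 - U)

# Confounded behavior policy: prefers a=1 when U=1.
A = (rng.random(n) < np.where(U == 1, 0.8, 0.3)).astype(int)
# Reward depends on both the action and the latent state.
R = A + U + rng.normal(0, 0.1, n)

# Target policy pi: always play a=1, so the true value is E[1 + U] = 1.5.
v_true = 1.0 + U.mean()

# Naive direct-method estimate that ignores U: E_mu[R | A=1].
v_naive = R[A == 1].mean()

print(round(v_true, 2), round(v_naive, 2))  # naive estimate is biased upward
```

The naive estimate overshoots because $A=1$ is over-selected exactly when $U=1$; proximal OPE recovers the unconfounded value from $(Z, S, A, R, W)$ alone via the bridge equations of Section 2.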

2. Proximal Identification and Bridge Functions

The cornerstone of the proximal OPE framework is the identification of the policy value through a sequence of recursively defined bridge functions. In the episodic POMDP case, these are the $Q$-bridge functions $q_t : \mathcal{W} \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and $V$-bridge functions $v_t : \mathcal{W} \times \mathcal{S} \to \mathbb{R}$ satisfying

$$\mathbb{E}_\mu\left[ q_t(W_t, S_t, A_t) \mid U_t=u,\, S_t=s,\, A_t=a \right] = \mathbb{E}_\pi\left[ \textstyle\sum_{t'=t}^T R_{t'} \mid U_t=u,\, S_t=s,\, A_t=a \right]$$

and

$$v_t(w, s) = \sum_{a\in\mathcal{A}} \pi_t(a\mid s)\, q_t(w, s, a).$$

The $Q$-bridge functions satisfy linear integral equations in observed variables and proxies,

$$\mathbb{E}_\mu\left[ q_t(W_t, S_t, A_t) \mid Z_t, S_t, A_t \right] = \mathbb{E}_\mu\left[ R_t + v_{t+1}(W_{t+1}, S_{t+1}) \mid Z_t, S_t, A_t \right],$$

with boundary condition $v_{T+1} \equiv 0$. The policy value is ultimately expressed as

$$V^\pi = \mathbb{E}_\mu\left[ v_1(W_1, S_1) \right],$$

enabling evaluation solely from observed and proxy data (Miao et al., 2022). An analogous framework holds for infinite-horizon settings with stationary distribution ratios and for doubly robust bridge-based representations (Bennett et al., 2021, Bennett et al., 2020).
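In a fully discrete setting these integral equations reduce to finite linear systems, which makes the identification argument concrete. The sketch below (with invented probability tables) solves the bridge equation at one step for a fixed $(s, a)$ pair:

```python
import numpy as np

# Tabular bridge equation at step t for a fixed (s, a), with binary proxies:
#   sum_w P(W_t = w | Z_t = z, s, a) * q_t(w, s, a)
#     = E_mu[R_t + v_{t+1}(W_{t+1}, S_{t+1}) | Z_t = z, s, a]
# i.e. a linear system  M q = b  with  M[z, w] = P(W_t = w | Z_t = z, s, a).

M = np.array([[0.8, 0.2],    # P(W | Z=0, s, a)  (illustrative values)
              [0.3, 0.7]])   # P(W | Z=1, s, a)
b = np.array([1.1, 1.6])     # right-hand-side conditional means

# Completeness in the discrete case means M has full column rank,
# so the bridge function q_t(., s, a) is uniquely determined.
q = np.linalg.solve(M, b)
print(q)  # ≈ [0.9, 1.9]

# v_t(w, s) would then average q_t(w, s, a) over pi_t(a | s), and
# V^pi = E_mu[v_1(W_1, S_1)] completes the recursion at t = 1.
```

Solving one such system per $(s, a)$ pair, backward from $t = T$, is exactly the recursion that the NPIV estimators of Section 4 approximate nonparametrically.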

3. Key Assumptions and Identification Conditions

Standard proximal identification relies on several structural and statistical assumptions:

  • Markovianity: Transitions of the joint state $(U_t, S_t)$ are Markov.
  • Proxy Conditional Independences: The observed proxies $W_t$ and $Z_t$ are chosen such that, for each $t$,
    • $W_t \perp \{A_t, S_{t-1}, U_{t-1}\} \mid (S_t, U_t)$
    • $Z_t \perp \{R_t, W_t, S_{t+1}, U_{t+1}\} \mid (S_t, U_t, A_t)$
  • Completeness (Nonparametric IV Condition): For all $s, a$, if $\mathbb{E}_\mu[g(U_t) \mid Z_t, S_t=s, A_t=a] = 0$ almost surely, then $g \equiv 0$; an analogous condition holds for $W_t$.
  • Support (Overlap): $\mu_t^b(a \mid s) > 0$ whenever $\pi_t(a \mid s) > 0$ (Miao et al., 2022, Bennett et al., 2021).

These conditions collectively ensure that the relevant integral equations admit unique solutions for the bridge functions and that those solutions are empirically estimable.

For infinite-horizon settings with i.i.d. confounders, additional stationarity and ergodicity conditions are imposed, and proxies are used to define observable moment equations for the stationary density ratios (Bennett et al., 2020).
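For discrete $U_t$ and $Z_t$, the completeness condition has a simple linear-algebra reading that can be checked directly: no nonzero function of the confounder may be "invisible" to the proxy. A toy check, with made-up conditional probabilities:

```python
import numpy as np

# Discrete analogue of completeness w.r.t. Z_t at a fixed (s, a):
#   E[g(U_t) | Z_t, s, a] = 0 for all z forces g = 0
# which (assuming P(U_t = u) > 0 for each u) holds iff the matrix
#   K[z, u] = P(Z_t = z | U_t = u, s, a)
# has full column rank.

K = np.array([[0.7, 0.2],
              [0.2, 0.5],
              [0.1, 0.3]])   # 3 proxy levels, 2 latent levels (assumed values)

complete = np.linalg.matrix_rank(K) == K.shape[1]
print(complete)  # True: no nonzero g(U) is invisible to Z
```

The continuous case replaces the rank condition with completeness of the corresponding conditional expectation operator, which is what the nonparametric IV machinery requires.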

4. Estimation Algorithms and Computational Aspects

Estimation proceeds recursively, typically employing a fitted-$Q$-evaluation-type approach in the episodic POMDP case:

  1. Backward Recursion: For $t = T, \ldots, 1$, estimate $q_t(\cdot)$ via nonparametric instrumental variable (NPIV) regression by solving empirical analogues of the bridge equations, e.g.,

$$\mathbb{E}_n\left[ \bigl(Y_{t,i} - \hat{q}_t(W_{t,i}, S_{t,i}, A_{t,i})\bigr)\, f(Z_{t,i}, S_{t,i}, A_{t,i}) \right] \approx 0, \quad \forall f \in \mathcal{F}_t,$$

where $Y_{t,i} = R_{t,i} + \hat{v}_{t+1}(W_{t+1,i}, S_{t+1,i})$.

  2. Function Classes: $\mathcal{H}_t$ and $\mathcal{F}_t$ are chosen as RKHS balls (Gaussian/Sobolev kernels) for nonparametric flexibility; finite polynomial bases are suitable for small or discrete settings.
  3. Regularization: Tuning parameters $(\lambda, \mu, \delta, M)$ are selected via cross-validation or Lepskii's method.
  4. Computational Complexity: The bottleneck is often $O(n^3)$ per recursion step in naive kernel methods, mitigable with random features or Nyström approximations.

For infinite-horizon confounded MDPs, two-stage procedures are used: first, estimate stationary density ratios via proxy-based moment equations; then, employ optimal balancing via min–max estimation over weights and potential reward functions (Bennett et al., 2020).

A concrete pseudocode summary of the fitted-$Q$-evaluation NPIV procedure is provided in (Miao et al., 2022), and of the balancing-weight approach in (Bennett et al., 2020).
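As a simplified concrete instance of one backward-recursion step, the NPIV regression can be approximated over a linear sieve with two-stage least squares. In the sketch below, `y` plays the role of $R_t + \hat{v}_{t+1}$; the feature maps and the data-generating process are invented for illustration only.

```python
import numpy as np

def npiv_step(phi, psi, y, reg=1e-3):
    """One NPIV step over a linear sieve: find theta such that
    E_n[(y - phi @ theta) f(psi)] ≈ 0 for f in the span of psi's columns.
    phi = features of (W, S, A); psi = features of (Z, S, A)."""
    # Stage 1: project the outcome features onto the instrument features.
    B = np.linalg.solve(psi.T @ psi + reg * np.eye(psi.shape[1]), psi.T @ phi)
    phi_hat = psi @ B
    # Stage 2: least squares of y on the projected features.
    return np.linalg.solve(phi_hat.T @ phi_hat + reg * np.eye(phi.shape[1]),
                           phi_hat.T @ y)

rng = np.random.default_rng(1)
n = 50_000
U = rng.normal(size=n)                        # latent confounder
W = U + rng.normal(scale=0.5, size=n)         # reward-inducing proxy
Z = U + rng.normal(scale=0.5, size=n)         # action-inducing proxy (instrument)
y = 2.0 * U + rng.normal(scale=0.1, size=n)   # target, confounded w.r.t. W

phi = np.column_stack([W, np.ones(n)])
psi = np.column_stack([Z, np.ones(n)])
theta = npiv_step(phi, psi, y)
print(theta[0])  # close to the true slope 2.0; naive OLS on W shrinks toward 1.6
```

In practice $\phi$ and $\psi$ would be RKHS or random-feature maps rather than raw linear features, with the regularizers chosen by cross-validation as described above.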

5. Theoretical Guarantees and Error Analysis

Proximal OPE methods offer sharp, finite-sample statistical guarantees:

  • Finite-Sample Error Bounds (Episodic POMDP):

    $$|V^\pi - \hat{V}^\pi| \lesssim \bar{\tau}_1 \, \max_t \Bigl\|\tfrac{\pi_t}{\mu_t^b}\Bigr\|_\infty \prod_{t'=1}^T \kappa_{t'} \; T^{7/2} \sqrt{\ln(T/\zeta)} \; n^{-1/(2+1/\alpha)},$$

    where $\bar{\tau}_1$ and the $\kappa_{t'}$ are measures of local and transition ill-posedness, $\alpha$ characterizes the RKHS eigenvalue decay, and $n$ is the sample size (Miao et al., 2022).

  • Semiparametric Efficiency (Bridge-based PRL): Proximal RL estimators attain the semiparametric efficiency bound and are $\sqrt{n}$-regular and asymptotically normal under mild nuisance estimation rates (Bennett et al., 2021).
  • Consistency (Infinite Horizon): Under assumptions including ergodicity and completeness of the function classes, the policy value estimator is consistent with error of order $O_p(n^{-1/2})$ (Bennett et al., 2020).

These results highlight the robustness of the proximal approach to latent confounding and its optimality under appropriate regularity.

6. Practical Recommendations and Observed Performance

  • Proxy Construction: The utility of the approach depends critically on the availability of action- and reward-inducing proxies ($Z_t$, $W_t$), chosen based on negative-control principles or domain knowledge.
  • Kernel and Function Class Selection: RKHS-based methods with adaptive kernel width and radius are preferred for complex, high-dimensional data; finite-dimensional bases suffice in low dimensions.
  • Regularization and Cross-Validation: Regularization in both the NPIV and balancing-weight phases is essential for stability; cross-validation is commonly used for tuning.
  • Computational Tradeoffs: Proximal OPE is computationally more demanding than standard OPE in fully observed MDPs, requiring solution of sequential NPIV or min–max optimization problems; random features and advanced kernel approximations are often leveraged for scalability.
  • Empirical Findings: Simulation experiments and real-world case studies (e.g., sepsis management) demonstrate that proximal estimators substantially reduce bias relative to standard OPE, especially under moderate to severe unmeasured confounding. Performance is competitive with, and often superior to, existing methods under partial observability (Bennett et al., 2021, Bennett et al., 2020).

7. Connections and Distinctions within Off-Policy Evaluation

Proximal OPE diverges fundamentally from traditional importance sampling, direct modeling, and standard doubly robust methods by replacing direct observability or ignorability assumptions with proxy-based negative control and nonparametric completeness. It is distinct from "proximal policy optimization" (PPO) and related trust-region methods, which use the "proximal" term in the context of optimization regularization rather than identification or confounding adjustment. Proximal bridge-based approaches unify ideas from proximal causal inference, instrumental variable estimation, and kernel-based identification, and have inspired methodologies for both finite- and infinite-horizon RL in settings plagued by hidden confounders (Miao et al., 2022, Bennett et al., 2021, Bennett et al., 2020).
