Policy-Conditional Performance Bounds
- Policy-conditional performance bounds are analytical guarantees that measure the gap between an evaluated policy and the optimal benchmark under given uncertainties.
- They leverage structural methods, such as affine mappings and error propagation analyses, to yield computable, explicit bounds in domains like robust optimization and reinforcement learning.
- These bounds facilitate safe policy deployment and improvement by quantifying risk and performance shortfalls in the presence of model errors and partial information.
Policy-conditional performance bounds are analytical or empirical guarantees on the expected or worst-case performance of a specific decision policy, given either full or partial knowledge of the environment, the learning process, or the sources of uncertainty. Unlike minimax or worst-case bounds that apply to the optimal policy or to classes of policies, policy-conditional bounds characterize the performance shortfall, risk, or robustness of an explicitly identified policy or policy form. Such results are fundamental in robust optimization, reinforcement learning (RL), control, and statistical policy evaluation, guiding both principled deployment and safe policy improvement.
1. Foundational Concepts and Formal Definitions
Policy-conditional performance bounds quantify, for a chosen policy π (or a restricted policy class), the value function gap, regret, or cost shortfall relative to optimality, robustness, or specific uncertainty sets. The canonical structure is:
- Upper bound on suboptimality: $V^{\pi^*}(\rho) - V^{\pi}(\rho) \le \varepsilon(\pi)$, where $\varepsilon(\pi)$ collects the approximation, model, and statistical error terms attributable to the specific policy $\pi$.
- Lower bound on return/cost: $V^{\pi} \ge \widehat{V}^{\pi} - \varepsilon(\pi)$, certifying a floor on the deployed policy's performance.
These bounds may depend on:
- Policy representation: affine (linear in disturbance), non-stationary, randomized, memory-limited, or value-function greedy.
- Model discrepancy: deviation between true and estimated dynamics, reward uncertainty, or partial identification.
- Relaxation or approximation error: from finite-dimensional approximations, iterative solvers, or surrogate optimization problems.
- Probabilistic/statistical uncertainty: from finite rollouts or Bayesian posteriors (Vincent et al., 2024, Brown et al., 2017).
Policy-conditional bounds are central to practical safety, trustworthiness, and tractable policy learning across robust optimization (Housni et al., 2018), RL (Scherrer, 2013), adaptive control (Moreno-Mora et al., 2021), and causal inference (Ben-Michael, 13 Jun 2025).
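As a minimal concrete illustration (a toy example, not drawn from any of the cited papers), the policy-conditional value gap $V^{\pi^*} - V^{\pi}$ can be computed exactly in a small discounted MDP by solving the linear Bellman systems:

```python
import numpy as np

gamma = 0.9
# Toy 2-state, 2-action MDP: P[a] is the transition matrix, R[a] the reward vector.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.1, 0.9], [0.7, 0.3]])}
R = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 2.0])}

def policy_value(pi):
    """Exact V^pi for a deterministic policy pi, via (I - gamma P_pi)^{-1} r_pi."""
    P_pi = np.array([P[pi[s]][s] for s in range(2)])
    r_pi = np.array([R[pi[s]][s] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Optimal value by enumerating the four deterministic policies.
policies = [(a0, a1) for a0 in (0, 1) for a1 in (0, 1)]
v_star = max((policy_value(pi) for pi in policies), key=lambda v: v.sum())

pi_eval = (0, 0)                      # the explicitly identified policy under evaluation
gap = v_star - policy_value(pi_eval)  # statewise suboptimality V* - V^pi
print("statewise suboptimality gap:", gap)
```

The gap is nonnegative in every state because a discounted MDP admits a deterministic policy that is optimal simultaneously from all states.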
2. Structural Bounds in Robust Optimization and Adjustable Control
In two-stage adjustable robust optimization (ARO) with budgeted uncertainty, the key question is: how far is an explicitly constructed (e.g., affine) policy from the AR-optimal policy? For budgeted uncertainty sets, (Housni et al., 2018) establishes the seminal result:
- The cost of the best affine policy satisfies $z_{\mathrm{aff}} \le O\!\left(\frac{\log n}{\log\log n}\right) \cdot z_{\mathrm{AR}}$, where $z_{\mathrm{AR}}$ is the optimal value attainable with general (measurable) adjustable policies.
- This matches the known hardness of approximation for adjustable robust optimization, showing affine policies are optimal up to constants for budgeted sets.
The construction uses a partition into "linear" and "static" components based on explicit structural lemmas, so any given affine policy's cost has an interpretable upper bound in terms of covering costs and right-hand-side perturbation requirements. The policy-conditional nature is fundamental: given an explicit (restricted) affine decision rule, its worst-case robust cost is exactly computable and matches this structural bound (Housni et al., 2018).
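Concretely, for a fixed affine decision rule $y(h) = Yh + y_0$ (a hypothetical small instance, not the paper's construction), the worst-case second-stage cost over a budgeted box uncertainty set reduces to a single LP:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: affine recourse y(h) = Y @ h + y0, second-stage cost d^T y.
Y = np.array([[3.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y0 = np.array([1.0, 2.0])
d = np.array([1.0, 1.0])
k = 2.0  # budget: h in [0,1]^3 with sum(h) <= k

# Worst case of d^T (Y h + y0) over the budget polytope is linear in h.
c = Y.T @ d                       # coefficients on h: here [3, 1, 2]
res = linprog(-c,                 # linprog minimizes, so negate to maximize
              A_ub=np.ones((1, 3)), b_ub=[k],
              bounds=[(0.0, 1.0)] * 3)
worst_case_cost = d @ y0 - res.fun  # d@y0 = 3 plus LP optimum 5 -> 8.0
print("worst-case affine-policy cost:", worst_case_cost)
```

Because the uncertainty set is a polytope and the policy is affine, this worst-case evaluation is exact, which is what makes the structural upper bound directly checkable for any given affine rule.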
3. Policy-Conditional Bounds in Reinforcement Learning
3.1. Discounted MDPs, Value/Policy Iteration, and Approximate Greedy
When policy iteration, direct policy search, or approximate dynamic programming is used, various flavors of policy-conditional bounds quantify the loss incurred by approximate, non-stationary, or partially optimized policies:
- Direct Policy Iteration (DPI) and Conservative Policy Iteration (CPI):
- DPI: For $\epsilon$-greedy oracles, the value loss of the $k$-th DPI policy under an initial distribution $\mu$ is bounded by a term of order $C\epsilon/(1-\gamma)^2$, where $C$ is a concentrability constant. This upper-bounds the value loss of the specific $k$-th DPI policy, not just that of the optimal policy (Scherrer, 2013).
- CPI and NSDPI: Improved constants are possible by tracking optimal-policy occupancy or nonstationary plans.
- λ-Policy Iteration (λ-PI):
- For approximate λ-PI, the loss of the policy at iteration $k$ is bounded by an error-propagation term whose amplification depends on both $\gamma$ and $\lambda$, and the bound holds for any chosen initialization (0711.0694).
- Non-stationary Approximate Policy Iteration:
- Maintaining a window of $m$ successive policies in modified policy iteration, the loss of the resulting $m$-periodic non-stationary policy is bounded with the error amplification factor improved from $\frac{2\gamma}{(1-\gamma)^2}$ to $\frac{2\gamma}{(1-\gamma)(1-\gamma^m)}$ for large $m$, conditioned on the constructed periodic policy (Lesner et al., 2013).
3.2. Model-Based and Partial-Information Bounds
- Model-based RL with Factored Linear Models:
- For any policy derived from a compressed factored linear model, suboptimality can be bounded by the sum of a term depending on the model error at the relevant fixed points and a term depending on the compressed policy's own single approximation step (Pires et al., 2016).
- Continuous-state ADP/LinProg Methods:
- When approximate value- or Q-functions are available, the greedy (or iterated-greedy) policy satisfies a suboptimality bound proportional to the residual fit error of the approximate value function, and the bound holds for the specific computed policy (Beuchat et al., 2016).
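A classical instance of such a policy-conditional guarantee (the Bellman-residual bound in the style of Williams and Baird, used here as a stand-in for the papers' more refined statements): if $\varepsilon = \|T\widehat{V} - \widehat{V}\|_\infty$ is the Bellman residual of an approximate value function $\widehat{V}$, the policy $\pi$ greedy with respect to $\widehat{V}$ satisfies $\|V^* - V^{\pi}\|_\infty \le \frac{2\varepsilon}{1-\gamma}$. A numerical check on a random MDP:

```python
import numpy as np

gamma = 0.9
n_s, n_a = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))  # P[a, s] = next-state distribution
R = rng.uniform(0, 1, size=(n_a, n_s))            # R[a, s]

def bellman(v):
    """Optimal Bellman operator: (T v)(s) = max_a R[a, s] + gamma * P[a, s] @ v."""
    return np.max(R + gamma * P @ v, axis=0)

# Value iteration to (near-)optimal v*.
v = np.zeros(n_s)
for _ in range(2000):
    v = bellman(v)
v_star = v

# A perturbed approximate value function and its greedy policy.
v_hat = v_star + rng.uniform(-0.3, 0.3, size=n_s)
eps = np.max(np.abs(bellman(v_hat) - v_hat))   # Bellman residual ||T v_hat - v_hat||_inf
pi = np.argmax(R + gamma * P @ v_hat, axis=0)  # greedy policy w.r.t. v_hat

# Exact value of the specific greedy policy.
P_pi = P[pi, np.arange(n_s)]
r_pi = R[pi, np.arange(n_s)]
v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)

gap = np.max(v_star - v_pi)
bound = 2 * eps / (1 - gamma)
print(f"gap={gap:.4f} <= bound={bound:.4f}")
```

The bound is computable from $\widehat{V}$ alone, which is exactly the policy-conditional use case: it certifies the computed greedy policy without knowing $V^*$.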
4. Statistical and Probabilistic Policy-Conditional Guarantees
A growing area is high-confidence, policy-conditional performance bounds based on empirical rollouts or Bayesian inference:
- Distribution-Free Lower Bounds for Policy Evaluation:
- Given $n$ independent rollouts of policy $\pi$ in a fixed environment, the true performance CDF $F$ is bounded above with confidence $1-\delta$ by the band $\widehat{F}_n + \epsilon_{n,\delta}$; the band corresponds to a stochastically pessimistic performance distribution and therefore yields a distribution-free lower bound on performance.
- Here $\epsilon_{n,\delta}$ is computed via tight Kolmogorov-Smirnov quantiles (for continuous returns) or uniformly most accurate binomial intervals (for binary outcomes) (Vincent et al., 2024).
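Vincent et al. use tight KS quantiles; as a simplified illustration (a sketch, not their exact construction), the looser one-sided Dvoretzky–Kiefer–Wolfowitz constant already yields a distribution-free lower confidence bound on the expected return, given a known lower bound `a` on the return's support (an assumption of this sketch):

```python
import numpy as np

def mean_lower_bound(returns, delta, a):
    """Distribution-free (1 - delta)-confidence lower bound on E[return],
    assuming returns >= a almost surely. Uses the one-sided DKW band
    F(t) <= F_n(t) + eps with eps = sqrt(log(1/delta) / (2n)), integrating
    the pessimistic survival function over the sample's support."""
    x = np.sort(np.asarray(returns, dtype=float))
    n = len(x)
    eps = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    xs = np.concatenate([[a], x])  # segment endpoints, starting at the support bound
    # Pessimistic P(return > t) on segment [xs[i], xs[i+1]): 1 - i/n - eps, clipped at 0.
    surv = np.clip(1.0 - np.arange(n) / n - eps, 0.0, None)
    return a + np.sum(np.diff(xs) * surv)

rng = np.random.default_rng(0)
returns = rng.uniform(0.0, 1.0, size=500)   # hypothetical rollout returns
lb = mean_lower_bound(returns, delta=0.05, a=0.0)
print(f"empirical mean {returns.mean():.3f}, 95%-confidence lower bound {lb:.3f}")
```

The bound is always below the empirical mean (the band only subtracts mass), and it is valid for any return distribution supported on $[a, \infty)$.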
- Bayesian IRL α–Value-at-Risk Bounds:
- For inverse RL, using the Bayesian posterior over reward weights and fixing any evaluation policy $\pi$, one computes the $\alpha$-quantile (Value-at-Risk) of the policy loss over sampled posterior rewards. This is a policy-conditional bound on the worst-case policy loss at risk level $\alpha$, with high-confidence statistical validity (Brown et al., 2017).
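The quantile-of-posterior-losses computation can be sketched with a standard nonparametric order-statistic bound (a generic construction assumed here for illustration; the paper's exact procedure may differ in details):

```python
import numpy as np
from scipy.stats import binom

def var_upper_bound(losses, alpha, delta):
    """(1 - delta)-confidence upper bound on the alpha-quantile (VaR) of the
    policy-loss distribution, from i.i.d. posterior loss samples. Standard
    order-statistic bound: take x_(i) for the smallest i with
    P(Binomial(n, alpha) <= i - 1) >= 1 - delta."""
    x = np.sort(np.asarray(losses, dtype=float))
    n = len(x)
    cdf = binom.cdf(np.arange(n), n, alpha)   # cdf[j] = P(B <= j), i.e. i = j + 1
    j = np.searchsorted(cdf, 1.0 - delta)
    if j >= n:
        raise ValueError("not enough samples for this (alpha, delta)")
    return x[j]

rng = np.random.default_rng(1)
losses = rng.exponential(scale=1.0, size=2000)   # hypothetical posterior policy losses
vub = var_upper_bound(losses, alpha=0.95, delta=0.05)
print(f"high-confidence upper bound on the 0.95-VaR of policy loss: {vub:.3f}")
```

Fixing the evaluation policy and varying only the posterior reward sample is what makes the resulting statement policy-conditional.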
5. Specialized Policy-Conditional Bounds in Adaptive and Robust Control
- Adaptive Model Predictive Control (MPC):
- Under parametric uncertainty with an initial parameter estimate, the adaptive MPC closed-loop cost is bounded above by the true infinite-horizon optimal cost plus an additive term quantifying the extra cost incurred by the initial parameter error and the estimator gain. This quantifies suboptimality for any concrete adaptive control law (Moreno-Mora et al., 2021).
- Constrained Linear Min-Max Control:
- For any admissible policy, a universal quadratic lower bound of the form $x^{\top} P x$ is constructed, where $P$ is optimized over an unconstrained surrogate problem (solved via LMIs). This bound holds policy-by-policy and can directly evaluate or compare controllers' worst-case performance (Summers et al., 2013).
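The flavor of such a bound can be reproduced in the simpler deterministic LQR setting (a simplification assumed for this sketch; the paper treats constrained min-max control): the unconstrained Riccati value $x_0^{\top} P x_0$ is a quadratic lower bound on the cost of every admissible, in particular every constrained or suboptimal, policy:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# Toy system and stage costs (hypothetical instance).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# Optimal unconstrained value x' P x from the discrete algebraic Riccati equation.
P = solve_discrete_are(A, B, Q, R)

# An arbitrary admissible linear policy u = -K2 x: here the LQR gain for a more
# input-averse weighting, so it is stabilizing but suboptimal for (Q, R).
P2 = solve_discrete_are(A, B, Q, 4.0 * R)
K2 = np.linalg.solve(4.0 * R + B.T @ P2 @ B, B.T @ P2 @ A)
A_cl = A - B @ K2

# Exact cost of this policy: x0' S x0, with S = A_cl' S A_cl + Q + K2' R K2.
S = solve_discrete_lyapunov(A_cl.T, Q + K2.T @ R @ K2)

x0 = np.array([1.0, 0.5])
lower, cost = x0 @ P @ x0, x0 @ S @ x0
print(f"universal lower bound {lower:.3f} <= policy cost {cost:.3f}")
```

The comparison is policy-by-policy: the same $P$ lower-bounds the exact cost $x_0^{\top} S x_0$ of any stabilizing gain one plugs in.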
6. Advanced Policy-Conditional Bounds: Multitask, Partial ID, and Beyond
6.1. Multitask LQR
- In multitask LQR ($N$ distinct LQR problems sharing a common policy), the suboptimality of the policy-gradient solution on each task can be bounded in terms of bisimulation-inspired, closed-loop gradient discrepancies, with constants depending on the steady-state covariance and control cost and on the bisimulation-based gradient difference norm. These bounds are much tighter than classical open-loop measures and are policy-conditional for each policy generated by gradient descent (Stamouli et al., 23 Sep 2025).
6.2. Partial Identification and Conditional Linear Programs
- In statistical settings where the policy value is only partially identified, bounds are given by the optimal values of conditional linear programs (CLPs) that optimize the expected policy-dependent per-unit utility over all distributions feasible under the conditional linear constraints; both plug-in and entropic-regularized estimators are available with valid Wald confidence intervals (Ben-Michael, 13 Jun 2025). These policy-conditional estimates support robust policy selection and minimax regret learning in non-identifiable systems.
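A minimal sketch of the idea (a hypothetical instance with plain `scipy` LPs, not the paper's estimators): lower and upper bounds on the policy value are two linear programs over all distributions consistent with the observed linear constraints:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical partial-identification instance: an unknown distribution p over
# 4 latent cells, a policy-dependent per-unit utility u, and linear constraints
# from observed marginals (A_eq p = b_eq) that do not point-identify p.
u = np.array([1.0, 0.0, 0.5, 0.2])
A_eq = np.array([[1.0, 1.0, 1.0, 1.0],   # p sums to 1
                 [1.0, 1.0, 0.0, 0.0]])  # observed marginal of the first two cells
b_eq = np.array([1.0, 0.4])

def value_bounds(u, A_eq, b_eq):
    """[lower, upper] bounds on E_p[u] over all distributions p consistent
    with the linear constraints: two LPs sharing one feasible set."""
    lo = linprog(u, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(u))
    hi = linprog(-u, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * len(u))
    return lo.fun, -hi.fun

lo, hi = value_bounds(u, A_eq, b_eq)
print(f"policy value is partially identified in [{lo:.3f}, {hi:.3f}]")
```

Because the utility vector `u` depends on the evaluated policy, the resulting interval is a policy-conditional identification region rather than a global worst case.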
7. Policy-Conditional Bounds for POMDPs and Nonlinear Criteria
- Sliding Window and Myopic Bounds in POMDPs:
- For near-optimal policies of the form "finite sliding window," the suboptimality is explicitly bounded in terms of filter forgetfulness:
- In expectation, with the error measured in the Wasserstein metric;
- Uniformly, in total variation under strong mixing;
with all dependence governed by the explicit filter stability rate (Demirci et al., 2024).
- Myopic Policy Bounds for Information Gathering:
- Structural monotonicity enables efficient computation of myopic lower and upper policy bounds. For any belief, the myopic value lower-bounds the optimal value, and applying the bounds recursively yields explicit statewise lower and upper bounds for optimal policy performance, as well as corresponding guarantees for policies that commit to these (myopic or interval) decision rules (Lauri et al., 2016).
References Table
| Setting/Domain | Policy-Conditional Bound Type | Reference |
|---|---|---|
| Two-stage robust optimization (budgeted uncertainty) | Affine policy approximation ratio | (Housni et al., 2018) |
| Direct/Conservative Policy Iteration | DPI/CPI/NSDPI value loss | (Scherrer, 2013) |
| Approximate λ-Policy Iteration | Error propagation in λ-PI | (0711.0694) |
| Non-stationary Modified Policy Iteration | Windowed policy suboptimality | (Lesner et al., 2013) |
| Factored linear model RL | Model-based policy loss | (Pires et al., 2016) |
| Continuous-state ADP/LinProg | Greedy/iterated policy suboptimality | (Beuchat et al., 2016) |
| Rollout-based guarantees (BC, RL) | Distributional high-confidence bound | (Vincent et al., 2024) |
| Bayesian IRL | α-VaR posterior policy loss | (Brown et al., 2017) |
| Adaptive MPC | Estimation- and error-dependent bound | (Moreno-Mora et al., 2021) |
| Constrained min-max control | Surrogate LMI lower bound | (Summers et al., 2013) |
| Multitask LQR | Bisimulation-based gradient gap | (Stamouli et al., 23 Sep 2025) |
| Partial identification CLPs | Empirical/worst-case value bound | (Ben-Michael, 13 Jun 2025) |
| POMDP sliding window | Filter-stability error bound | (Demirci et al., 2024) |
| Myopic/structural bounds for information POMDPs | Statewise/action interval bounds | (Lauri et al., 2016) |
Summary
Policy-conditional performance bounds form a unifying principle across optimization, control, and learning. They quantify the performance degradation, safety margin, or robustness for any explicitly constructed, simulated, or learned policy under model error, partial identification, or empirical uncertainty. The state-of-the-art spans structural bounds exploiting affine policies, explicit model error propagation in ADP and RL, empirical/stochastic band guarantees for black-box policies, and advanced statistical inference in the presence of incomplete data. These results promote both tractability and interpretability, and are central to the design and safe deployment of robust policy-driven systems.