Robust Policy Iteration
- Robust Policy Iteration is a method that computes optimal policies under worst-case model perturbations and adversarial disturbances.
- It leverages robust Bellman operators and minimax evaluations, using efficient inner optimizations like homotopy algorithms.
- The approach extends to deep RL, nonlinear systems, and constrained MDPs, ensuring certified convergence and high disturbance rejection.
A robust policy iteration algorithm is a family of methods designed to compute policies that are optimal under worst-case model uncertainty or adversarial disturbances. In contrast to standard policy iteration, which seeks optimality under a fixed model, robust variants integrate the effects of model perturbations, adversarial dynamics, risk measures, or constraints directly into the evaluation and improvement steps. Robust policy iteration serves as a foundation for robust reinforcement learning in uncertain MDPs, robust optimal control in the H-infinity and game-theoretic settings, and high-assurance RL in real-world environments.
1. Mathematical Models and Problem Formulations
Robust policy iteration is developed for several distinct but related mathematical settings:
Robust Markov Decision Processes (RMDPs)
In RMDPs, the goal is to maximize the expected return under the worst-case choice of model parameters within some uncertainty set. The standard form introduces an uncertainty set P_{s,a} for each transition kernel, typically a ball around a nominal kernel, leading to (s,a)-rectangular or s-rectangular ambiguity models. The robust Bellman operator for a policy π is
(T^π V)(s) = r(s, π(s)) + γ min_{p ∈ P_{s,π(s)}} pᵀ V,
where p is a transition vector in the ambiguity set (Asadi et al., 30 Jan 2026).
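As an illustration of this operator (not drawn from the cited paper), the worst-case backup can be sketched for the simplified case where each ambiguity set P_{s,a} is an explicit finite list of candidate transition vectors rather than a ball; all function and variable names are illustrative:

```python
import numpy as np

def robust_backup(V, policy, r, kernels, gamma=0.9):
    """One application of the robust Bellman operator for a fixed policy.

    kernels[s][a] is a list of candidate transition vectors (the ambiguity
    set P_{s,a}); the adversary picks the kernel minimizing the backup.
    """
    n = len(V)
    V_new = np.empty(n)
    for s in range(n):
        a = policy[s]
        worst = min(p @ V for p in kernels[s][a])  # adversarial inner min
        V_new[s] = r[s][a] + gamma * worst
    return V_new
```

Iterating this map to its fixed point gives the robust value of the policy; ball-shaped ambiguity sets replace the finite `min` with a continuous inner optimization.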
Differential Games and H-infinity Control
Robust policy iteration is also framed as solving zero-sum differential games, where the adversary selects disturbances to maximize cost. This setting yields the Hamilton-Jacobi-Isaacs PDE. For linear systems, the robust optimal control policy is given by a game-theoretic algebraic Riccati equation (GARE) in both continuous and discrete time (Pang et al., 2020, Sun et al., 2024).
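The nominal core of these Riccati-based schemes is classical policy iteration on the Riccati equation. A minimal discrete-time sketch (Hewer-style iteration for the nominal LQR problem, without the adversarial channel; names and constants are illustrative, and the Lyapunov solve uses naive fixed-point iteration for simplicity):

```python
import numpy as np

def dlyap(Acl, W, iters=2000):
    # Solve P = W + Aclᵀ P Acl by fixed-point iteration
    # (converges when Acl is Schur stable).
    P = np.zeros_like(W)
    for _ in range(iters):
        P = W + Acl.T @ P @ Acl
    return P

def lqr_policy_iteration(A, B, Q, R, K0, n_iter=30):
    """Alternate policy evaluation (Lyapunov solve for the closed loop
    A - B K) and policy improvement K ← (R + Bᵀ P B)⁻¹ Bᵀ P A.
    K0 must be stabilizing."""
    K = K0
    for _ in range(n_iter):
        Acl = A - B @ K
        P = dlyap(Acl, Q + K.T @ R @ K)                     # evaluation
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # improvement
    return P, K
```

At convergence P solves the discrete algebraic Riccati equation; the robust/game-theoretic variants replace this step with a GARE that also carries the disturbance channel.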
Constrained and Distributionally Robust Scenarios
Robust policy iteration is also extended to robust constrained MDPs, where the agent maximizes return subject to robust (worst-case) constraints (Ganguly et al., 25 May 2025), and to KL-divergence–based distributional robustness, which seeks policies that hedge against deviations from a nominal agent policy (Smirnova et al., 2019).
2. Robust Bellman Operators and Policy Evaluation
The central modification in robust policy iteration resides in policy evaluation, which becomes a minimax or pessimistic fixed-point problem:
- RMDPs: The robust evaluation operator is
(T^π V)(s) = r(s, π(s)) + γ min_{p ∈ P_{s,π(s)}} pᵀ V.
For such ambiguity sets, each inner minimization can be computed efficiently via a homotopy algorithm (Asadi et al., 30 Jan 2026).
- H-infinity and Differential Games: Policy evaluation reduces to solving a robust Lyapunov or Riccati equation; e.g., for LQR with gain K, the closed-loop cost matrix P_K satisfies
(A − BK)ᵀ P_K + P_K (A − BK) + Q + Kᵀ R K = 0
under additive or multiplicative perturbations. For continuous-time settings, robust PI alternates Riccati solvers for nominal and adversarial gains, with the convergence region characterized by local input-to-state stability (Pang et al., 2020, Song et al., 2024).
- Options Framework: For robust temporally abstract actions, policy evaluation uses robust value or Q-functions under option-induced transition uncertainties (Mankowitz et al., 2018).
- Distributionally Robust PI: Evaluation is replaced with an adversarial (KL-ball constrained) operator of the form
(T_adv V)(s) = min_{π′ : KL(π′(·|s) ‖ π(·|s)) ≤ ε} E_{a∼π′} [ r(s,a) + γ E[V(s′)] ],
with efficient closed-form solutions via duality (Smirnova et al., 2019).
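To make the inner minimization concrete: for an L1-ball ambiguity set around a nominal kernel, the worst case can be computed by sorting, shifting probability mass from high-value states onto the lowest-value state. This is a simplified sketch in the spirit of the sorting/homotopy routines cited above, not a reproduction of any one of them:

```python
import numpy as np

def l1_worst_case(p_nom, V, kappa):
    """min_p pᵀV  s.t.  ||p − p_nom||₁ ≤ kappa, p ≥ 0, Σp = 1.

    Moves up to kappa/2 probability mass onto the lowest-value state,
    taking it from the highest-value states first (each unit of mass
    moved consumes 2 units of L1 budget).
    """
    p = p_nom.astype(float).copy()
    i_min = int(np.argmin(V))
    budget = min(kappa / 2.0, 1.0 - p[i_min])  # relocatable mass
    p[i_min] += budget
    for i in np.argsort(V)[::-1]:              # donors: highest V first
        if i == i_min:
            continue
        take = min(p[i], budget)
        p[i] -= take
        budget -= take
        if budget <= 0:
            break
    return p @ V
```

The sort dominates the cost, giving the quasi-linear per-backup complexity mentioned in the complexity results below.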
3. Policy Improvement Mechanisms
Robust policy improvement adapts the classic greedy improvement to the minimax or robust setting:
- For robust MDPs, the greedy step applies the robust (max-min) backup:
π^{t+1}(s) ∈ argmax_a [ r(s,a) + γ min_{p ∈ P_{s,a}} pᵀ V^{π^t} ].
In L_p-robust s-rectangular models, the greedy policy has a threshold property: it selects the "top-k" actions whose Q-advantage exceeds a particular threshold, with probabilities weighted polynomially in the advantage (Kumar et al., 2022).
- In robust options, the “inter-option” policy is improved with respect to the robust Q-function, possibly using robust policy gradients with discounted occupancy corrections (Mankowitz et al., 2018).
- In constrained robust MDPs, the improvement step is replaced by a mirror descent on the most violated constraint (or objective), producing algorithms with provable iteration-complexity guarantees (Ganguly et al., 25 May 2025).
- For Markov games and robust zero-sum settings, robust improvement seeks a saddle-point pair using joint greedy backups (max-min operators), usually realized via mixed strategies and equilibrium solvers (Badger et al., 8 Aug 2025).
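A minimal sketch of the robust greedy (max-min) improvement step, again assuming finite candidate-kernel ambiguity sets and illustrative names:

```python
import numpy as np

def robust_greedy(V, r, kernels, gamma=0.9):
    """Robust policy improvement: for each state, pick the action whose
    worst-case backup over the ambiguity set is largest (a max-min step).
    kernels[s][a] is a finite list of candidate transition vectors."""
    policy = []
    for s in range(len(V)):
        best_a, best_q = None, -np.inf
        for a, cand in enumerate(kernels[s]):
            q = r[s][a] + gamma * min(p @ V for p in cand)
            if q > best_q:
                best_a, best_q = a, q
        policy.append(best_a)
    return policy
```

Note the asymmetry: the agent maximizes over actions while the adversary minimizes over kernels inside each Q-value, which is exactly the backup in the pseudocode of Section 4.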
4. Algorithmic Structures and Pseudocode
Robust policy iteration algorithms share an outer loop alternating robust evaluation and robust improvement. A prototypical structure for rectangular RMDPs is summarized below:
Initialize agent policy π⁰ arbitrarily
repeat
// Robust policy evaluation
For each s: compute V^{π^t}(s) via inner minimization over p ∈ P_{s,π^t(s)}
// (often using an efficient homotopy or potential-based algorithm)
// Robust policy improvement
For each s: π^{t+1}(s) ← argmax_a [r(s,a) + γ min_{p∈P_{s,a}} pᵀ V^t]
until π^{t+1} = π^t
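The loop above can be sketched end to end for a toy tabular RMDP whose ambiguity sets are finite lists of candidate kernels (an illustrative simplification; the cited methods use ball-shaped sets with specialized inner solvers):

```python
import numpy as np

def robust_policy_iteration(r, kernels, gamma=0.9, eval_iters=500):
    """Tabular robust PI: alternate robust evaluation (fixed point of the
    worst-case backup) and robust greedy improvement until the policy
    stops changing. kernels[s][a] lists candidate transition vectors."""
    n = len(r)
    policy = [0] * n
    while True:
        # Robust policy evaluation: iterate the robust Bellman operator.
        V = np.zeros(n)
        for _ in range(eval_iters):
            V = np.array([
                r[s][policy[s]]
                + gamma * min(p @ V for p in kernels[s][policy[s]])
                for s in range(n)
            ])
        # Robust policy improvement: greedy max-min backup.
        new_policy = [
            max(range(len(kernels[s])),
                key=lambda a: r[s][a]
                + gamma * min(p @ V for p in kernels[s][a]))
            for s in range(n)
        ]
        if new_policy == policy:
            return policy, V
        policy = new_policy
```

Here the adversary's effect is visible in the improvement step: an action with higher nominal reward can lose to a safer one once the worst kernel in its ambiguity set is accounted for.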
Variants include partial robust evaluated backups (Ho et al., 2020), mirror-descent improvements in constrained settings (Ganguly et al., 25 May 2025), and FBSDE-based evaluation in stochastic continuous control (Wang et al., 2022).
5. Convergence, Stability, and Complexity Theory
Modern robust policy iteration analysis addresses both exact and inexact updates:
- Strongly Polynomial Complexity: For rectangular robust MDPs, robust policy iteration achieves strongly polynomial time complexity when the discount factor γ is fixed, matching Ye's bound for standard MDPs. This is enabled by an efficient homotopy solver for inner minimizations and a potential-function analysis bounding the number of possible improvement steps (Asadi et al., 30 Jan 2026).
- Input-to-State Stability (ISS): For LQR and Riccati-based robust PI, robust stability is shown in the ISS sense: bounded per-iteration errors induce only a bounded distance to the optimal solution. As the error vanishes, convergence to the optimal policy is restored (Song et al., 2024, Pang et al., 2020).
- Robust Convergence Under Approximate Evaluation: In settings using function approximation, stochastic approximation, or finite samples, convergence is guaranteed to a neighborhood of the robust optimum, with explicit bounds on performance gap proportional to noise or estimation error (Panaganti et al., 2020, Wang et al., 2022, Song et al., 2024, Pang et al., 2020).
- Empirical Rate and Complexity: Partial Policy Iteration (PPI) for L1-robust MDPs demonstrates that robust PI converges linearly in the number of Bellman operator evaluations, with each evaluation admitting quasi-linear complexity in the number of states (Ho et al., 2020).
- Distributionally Robust Variants: DRPI algorithms provide a provable finite-sample lower bound at each iteration and ensure convergence as the uncertainty ball decays with the sample count, balancing safety and exploitation (Smirnova et al., 2019).
6. Extensions: Deep RL, Nonlinear Systems, and Options
Robust policy iteration has been generalized in several significant directions:
- Deep Neural Architectures: Robust options policy iteration (ROPI) and robust option-DQN (RO-DQN) extend the framework to deep value function and option-head architectures, incorporating robust targets within end-to-end deep frameworks (Mankowitz et al., 2018).
- Nonlinear and Unknown Systems: For continuous-time nonlinear systems and unknown plant dynamics, robust PI adopts gradient-based or sample-based FBSDE solvers, incremental (RLS-based) identification, and adaptively learned linearizations to maintain performance guarantees (Meng et al., 29 Aug 2025, Wang et al., 2022, Li et al., 2020, Sun et al., 2024).
- Policy Iteration under Recursive Feasibility Constraints: In undiscounted nonlinear settings, modified PI schemes regularize the improvement/evaluation steps to ensure recursive feasibility and robust stability for general state attractors, with explicit Lyapunov constructions (Granzotto et al., 2022).
- Zero-Sum Games and Saddle-Point PI: In Markov games and robust zero-sum settings, modern algorithms such as RCPI ensure convergence by tracking the Bellman residual and falling back to robust value iteration when needed, outperforming previous PI-based methods (Badger et al., 8 Aug 2025).
7. Empirical Outcomes and Practical Implementation
Empirical validations across diverse domains confirm the merits of robust policy iteration:
- Performance under Model Misspecification: Robust PI methods maintain near-optimal performance under significant model error or parameter uncertainty in standard benchmarks (CartPole, Acrobot, LQR), while non-robust methods often fail catastrophically (Mankowitz et al., 2018, Song et al., 2024).
- Scalability: Homotopy and sorting-based robust Bellman operators enable robust PI to scale to MDPs with thousands of states in computation times close to those of non-robust counterparts (Ho et al., 2020).
- Deep Learning and Generalization: For deep RL, explicit robust PI enhances generalization over a broader range of dynamics compared to non-robust deep architectures (Mankowitz et al., 2018).
- Robustness to Disturbances: In nonlinear and stochastic control tasks, robust PI provides high disturbance rejection and maintains stability even under large perturbations or model mismatch (Li et al., 2020, Sun et al., 2024, Meng et al., 29 Aug 2025).
- Safety, Constraints, and High Assurance: For robust CMDPs, recent algorithms attain strict feasibility with provable iteration complexity, outperforming approaches based on primal-dual or epigraph search (Ganguly et al., 25 May 2025).
These developments position robust policy iteration as a core algorithmic primitive in robust dynamic programming, optimal control, and high-reliability reinforcement learning, with rigorous guarantees on convergence, robustness, and computational efficiency across a broad spectrum of uncertainty and adversarial models.