Offline Model Guidance in Adversarial Markov Games
- State-action-conditional offline model guidance is a set of algorithmic principles that optimize robust policy learning using historical model data in adversarial, nonstationary environments.
- It employs methods like hyperpolicy mirror descent and optimistic policy evaluation to navigate uncertainty and adversarial perturbations in Markov games.
- The framework achieves sublinear policy regret by leveraging offline evaluations, controlling function-approximation complexity, and ensuring approximate equilibrium in multiagent interactions.
State-action-conditional offline model guidance is a set of algorithmic principles and methodologies for learning robust, low-regret policies in nonstationary, adversarially driven Markov games. This setting is characterized by uncertainty or adversarial perturbations in the transition dynamics and/or observation model, such that the learner faces an adaptive or adversarial opponent whose strategies (policies) may shift arbitrarily across episodes. Cutting-edge algorithms in this domain guide the optimization of state-action-conditional value functions or distributions over policy spaces (“hyperpolicies”) using information from offline or previously revealed models of the environment and opponent. This paradigm incorporates and advances ideas from robust reinforcement learning, mirror descent, optimism-driven exploration, and policy-regret minimization.
1. Formalism: State-Adversarial Markov Games
State-Action-Conditional Offline Model Guidance has been studied primarily under the formalism of State-Adversarial Markov Games (SAMGs), incorporating the following elements:
- State space $\mathcal{S}$: finite or infinite, determining the possible configurations of the environment.
- Action spaces: the player's action space $\mathcal{A}$ and the adversary's action space $\mathcal{B}$.
- Policies: At each episode $t$ and step $h$, the learner commits to a (possibly nonstationary) policy $\mu^t$, while the adversary independently selects a policy $\nu^t$.
- Transition kernel: State transitions follow $P_h(s' \mid s, a, b)$, which is typically unknown to the learner.
- Reward: The learner receives reward $r_h(s, a, b)$ after joint action $(a, b)$ in state $s$.
- Regret metric: Evaluation is carried out against the best fixed policy in hindsight, via cumulative regret:
$\mathrm{Regret}(K) = \max_{\mu\in\Pi}\;\sum_{t=1}^K \big[V^{\mu\times\nu^t}_1(s_1) - V^{\mu^t\times\nu^t}_1(s_1)\big].$
This setting accounts for (i) adversaries whose policy for each episode is unknown to the learner at play time but may be revealed afterward (“policy-revealing”), and (ii) adaptive adversaries whose policy can vary in response to the learner's previous policies or actions (Zhan et al., 2022, Nguyen-Tang et al., 2024).
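As a concrete toy instance of this protocol, the sketch below simulates a small finite SAMG and tracks the cumulative-regret quantity defined above. All sizes, kernels, and the comparator class below are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

# Toy SAMG: S states, A learner actions, B adversary actions, horizon H.
rng = np.random.default_rng(0)
S, A, B, H, K = 2, 2, 2, 3, 50

# P[h, s, a, b] is a distribution over next states; r[h, s, a, b] a reward.
P = rng.dirichlet(np.ones(S), size=(H, S, A, B))
r = rng.uniform(size=(H, S, A, B))

def value(mu, nu, s1=0):
    """V_1^{mu x nu}(s1) by backward induction, for deterministic
    step-dependent policies mu[h, s] and nu[h, s]."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        Vn = np.empty(S)
        for s in range(S):
            a, b = mu[h, s], nu[h, s]
            Vn[s] = r[h, s, a, b] + P[h, s, a, b] @ V
        V = Vn
    return V[s1]

Pi = [np.full((H, S), a) for a in range(A)]   # tiny comparator class
gaps = np.zeros(len(Pi))
for t in range(K):
    nu_t = rng.integers(B, size=(H, S))       # adversary's episode-t policy
    mu_t = Pi[rng.integers(len(Pi))]          # a naive learner's choice
    v_t = value(mu_t, nu_t)
    gaps += [value(mu, nu_t) - v_t for mu in Pi]

# Regret(K) = max_mu sum_t [V^{mu x nu^t}_1 - V^{mu^t x nu^t}_1]
regret = gaps.max()
```

Note that the maximum is taken over the *summed* gaps, not per episode: the comparator is a single fixed policy across all $K$ episodes.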
2. Hyperpolicy Mirror Descent and Optimistic Policy Evaluation
Modern algorithms maintain a distribution over policies, referred to as a "hyperpolicy" and denoted $p^t$, and update this distribution iteratively via mirror descent guided by optimistic estimates of value. At the beginning of episode $t$:
- The learner samples a base policy $\mu^t$ from $p^t$.
- The selected policy is executed against the (hidden) adversary strategy $\nu^t$.
- After the episode, the adversary's policy $\nu^t$ is revealed, and an optimistic value estimate $\widehat{V}^{\mu\times\nu^t}$ is computed for each $\mu \in \Pi$.
- The hyperpolicy is updated according to the mirror descent rule, typically instantiated as exponential weights:
$p^{t+1}(\mu) \;\propto\; p^t(\mu)\,\exp\big(\eta\,\widehat{V}^{\mu\times\nu^t}\big).$
Optimistic least-squares policy evaluation (OptLSPE) constructs a confidence set around the value-function estimates for each policy, ensuring that $\widehat{V}^{\mu\times\nu^t} \geq V^{\mu\times\nu^t}_1(s_1)$ with high probability, thereby facilitating optimism-driven policy search (Zhan et al., 2022).
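The update loop above can be sketched for a finite policy class as follows. The optimistic estimates here are stand-ins (hidden true values plus a nonnegative bonus), not the OptLSPE procedure itself, and all constants are assumptions.

```python
import numpy as np

# Exponential-weights instantiation of the hyperpolicy mirror-descent
# step: p^{t+1}(mu) is proportional to p^t(mu) * exp(eta * Vhat^{mu x nu^t}).
rng = np.random.default_rng(1)
n_policies, K, eta = 4, 200, 0.1

logp = np.zeros(n_policies)            # log-weights of hyperpolicy p^t
true_v = rng.uniform(size=n_policies)  # hidden per-policy values (stand-in)

for t in range(K):
    p = np.exp(logp - logp.max()); p /= p.sum()
    mu_t = rng.choice(n_policies, p=p)   # sample base policy from p^t
    # ... execute mu_t against the hidden nu^t (elided) ...
    # After nu^t is revealed, form optimistic estimates for every mu in Pi:
    v_hat = true_v + rng.uniform(0.0, 0.05, size=n_policies)  # bonus >= 0
    logp += eta * v_hat                  # exponential-weights update

p = np.exp(logp - logp.max()); p /= p.sum()
```

Working in log-space and subtracting the maximum before exponentiating keeps the update numerically stable as the weights accumulate.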
3. Regret Guarantees and Complexity Measures
A central theoretical achievement is the establishment of sublinear regret bounds in adversarially nonstationary environments. Under appropriate realizability and completeness conditions for the function class $\mathcal{F}$ and its Bellman images $\mathcal{G}$, alongside control via the Bellman Evaluation Eluder (BEE) dimension $\dim_{\rm BEE}(\mathcal{F}, \Pi, \Pi')$, the following bound holds with high probability:
$\mathrm{Regret}(K) \leq \tilde{O}\big(H \sqrt{K \, \dim_{\rm BEE}(\mathcal F, \Pi, \Pi')\, \ln(N|\Pi|/\delta)}\big).$
This result holds for both stationary and policy-revealing adversaries, and in the adaptive adversary regime at the cost of an additional multiplicative factor (Zhan et al., 2022). Significantly, achieving sublinear policy regret in the presence of adaptive adversaries generically requires bounding the adversary's memory and imposing "consistency" constraints, namely that the adversary responds similarly to similar learner strategies (Nguyen-Tang et al., 2024). If the adversary has unbounded memory or is fully nonstationary, no sublinear policy regret is possible.
4. Function Approximation and Bellman Evaluation Eluder Dimension
Algorithmic feasibility and convergence are tightly linked to the statistical complexities of the function classes involved. The main assumptions are:
- Realizability: For every joint policy, the true $Q$-function lies in the function class $\mathcal{F}$.
- Generalized completeness: The Bellman operator images of $\mathcal{F}$ are contained in the auxiliary class $\mathcal{G}$ for all steps $h$ and policies $\mu \times \nu$.
The Bellman Evaluation Eluder (BEE) dimension measures how quickly least-squares errors in Bellman residuals concentrate and thus controls the regret scaling. Covering numbers of $\mathcal{F}$ and $\mathcal{G}$ further parameterize the statistical rates via the confidence bounds employed in policy evaluation.
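The independence notion underlying eluder-style dimensions can be made concrete on a toy finite class. The sketch below greedily builds an $\epsilon$-independent sequence over raw function values; the BEE dimension applies the same idea to Bellman residuals. The class, domain, and threshold here are illustrative assumptions.

```python
import itertools
import numpy as np

# A point x is eps-independent of a sequence if two functions in the class
# nearly agree (in L2) on the sequence yet differ by more than eps at x.
# The length of the longest such sequence is the (eluder-style) dimension.
F = np.array([[0., 0., 0.],
              [1., 0., 0.],
              [1., 1., 0.],
              [1., 1., 1.]])   # 4 functions on a 3-point domain
eps = 0.5

def independent(x, seq):
    for f, g in itertools.combinations(F, 2):
        past = np.sqrt(sum((f[j] - g[j]) ** 2 for j in seq))
        if past <= eps and abs(f[x] - g[x]) > eps:
            return True
    return False

seq = []
for x in range(F.shape[1]):      # greedy pass over the domain
    if independent(x, seq):
        seq.append(x)
# Here every domain point is independent of its predecessors, so len(seq) == 3.
```

Intuitively, each independent point represents a "surprise": past observations could not have pinned down the function's value there, which is exactly what drives the $\sqrt{\dim_{\rm BEE}}$ term in the regret bound.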
5. Relation to Robustness, Policy Regret, and Offline Estimation
State-Action-Conditional Offline Model Guidance is closely connected to several robustness and regret notions:
- Policy Regret (as opposed to classical external regret): In adaptive adversarial settings, learner performance is measured relative to the best fixed policy in hindsight, accounting for how the adversary would have responded to that hypothetical choice. Efficient algorithms for policy regret, such as OPO-OMLE (for 1-memory consistent adversaries) and APE-OVE (for general memory-bounded consistent adversaries), attain sublinear policy regret, subject to statistical complexity and coverage assumptions (Nguyen-Tang et al., 2024).
- Robust Agent Policies: In state-adversarial Markov games with perturbations of state observations, classical pointwise-optimal policies or robust Nash equilibria may not exist. Instead, maximizing the worst-case expected value (averaged over the initial-state distribution) restores well-posedness and the existence of a maximin policy (Han et al., 2022).
- Offline Model Guidance: DORIS and related approaches leverage past experience, empirical value-function estimates, and confidence sets to drive the exploration-exploitation trade-off, validating policy updates via offline evaluations and worst-case model uncertainty (optimistic planning analogous to model-based RL with uncertainty sets) (Zhan et al., 2022, Wei et al., 2017).
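The gap between external and policy regret is visible even in a toy repeated game against a 1-memory adversary that copies the learner's last action. The payoff matrix (rewarding consistency) is an illustrative assumption chosen to make the distinction sharp.

```python
import numpy as np

# payoff[a, b] = 1 iff a == b; since the adversary plays b = previous a,
# the learner is rewarded for repeating its own last action.
payoff = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
K = 100
rng = np.random.default_rng(2)

def adversary(prev_a):
    return prev_a                  # 1-memory, "consistent" responder

prev, total, realized_b = 0, 0.0, []
for a in rng.integers(2, size=K):  # a non-committal (random) learner
    b = adversary(prev)
    realized_b.append(b)
    total += payoff[a, b]
    prev = a

# External regret: best fixed action against the *realized* b-sequence.
external = max(sum(payoff[a, b] for b in realized_b) for a in (0, 1)) - total

def committed_value(a):
    """Return of playing a forever, with the adversary reacting to it."""
    prev, v = 0, 0.0
    for _ in range(K):
        v += payoff[a, adversary(prev)]
        prev = a
    return v

# Policy regret: best fixed action when the adversary *responds to it*.
policy_regret = max(committed_value(a) for a in (0, 1)) - total
```

A committed learner earns nearly the maximum return here, so the policy regret of the random learner is large even though its external regret, computed against the realized adversary actions, stays small. This is the sense in which external regret understates the cost of ignoring the adversary's reactivity.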
6. Multiagent Interactions and Equilibrium Guarantees
When all agents in a multiagent system deploy decentralized optimistic mirror descent (e.g., via DORIS), the time-averaged mixture of their policies forms an approximate coarse correlated equilibrium (CCE) of the underlying Markov game. The approximation error is bounded explicitly in terms of the agents' regret, the statistical complexity (e.g., BEE dimension), and the number of episodes.
Averaging across agents’ historical policies preserves robustness even in dynamically changing or adversarial environments (Zhan et al., 2022).
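A degenerate special case of this averaging claim can be checked directly: in a zero-sum matrix game, two players running exponential-weights self-play have time-averaged policies that approach the equilibrium. The game (matching pennies), step size, and horizon below are illustrative assumptions.

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])       # row payoff; the column player gets -A
eta, K = 0.02, 2000
logp = np.array([0.2, 0.0])       # slightly asymmetric start
logq = np.zeros(2)
avg_p = np.zeros(2)

for t in range(K):
    p = np.exp(logp - logp.max()); p /= p.sum()
    q = np.exp(logq - logq.max()); q /= q.sum()
    avg_p += p
    logp += eta * (A @ q)          # row player's expected payoff vector
    logq += eta * (-A.T @ p)       # column player's expected payoff vector

avg_p /= K                         # time-averaged policy
```

The iterates themselves cycle around the uniform equilibrium, but the time average concentrates near it, which is the matrix-game analogue of averaging agents' historical policies to obtain an approximate CCE.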
7. Extensions and Open Problems
Current methodologies offer statistical efficiency under bounded function classes and consistent, policy-revealing adversaries; however, several fundamental questions remain. Notably, whether the dependence of existing regret bounds on the underlying complexity parameters is information-theoretically necessary, and whether other complexity notions can supplant or complement the BEE dimension for general function approximation, are open problems (Nguyen-Tang et al., 2024). Existing guarantees for large-scale, function-approximation, or deep RL settings are primarily information-theoretic; computational efficiency in these domains is largely unresolved. Extending model guidance and policy-regret frameworks to settings with continuous spaces, partial observability, or general-sum games is a key frontier (Han et al., 2022, Wei et al., 2017, Nguyen-Tang et al., 2024).