
State-Adversarial Markov Games

Updated 8 January 2026
  • SAMGs are multi-agent reinforcement learning frameworks that integrate adversarial perturbations to state observations, modeling uncertainty from sensor errors or malicious attacks.
  • They introduce robust solution concepts like min-max dynamic programming and robust agent policies to address the breakdown of traditional Nash equilibria.
  • Algorithms such as RMA3C empirically improve robustness, achieving up to 58.5% mean reward gains under adversarial noise and scaled agent settings.

A State-Adversarial Markov Game (SAMG) is a formal multi-agent reinforcement learning (MARL) framework that explicitly incorporates adversarial perturbations to agents’ state observations, modeling state uncertainty within Markov games. SAMGs generalize standard Markov games by allowing an adversary to select, at each step, which (possibly perturbed) observations the agents receive—potentially inducing severe robustness challenges for both the existence and computation of equilibrium solutions. The framework has yielded novel solution concepts and algorithms, particularly addressing the breakdown of traditional optimality and Nash equilibrium notions under state uncertainty (Han et al., 2022).

1. Formal Model and Definitions

An SAMG is specified by the tuple $G = (N, S, \{A^i\}_i, p, \{R^i\}_i, \gamma, \Delta)$:

  • $N = \{1, \ldots, n\}$: agent set.
  • $S$: finite global state space.
  • For each $i \in N$, $A^i$: local action set; joint actions $a = (a^1, \ldots, a^n) \in \prod_i A^i$.
  • $p(s'|s,a)$: Markovian state transition dynamics (not manipulated by the adversary).
  • $R^i(s,a)$: stage reward for agent $i$; often a shared reward $r(s,a)$.
  • $\gamma \in [0,1)$: discount factor.
  • For every $s \in S$, the adversary selects a perturbed observation $\rho = (\rho^1, \ldots, \rho^n) \in \Delta_s \subseteq S^n$. The collection $\Delta = \{\Delta_s : s \in S\}$ defines the admissible observation sets.
  • Agent $i$'s policy $\pi^i(\cdot|\rho^i)$ maps its local observation to actions; the adversary policy $\chi(\rho|s)$ specifies a distribution over $\Delta_s$.

At each time step, the environment occupies the true state $s$, the adversary samples $\rho \sim \chi(\cdot|s)$, each agent selects $a^i$ based only on $\rho^i$, and the environment transitions according to $p$.
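One step of this interaction loop can be sketched as follows (a minimal tabular sketch; the state space, admissible sets, and policies below are hypothetical stand-ins, not taken from the paper):

```python
import random

# One step of an SAMG interaction loop (toy sketch; all dynamics made up).
# Three global states, two agents with binary actions; the adversary may show
# each agent any state within distance 1 of the true state (the set Delta_s).

N_STATES, N_AGENTS = 3, 2

def admissible(s):
    """Delta_s: observations the adversary is allowed to show for true state s."""
    return [o for o in range(N_STATES) if abs(o - s) <= 1]

def adversary(s, rng):
    """chi(.|s): here simply uniform over the admissible perturbations."""
    return tuple(rng.choice(admissible(s)) for _ in range(N_AGENTS))

def agent_policy(obs):
    """pi^i(.|rho^i): acts on the *perturbed* observation, not the true state."""
    return obs % 2

def transition(s, joint_action):
    """p(s'|s,a): the dynamics depend on the TRUE state -- the adversary
    perturbs only what the agents see, never the transition itself."""
    return (s + sum(joint_action)) % N_STATES

rng = random.Random(0)
s = 1
rho = adversary(s, rng)                  # adversary perturbs observations
a = tuple(agent_policy(r) for r in rho)  # agents act on perturbed obs
s_next = transition(s, a)                # environment moves from the true state
print(rho, a, s_next)
```

The key structural point, visible in `transition`, is that the adversary corrupts perception only; rewards and dynamics still evolve from the true state.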

This modeling framework captures threat models where the adversary may represent noisy sensors, malicious attackers, or systemic perceptual bias—providing a controlled, game-theoretic context for analyzing MARL robustness to state perturbations (Han et al., 2022, Wei et al., 2017).

2. Solution Concepts and Value Functions

2.1 Robust Value Functions and Min-Max Dynamic Programming

For a policy profile $\pi$, the robust value function is

$$v^\pi(s) \equiv \min_{(\rho_0, \rho_1, \ldots):\, \rho_t \in \Delta_{s_t}} \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \;\middle|\; s_0 = s,\ a_t \sim \pi(\cdot|\rho_t),\ s_{t+1} \sim p(\cdot|s_t, a_t) \right]$$

This embodies the min-max principle: agents face the worst-case sequence of allowed adversarial perturbations.

This robust value function satisfies a Bellman-type min-max recursion:

$$v^\pi(s) = \min_{\rho \in \Delta_s} \mathbb{E}_{a \sim \pi(\cdot|\rho)} \left[ r(s, a) + \gamma \sum_{s'} p(s'|s, a)\, v^\pi(s') \right]$$
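Because this recursion is a contraction, fixed-point iteration evaluates $v^\pi$ just as in standard policy evaluation, with an added inner minimization over the admissible observations. A tabular sketch, collapsed to a single agent for brevity (the tiny MDP below is made up for illustration):

```python
import numpy as np

# Robust policy evaluation for a fixed policy pi via the min-max Bellman
# recursion. Single-agent tabular instance; all numbers are illustrative.
nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a] = p(.|s, a)
R = rng.uniform(0.0, 1.0, size=(nS, nA))         # r(s, a)
pi = rng.dirichlet(np.ones(nA), size=nS)         # pi(a | observed state)
Delta = {s: [s, (s + 1) % nS] for s in range(nS)}  # admissible observations

v = np.zeros(nS)
for _ in range(500):                             # contraction => convergence
    v_new = np.empty(nS)
    for s in range(nS):
        # Worst case over the perturbed observations the adversary may show:
        candidates = []
        for rho in Delta[s]:
            q = R[s] + gamma * P[s] @ v          # q[a], true-state dynamics
            candidates.append(pi[rho] @ q)       # agent acts on rho, not s
        v_new[s] = min(candidates)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new
print(np.round(v, 4))
```

Since $\rho = s$ is always admissible here, the robust value is bounded above by the standard (unperturbed) policy evaluation at every state.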

2.2 Nonexistence of Standard Optima and Nash Equilibrium

In contrast to standard MARL, SAMGs may lack

  • Global robust optima: there need not exist a “totally optimal” policy $\pi^*$ achieving $v^{\pi^*}(s) \geq v^\pi(s)$ for all $s$ and all $\pi$ under adversarial perturbations.
  • Robust Nash equilibrium: stage-wise Nash equilibria (where each agent best-responds under the min-max value) need not be globally consistent: a single policy profile may fail to align all stage-wise equilibria (Theorem 4.7 in (Han et al., 2022)).

2.3 Robust Agent Policy: New Solution Concept

The robust agent policy addresses this deficiency by instead maximizing the worst-case expected start-state value:

$$J(\pi) \equiv \mathbb{E}_{s_0 \sim \mu_0}\left[ v^\pi(s_0) \right], \qquad \pi^* \in \arg\max_\pi J(\pi)$$

where $\mu_0$ is a start-state distribution. This yields a well-defined solution even when global optima and Nash equilibria fail to exist.
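A one-step toy game illustrates why maximizing $J(\pi)$ differs from best-responding to observations (the payoffs, admissible sets, and policies below are invented for illustration):

```python
import numpy as np

# The robust-agent-policy criterion on a one-step toy game (numbers made up):
# J(pi) = E_{s0~mu0}[ min_{rho in Delta_{s0}} E_{a~pi(.|rho)} R[s0, a] ].
# A policy that is optimal under perfect observation can score worse than a
# cautious one once the adversary controls what the agents see.
nS = 2
mu0 = np.array([0.5, 0.5])                     # start-state distribution mu_0
Delta = {0: [0, 1], 1: [0, 1]}                 # adversary may show either obs
R = np.array([[1.0, 0.0],                      # R[s, a], actions a in {0, 1}
              [0.0, 1.0]])

def J(pi):
    """Worst-case expected start-state value of pi (rows index observations)."""
    v = np.array([min(pi[rho] @ R[s] for rho in Delta[s]) for s in range(nS)])
    return float(mu0 @ v)

greedy = np.array([[1.0, 0.0], [0.0, 1.0]])    # best response to the shown obs
cautious = np.array([[0.5, 0.5], [0.5, 0.5]])  # hedges against perturbation
print(J(greedy), J(cautious))
```

Under these made-up payoffs the observation-greedy policy is worth 0 in the worst case (the adversary always shows the wrong state), while the uniform policy guarantees 0.5, so $\arg\max_\pi J(\pi)$ selects the cautious policy.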

3. Existence Results and Theoretical Properties

Existence theory for robust agent policies in SAMGs proceeds as follows:

  • Stage-wise robust value function: for any fixed opponents and adversary, the robust Bellman operator is a contraction, so Banach’s fixed-point theorem guarantees existence and uniqueness (Theorem 4.5).
  • Stage-wise equilibrium: at each state $s$, the associated $n$-agent vs. $n$-adversary normal-form game admits at least one max-min equilibrium (Theorem 4.6).
  • Global robust Nash: in general, no policy profile aligns all per-state stage-wise robust equilibria.
  • Robust agent policy: for finite $S, A$, the functional $F(\pi) = \min_{\chi} J(\pi, \chi)$, with $J(\pi, \chi) = \mathbb{E}_{s_0}[V_{\pi,\chi}(s_0)]$, is continuous over compact policy spaces. By the extreme value theorem, a robust agent policy maximizing $F(\pi)$ always exists (Theorem 4.11 in (Han et al., 2022)).

This result is fundamental: it ensures a well-posed optimization objective for learning robust MARL policies under state adversaries.

4. Robust Multi-Agent Adversarial Actor-Critic (RMA3C) Algorithm

The RMA3C algorithm is designed to learn robust policies in SAMGs by solving the min-max problem

$$\max_\pi \min_\chi \mathbb{E}_{s_0} \left[ V_{\pi, \chi}(s_0) \right]$$

using alternating gradient descent (on agent policies) and ascent (on adversary policies), with a centralized critic.

Algorithmic Steps (per agent ii):

  • Initialize the centralized critic $Q_\phi(s, a)$, actor $\pi^i_\theta$, adversary $\chi^i_\psi$, and their target networks.
  • Training episode:
    • At state $s$, the adversary samples $\rho^i \sim \chi^i_\psi(\cdot|s)$; the agent samples $a^i \sim \pi^i_\theta(\cdot|\rho^i)$.
    • Execute the joint action, observe $r, s'$, and store $(s, a, r, s')$ in replay buffer $D$.
  • Update:
    • Sample a minibatch $B$ from $D$; update $Q_\phi$ to minimize the TD error against the target

    $$y = r + \gamma Q_{\phi'}(s', a')$$

    where $a'^i = \pi^{i'}(\rho'^i)$ and $\rho'^i \sim \chi^{i'}(\cdot|s')$.
    • Update $(\theta, \psi)$ by $k$ gradient steps on

    $$J_\text{batch}(\theta, \psi) = \frac{1}{|B|}\sum_{(s,a,\,\cdot)\in B} Q_\phi(s,a)$$

  • Policy gradient updates:

    • $\nabla_{\theta^i} J \approx \mathbb{E}_{s \sim D,\, \rho \sim \chi_\psi} \left[\nabla_{\theta^i} \log \pi^i_\theta(a^i|\rho^i)\, Q_\phi(s, a)\right]$
    • $\nabla_{\psi^i} J \approx -\mathbb{E}_{s \sim D,\, a \sim \pi_\theta} \left[\nabla_{\psi^i} \log \chi^i_\psi(\rho^i|s)\, Q_\phi(s, a)\right]$
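The alternating ascent/descent can be illustrated on a deliberately minimal instance: one state, one agent, a fixed critic, and exact gradients in place of the sampled score-function estimates (everything here is a hypothetical stand-in for the actual RMA3C networks):

```python
import numpy as np

# Minimal alternating ascent/descent on the objective
#   max_theta min_psi  J = sum_rho chi(rho) sum_a pi(a|rho) Q[rho, a]
# with softmax-parameterized actor pi and adversary chi. Toy numbers only.
rng = np.random.default_rng(1)
nObs, nA = 3, 3
Q = rng.uniform(0, 1, size=(nObs, nA))  # stand-in for the centralized critic

theta = np.zeros((nObs, nA))            # actor logits, pi(a | rho)
psi = np.zeros(nObs)                    # adversary logits, chi(rho | s)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.5
for _ in range(2000):
    pi, chi = softmax(theta), softmax(psi)
    # Exact gradients of J (replacing the sampled log-prob estimates):
    grad_theta = chi[:, None] * pi * (Q - (pi * Q).sum(-1, keepdims=True))
    J_rho = (pi * Q).sum(-1)            # agent's value per shown observation
    grad_psi = chi * (J_rho - chi @ J_rho)
    theta += lr * grad_theta            # agent: gradient ascent on J
    psi -= lr * grad_psi                # adversary: gradient descent on J

pi, chi = softmax(theta), softmax(psi)
print(chi @ (pi * Q).sum(-1))           # worst-case value after training
```

The agent's logits drift toward the best action for each possible observation, while the adversary concentrates on the observation that yields the lowest agent value, mirroring the $\max_\pi \min_\chi$ structure above.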

This alternating update structure operationalizes the theoretical robust agent policy solution within a scalable DRL framework (Han et al., 2022).

5. Empirical Evaluation and Robustness

RMA3C was evaluated on four MARL environments from (Lowe et al., 2017): Cooperative Navigation, Exchange Target, Keep-Away, and Physical Deception, under three classes of state perturbations:

  • Truncated normal noise $N(0, \lambda, u, l)$.
  • A fixed, well-trained adversarial policy $\chi^*$ (max-episode adversary).
  • On-policy adversaries trained jointly with agents.

Baselines considered were MADDPG, M3DDPG (min-max DDPG), and MAPPO. RMA3C achieved significant empirical robustness:

  • Up to 58.5% mean reward improvement over baselines under both random and adversarial noise during training.
  • Test-time mean reward improvement up to 46.6% (random noise) and up to 54.0% (adversary).
  • Robustness sustained when scaling to 4–6 agents in Cooperative Navigation.

A plausible implication is that the robust-agent-policy min-max criterion, as approximated by RMA3C, can substantially improve resilience in adversarial or noisy MARL settings (Han et al., 2022).

6. Relations to General Stochastic Game Learning and Limitations

SAMGs, as presented in (Han et al., 2022), focus on robust MARL under adversarial state perturbation, diverging from classical zero-sum stochastic games (SGs) where both the transition kernel and the reward may be adversarial. The UCSG (Upper Confidence for Stochastic Games) approach (Wei et al., 2017) targets sample-efficient online learning with regret/safety guarantees in zero-sum SGs, incorporating confidence sets and extended value iteration for efficient exploration and optimization.

Key differences:

  • SGs (in (Wei et al., 2017)) consider direct adversarial action on the system dynamics and reward, and develop algorithms (UCSG) with provable regret/sample-complexity bounds relative to the game value, depending on the game diameter $D$ and the state and action set sizes.
  • SAMGs’ core novelty lies in adversarial perturbations to the agents’ perception of state rather than to transition dynamics or reward, modeling realistic sensor attacks and uncertainty in MARL.

A limitation in current SAMG theory is the high computational complexity of robust learning (e.g., maintaining and optimizing over adversary policy classes, scalability of alternating min-max optimization), as well as the lack of global robust Nash solutions—necessitating the robust agent policy min-max criterion.

7. Extensions and Open Directions

Suggested directions for further development include:

  • Generalizing beyond Markov games with finite state/action spaces to function approximation settings.
  • Exploring more intricate adversarial perturbation models, including those affecting transition or reward (blending with SG literature (Wei et al., 2017)).
  • Studying learning in settings with partial observability or continuous space.
  • Algorithmic improvements, such as variance-aware updates and regularization, to improve scalability and sample efficiency.

The formalization of SAMGs, together with existence theory for robust agent policies and practical algorithms for robust MARL, establishes a rigorous foundation for ongoing research into reliable decision-making under adversarial state uncertainty (Han et al., 2022, Wei et al., 2017).
