State-Adversarial Markov Games
- SAMGs are multi-agent reinforcement learning frameworks that integrate adversarial perturbations to state observations, modeling uncertainty from sensor errors or malicious attacks.
- They introduce robust solution concepts like min-max dynamic programming and robust agent policies to address the breakdown of traditional Nash equilibria.
- Algorithms such as RMA3C empirically improve robustness, achieving up to 58.5% mean reward gains under adversarial noise and scaled agent settings.
A State-Adversarial Markov Game (SAMG) is a formal multi-agent reinforcement learning (MARL) framework that explicitly incorporates adversarial perturbations to agents’ state observations, modeling state uncertainty within Markov games. SAMGs generalize standard Markov games by allowing an adversary to select, at each step, which (possibly perturbed) observations the agents receive—potentially inducing severe robustness challenges for both the existence and computation of equilibrium solutions. The framework has yielded novel solution concepts and algorithms, particularly addressing the breakdown of traditional optimality and Nash equilibrium notions under state uncertainty (Han et al., 2022).
1. Formal Model and Definitions
An SAMG is specified by the tuple $(\mathcal{N}, \mathcal{S}, \{\mathcal{A}^i\}_{i \in \mathcal{N}}, \{\mathcal{O}^i_s\}, p, \{r^i\}_{i \in \mathcal{N}}, \gamma)$:
- $\mathcal{N} = \{1, \dots, n\}$: agent set.
- $\mathcal{S}$: finite global state space.
- For each $i \in \mathcal{N}$, $\mathcal{A}^i$: local action set; joint action $a = (a^1, \dots, a^n)$, with $\mathcal{A} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^n$.
- $p(s' \mid s, a)$: Markovian state transition dynamics (not manipulated by the adversary).
- $r^i(s, a)$: stage reward for agent $i$, often shared.
- $\gamma \in [0, 1)$: discount factor.
- For every $s \in \mathcal{S}$, the adversary selects for each agent $i$ a perturbed observation $\hat{s}^i \in \mathcal{O}^i_s \subseteq \mathcal{S}$. The collection $\{\mathcal{O}^i_s\}$ defines admissible observation sets.
- Agent $i$'s policy $\pi^i(a^i \mid \hat{s}^i)$ maps local observation to actions; the adversary policy $\chi^i(\cdot \mid s)$ specifies a distribution over $\mathcal{O}^i_s$.
Each time step, the environment draws the true state $s_t$, the adversary samples $\hat{s}^i_t \sim \chi^i(\cdot \mid s_t)$ for each agent, agents select $a^i_t \sim \pi^i(\cdot \mid \hat{s}^i_t)$, and the environment transitions via $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
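The interaction loop above can be sketched in code. This is a minimal illustrative sketch, not the paper's implementation; the names `samg_step`, `agents`, and `adversaries` are assumptions, and the toy dynamics exist only to make the example runnable.

```python
import random

def samg_step(s, agents, adversaries, p, r, rng):
    """One interaction step of a State-Adversarial Markov Game (sketch)."""
    n = len(agents)
    # Adversary perturbs each agent's observation of the true state s;
    # the perturbed observation must lie in the admissible set O_s^i.
    obs = [adversaries[i](s) for i in range(n)]
    # Agents act on perturbed observations, never on the true state.
    a = tuple(agents[i](obs[i]) for i in range(n))
    # Dynamics and rewards depend on the TRUE state: the adversary
    # corrupts perception only, not the environment itself.
    dist = p(s, a)                                      # {s_next: prob}
    s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
    rewards = [r(i, s, a) for i in range(n)]
    return s_next, rewards

# Toy 2-state instance: the adversary always shows agents the wrong state.
rng = random.Random(0)
p = lambda s, a: {(s + sum(a)) % 2: 1.0}                # deterministic toggle
r = lambda i, s, a: 1.0 if s == 0 else 0.0              # shared reward
agents = [lambda obs: obs, lambda obs: obs]             # act on what they see
adversaries = [lambda s: 1 - s, lambda s: 1 - s]        # worst-case flip
s_next, rewards = samg_step(0, agents, adversaries, p, r, rng)
```

Note how the perturbation changes the agents' actions (they see state 1, not 0) while transitions and rewards remain anchored to the true state.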
This modeling framework captures threat models where the adversary may represent noisy sensors, malicious attackers, or systemic perceptual bias—providing a controlled, game-theoretic context for analyzing MARL robustness to state perturbations (Han et al., 2022, Wei et al., 2017).
2. Solution Concepts and Value Functions
2.1 Robust Value Functions and Min-Max Dynamic Programming
For policy profile $\pi = (\pi^1, \dots, \pi^n)$, the robust value function is
$$\tilde{V}^i_\pi(s) \;=\; \min_{\chi} \, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\, r^i(s_t, a_t) \,\middle|\, s_0 = s,\ \hat{s}^j_t \sim \chi^j(\cdot \mid s_t),\ a^j_t \sim \pi^j(\cdot \mid \hat{s}^j_t)\right].$$
This embodies the min-max principle: agents face the worst-case sequence of allowed adversarial perturbations.
This robust value function satisfies a Bellman-type min-max recursion:
$$\tilde{V}^i_\pi(s) \;=\; \min_{\hat{s} \in \mathcal{O}_s} \sum_{a \in \mathcal{A}} \Big[\prod_{j \in \mathcal{N}} \pi^j(a^j \mid \hat{s}^j)\Big] \left( r^i(s, a) + \gamma \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, \tilde{V}^i_\pi(s') \right),$$
where $\hat{s} = (\hat{s}^1, \dots, \hat{s}^n)$ ranges over $\mathcal{O}_s = \mathcal{O}^1_s \times \cdots \times \mathcal{O}^n_s$.
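For a fixed policy, the min-max recursion can be iterated to its fixed point on a toy instance. The sketch below assumes a single agent and invented two-state dynamics purely for illustration; it is not the paper's setup.

```python
# Toy robust value iteration for a FIXED policy on a 2-state, 2-action SAMG.
# All quantities below are illustrative assumptions.
gamma = 0.9
S = [0, 1]
A = [0, 1]                                   # single agent, for brevity
O = {0: [0, 1], 1: [1]}                      # admissible perturbed observations
pi = {0: {0: 1.0, 1: 0.0},                   # pi[obs][action]
      1: {0: 0.0, 1: 1.0}}
p = {(s, a): {(s + a) % 2: 1.0} for s in S for a in A}   # toy dynamics
r = {(s, a): float(s == a) for s in S for a in A}        # toy reward

V = {s: 0.0 for s in S}
for _ in range(500):                         # contraction -> fixed point
    V = {s: min(                             # adversary picks the worst obs
            sum(pi[o][a] * (r[s, a] + gamma *
                sum(q * V[s2] for s2, q in p[s, a].items()))
                for a in A)
            for o in O[s])
         for s in S}
```

At state 0 the adversary can show observation 1, steering the agent into the zero-reward action, so $\tilde{V}(0)$ settles strictly below the unperturbed value $1/(1-\gamma)$.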
2.2 Nonexistence of Standard Optima and Nash Equilibrium
In contrast to standard MARL, SAMGs may lack
- Global robust optima: There need not exist a "totally optimal" policy $\pi^*$ achieving $\tilde{V}^i_{\pi^*}(s) \geq \tilde{V}^i_{\pi}(s)$ for all agents $i$, all states $s$, and all alternative policies $\pi$ under adversarial perturbations.
- Robust Nash equilibrium: Stage-wise Nash equilibria (where each agent best-responds under min-max value) may not be globally consistent: a single policy profile may fail to align all stage-wise equilibria (Theorem 4.7 in (Han et al., 2022)).
2.3 Robust Agent Policy: New Solution Concept
The robust agent policy addresses this deficiency by instead maximizing the worst-case expected start-state value:
$$\max_{\pi} \min_{\chi} \, \mathbb{E}_{s_0 \sim \rho}\!\left[ V^i_{\pi, \chi}(s_0) \right],$$
where $\rho$ is a start-state distribution. This yields a well-defined solution even when global optima and Nash equilibria fail to exist.
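A small numeric sketch shows why aggregating over a start-state distribution restores well-posedness: even when neither policy dominates the other at every state, the criterion assigns each policy a single scalar, so a maximizer exists. The values and names below are illustrative, not from the paper.

```python
# Score two candidate policies by J(pi) = E_{s0 ~ rho}[ V~_pi(s0) ], where
# V~_pi denotes the worst-case (already minimized over the adversary) value.
# All numbers here are invented for illustration.

rho = {0: 0.5, 1: 0.5}                      # start-state distribution
V_worst = {"pi_A": {0: 4.0, 1: 1.0},        # pi_A is better at state 0
           "pi_B": {0: 2.0, 1: 2.5}}        # pi_B is better at state 1

# Neither policy dominates state-by-state, yet J gives each a single
# scalar score, so a maximizer always exists over a compact policy set.
J = {name: sum(rho[s] * v[s] for s in rho) for name, v in V_worst.items()}
best = max(J, key=J.get)
```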
3. Existence Results and Theoretical Properties
Existence theory for robust agent policies in SAMGs proceeds as follows:
- Stage-wise robust value function: For any fixed opponent policies $\pi^{-i}$ and adversary $\chi$, the associated Bellman operator is a contraction, so Banach's fixed-point theorem guarantees existence and uniqueness (Theorem 4.5).
- Stage-wise equilibrium: At each state $s \in \mathcal{S}$, the associated $n$-agent vs. $n$-adversary normal-form game has at least one max-min equilibrium (Theorem 4.6).
- Global robust Nash: In general, no policy profile aligns all per-state stage-wise robust equilibria.
- Robust agent policy: For finite $\mathcal{S}$ and $\mathcal{A}$, the functional $J^i(\pi) = \min_{\chi} \mathbb{E}_{s_0 \sim \rho}\big[V^i_{\pi,\chi}(s_0)\big]$—with $\rho$ a fixed start-state distribution—is continuous over compact policy spaces. By applying the extreme value theorem, a robust agent policy maximizing $J^i$ always exists (Theorem 4.11 in (Han et al., 2022)).
This result is fundamental: it ensures a well-posed optimization objective for learning robust MARL policies under state adversaries.
4. Robust Multi-Agent Adversarial Actor-Critic (RMA3C) Algorithm
The RMA3C algorithm is designed to learn robust policies in SAMGs by solving the min-max problem
$$\max_{\pi} \min_{\chi} \, \mathbb{E}_{s_0 \sim \rho}\!\left[ V_{\pi, \chi}(s_0) \right]$$
using alternating gradient descent (on agent policies) and ascent (on adversary policies), with a centralized critic.
Algorithmic Steps (per agent $i$):
- Initialize centralized critic $Q_{\theta^i}$, actor $\pi_{\psi^i}$, adversary $\chi_{\phi^i}$, and target networks $\theta'^i, \psi'^i, \phi'^i$.
- Training episode:
- At state $s_t$, the adversary samples $\hat{s}^i_t \sim \chi_{\phi^i}(\cdot \mid s_t)$; agent $i$ samples $a^i_t \sim \pi_{\psi^i}(\cdot \mid \hat{s}^i_t)$.
- Execute the joint action, observe $(r_t, s_{t+1})$, store the transition in replay buffer $\mathcal{D}$.
- Update:
- Sample a minibatch from $\mathcal{D}$; update $\theta^i$ to minimize the TD error
$$L(\theta^i) = \mathbb{E}\!\left[\big(Q_{\theta^i}(s, a^1, \dots, a^n) - y\big)^2\right],$$
where $y = r^i + \gamma\, Q_{\theta'^i}(s', a'^1, \dots, a'^n)$, $a'^j \sim \pi_{\psi'^j}(\cdot \mid \hat{s}'^j)$, and $\hat{s}'^j \sim \chi_{\phi'^j}(\cdot \mid s')$ (primes denote target networks).
- Update $\phi^i$ by gradient steps that decrease the critic value (the adversary minimizes the agent's expected return).
- Update $\psi^i$ by policy-gradient ascent on
$$J(\psi^i) = \mathbb{E}_{s \sim \mathcal{D},\ \hat{s}^i \sim \chi_{\phi^i}(\cdot \mid s)}\!\left[ Q_{\theta^i}(s, a^1, \dots, a^n)\,\big|_{\,a^i = \pi_{\psi^i}(\hat{s}^i)} \right].$$
This alternating update structure operationalizes the theoretical robust agent policy solution within a scalable DRL framework (Han et al., 2022).
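The alternating descent/ascent pattern can be isolated on a toy saddle objective. This sketch is not the paper's actor-critic losses; it only shows the update structure on the assumed surrogate $f(x, y) = x^2 - y^2$, whose unique saddle point is $(0, 0)$.

```python
# Toy gradient descent-ascent: the "agent" parameter x descends f while
# the "adversary" parameter y ascends f, mirroring RMA3C's alternation.
def grad_f(x, y):
    return 2.0 * x, -2.0 * y      # (df/dx, df/dy) for f(x,y) = x^2 - y^2

x, y, lr = 3.0, -2.0, 0.05
for _ in range(1000):
    gx, gy = grad_f(x, y)
    x -= lr * gx                  # agent step: gradient DESCENT on f
    y += lr * gy                  # adversary step: gradient ASCENT on f
# Both parameters contract toward the saddle point (0, 0).
```

On this convex-concave surrogate the alternation provably converges; in RMA3C the objective is nonconvex-nonconcave, so the alternation is a heuristic approximation of the robust agent policy criterion.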
5. Empirical Evaluation and Robustness
RMA3C was evaluated on several MARL environments—Cooperative Navigation, Exchange Target, Keep-Away, and Physical Deception (from Lowe et al., 2017)—under three classes of state perturbations:
- Random noise drawn from a truncated normal distribution, added to the true state.
- Fixed, well-trained adversarial policy (max-episode adversary).
- On-policy adversaries trained jointly with agents.
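The first perturbation class can be sketched as follows. The sampler below uses rejection to enforce the truncation bound; `sigma` and `eps` are assumed illustrative values, not the paper's exact settings.

```python
import random

def truncated_normal_noise(dim, sigma=0.1, eps=0.5, rng=None):
    """Zero-mean Gaussian noise truncated to [-eps, eps] per component."""
    rng = rng or random.Random()
    noise = []
    for _ in range(dim):
        x = rng.gauss(0.0, sigma)
        while abs(x) > eps:       # reject draws outside the truncation bound
            x = rng.gauss(0.0, sigma)
        noise.append(x)
    return noise

# Perturbed observation: true state plus bounded noise.
state = [0.2, -0.4, 1.0, 0.0]
rng = random.Random(0)
noisy = [s + n for s, n in zip(state, truncated_normal_noise(4, rng=rng))]
```

Bounding the noise keeps the perturbed observation inside an admissible set around the true state, matching the SAMG assumption that the adversary's choices are constrained.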
Baselines considered were MADDPG, M3DDPG (min-max DDPG), and MAPPO. RMA3C achieved significant empirical robustness:
- Up to 58.5% mean reward improvement over baselines under both random and adversarial noise during training.
- Test-time mean reward improvement up to 46.6% (random noise) and up to 54.0% (adversary).
- Robustness sustained when scaling to 4–6 agents in Cooperative Navigation.

A plausible implication is that the robust agent policy min-max criterion, as approximated by RMA3C, can substantially improve resilience in adversarial or noisy MARL settings (Han et al., 2022).
6. Relations to General Stochastic Game Learning and Limitations
SAMGs, as presented in (Han et al., 2022), focus on robust MARL under adversarial state perturbation, diverging from classical zero-sum stochastic games (SGs) where both the transition kernel and the reward may be adversarial. The UCSG (Upper Confidence for Stochastic Games) approach (Wei et al., 2017) targets sample-efficient online learning with regret/safety guarantees in zero-sum SGs, incorporating confidence sets and extended value iteration for efficient exploration and optimization.
Key differences:
- SGs (in (Wei et al., 2017)) consider direct adversarial action on system dynamics and rewards, and develop algorithms (UCSG) with provable regret/sample-complexity bounds relative to the game value, depending on the game diameter $D$ and the sizes of the state and action sets.
- SAMGs’ core novelty lies in adversarial perturbations to the agents’ perception of state rather than to transition dynamics or reward, modeling realistic sensor attacks and uncertainty in MARL.
A limitation in current SAMG theory is the high computational complexity of robust learning (e.g., maintaining and optimizing over adversary policy classes, scalability of alternating min-max optimization), as well as the lack of global robust Nash solutions—necessitating the robust agent policy min-max criterion.
7. Extensions and Open Directions
Suggested directions for further development include:
- Generalizing beyond Markov games with finite state/action spaces to function approximation settings.
- Exploring more intricate adversarial perturbation models, including those affecting transition or reward (blending with SG literature (Wei et al., 2017)).
- Studying learning in settings with partial observability or continuous space.
- Algorithmic improvements, such as variance-aware updates and regularization, to improve scalability and sample efficiency.
The formalization of SAMGs, together with existence theory for robust agent policies and practical algorithms for robust MARL, establishes a rigorous foundation for ongoing research into reliable decision-making under adversarial state uncertainty (Han et al., 2022, Wei et al., 2017).