
Cross-Graph RL for Pursuit-Evasion Games

Updated 28 November 2025
  • Cross-graph RL is an approach that generalizes pursuit-evasion strategies across diverse graph structures using GNN-based encodings and belief preservation mechanisms.
  • It integrates cross-instance policy training with adversarial dynamic programming to achieve robust, real-time performance under partial observability.
  • Empirical results show significant improvements in capture rates and scalability, enabling zero-shot generalization on large, unseen graph topologies.

Cross-graph reinforcement learning is an approach to learning pursuit-evasion strategies that can generalize efficiently across diverse graph structures, especially in the context of partial observability, asynchronous moves, and real-time operational requirements in pursuit-evasion games (PEGs). By leveraging graph neural network (GNN) architectures and cross-instance policy training, these methods overcome the limitations of earlier approaches that could only guarantee optimality or robustness in static scenarios or for single graphs. Recent advances incorporate belief preservation mechanisms, robust adversarial training against dynamic programming (DP)-optimal evaders, and scalable GNN encodings, establishing cross-graph RL as a state-of-the-art paradigm for real-time, zero-shot generalization in graph-based pursuit-evasion domains (Lu et al., 21 Nov 2025).

1. Formalization of Cross-Graph Pursuit-Evasion Games

PEGs on graphs model the interaction of multiple pursuers and an evader as a discrete-time, adversarial game over a network $\mathcal G = (\mathcal V, \mathcal E)$, with pursuers and the evader occupying vertices and moving to neighboring vertices or remaining in place. Under partial observability, pursuers cannot observe the evader’s position unless it falls within a restricted sensor range $r$; the evader, conversely, can observe the pursuers' positions and act strategically with full or partial information.

In this setting, the state is lifted to include not only pursuer positions but also a belief distribution $b \in \Delta(\mathcal V)$ (and its support $\mathrm{Pos}$) encoding the pursuers' probabilistic knowledge of the evader’s possible locations. The cross-graph RL problem is formulated as learning a robust pursuer policy $\pi_\theta$ that accepts as input the current graph, agent positions, belief support, and node/agent features, then outputs an action for each pursuer. Critically, the policy must generalize across different graph topologies and initializations, making per-graph training infeasible at scale (Lu et al., 21 Nov 2025).

2. Belief Preservation and Partial Observability

A central technical challenge for cross-graph RL in PEGs is partial observability: pursuers often do not have direct access to the evader’s location except when in close proximity. As shown in (Lu et al., 21 Nov 2025), a lightweight belief preservation mechanism allows pursuers to propagate and refine the support $\mathrm{Pos}_t$ and posterior $b_t$ of plausible evader positions at each time step. The update is as follows:

  • If the evader is observed, $\mathrm{Pos}_{t+1}$ collapses to the singleton containing the observed node.
  • If unobserved, the belief spreads to all neighbors of $\mathrm{Pos}_t$ (consistent with possible evader moves), with any nodes currently visible to pursuers excluded from the support.

This induces a one-sided POMDP structure in which the pursuers' optimal policy is belief-conditional. The DP-based reference policy can be approximated in closed form by a belief-weighted minimax, where at each step the pursuers commit to actions minimizing the worst-case future expected “distance-to-capture” over $b$ (Lu et al., 21 Nov 2025).
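The support update above can be sketched in a few lines. This is a minimal illustration, assuming the visible set is given explicitly (the paper derives it from a sensor radius $r$; the function and argument names here are hypothetical):

```python
def update_support(graph, support, visible, observed=None):
    """One step of the belief-support update under partial observability.

    graph:    dict mapping each node to its set of neighbors
    support:  current set of plausible evader nodes (Pos_t)
    visible:  nodes currently within the pursuers' sensor range
    observed: the evader's node if it was actually seen, else None
    """
    if observed is not None:
        # An observation collapses the support to a singleton.
        return {observed}
    # Otherwise the evader may have stayed put or moved to any neighbor...
    new_support = set()
    for v in support:
        new_support.add(v)
        new_support |= graph[v]
    # ...but cannot occupy any node the pursuers can currently see.
    return new_support - set(visible)
```

Maintaining the posterior $b_t$ over this support (e.g., spreading mass uniformly over each node's successors and renormalizing) follows the same pattern.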

3. Cross-Graph RL Architecture and Equilibrium Policy Generalization

To enable generalization across many graph instances, each node is featurized by GNN layers that encode local connectivity, pursuer distances, membership in $\mathrm{Pos}$, and belief mass $b(v)$ for each $v \in \mathcal V$. The GNN output is decoded via a pointer network or similar mechanism to yield amenable action representations even for large or variable-sized graphs.
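As an illustration, the per-node inputs feeding such a GNN encoder could be assembled as below. The exact feature layout is a hypothetical sketch, not the paper's specification:

```python
from collections import deque

def node_features(graph, pursuers, support, belief):
    """Per-node GNN input: [degree, min BFS distance to any pursuer,
    support membership indicator, belief mass b(v)]."""
    def bfs_dist(src):
        # Unweighted shortest-path distances from one pursuer position.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    dists = [bfs_dist(p) for p in pursuers]
    return {
        v: [len(graph[v]),
            min(d.get(v, float("inf")) for d in dists),
            float(v in support),
            belief.get(v, 0.0)]
        for v in graph
    }
```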

Cross-graph RL training iterates over simulated games on a corpus of training graphs $\{G_i\}$. In each episode, the pursuer policy $\pi_\theta$ is trained adversarially against the DP-optimal asynchronous evader policy $\nu^*$ precomputed for that graph, which is robust to any pursuer policy. The loss combines the RL objective (e.g., MAPPO or SAC) with a KL-divergence regularization toward the DP reference policy, which facilitates convergence to worst-case-robust actions.
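The shape of the resulting per-step actor objective can be sketched as a policy-gradient term plus a KL penalty toward the DP reference. The weighting `beta` and the exact estimator below are illustrative assumptions, not the paper's stated formula:

```python
import math

def actor_loss(logp_action, advantage, probs_theta, probs_dp, beta=0.1):
    """Policy-gradient loss plus beta * KL(pi_theta || pi_DP).

    probs_theta / probs_dp: action distributions of the learned policy
    and the DP reference policy at the current state.
    """
    pg = -logp_action * advantage          # REINFORCE-style PG term
    kl = sum(p * math.log(p / q)           # KL toward the DP reference
             for p, q in zip(probs_theta, probs_dp) if p > 0)
    return pg + beta * kl
```

When the learned policy matches the DP reference, the KL term vanishes and the loss reduces to the plain policy-gradient objective.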

The RL agent thus learns not only to exploit structural regularities common to large classes of graphs, but also to behave optimally—according to the minimax principle—against the hardest evader in any particular instance. The result is a single policy $\pi_\theta$ capable of zero-shot generalization to held-out graphs with different size, topology, and observability regimes (Lu et al., 21 Nov 2025).

4. Dynamic Programming for Robust Reference Strategies

A critical substrate for cross-graph RL is the use of DP to compute optimal pursuit and evasion strategies under asynchronous moves and full/partial observability. The Bellman-type minimax recursion (see Lemma 1 in (Lu et al., 21 Nov 2025)) for worst-case robust pursuit under asynchronous evader moves is $D(n_p, n_e) = \min_{s_p \in \mathcal N(n_p)} \max_{s_e \in \mathcal N(n_e)} \{ D(s_p, s_e) \} + 1,$ where $\mathcal N(\cdot)$ denotes a node's neighborhood (including the node itself, since agents may remain in place) and $D(s)$ stores the minimum time-to-capture from state $s$.
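For a single pursuer, this recursion can be solved by value iteration over joint position states. The sketch below additionally assumes capture occurs the moment the pursuer steps onto the evader's current node (needed to make the recursion well-founded), and leaves $D = \infty$ on states from which the evader can evade forever:

```python
def capture_times(graph):
    """Minimax time-to-capture D(n_p, n_e) for one pursuer, by value
    iteration. graph maps node -> neighbor set; both agents may also
    stay in place, so moves range over graph[n] | {n}."""
    nodes = list(graph)
    INF = float("inf")
    D = {(p, e): 0.0 if p == e else INF for p in nodes for e in nodes}
    for _ in range(len(nodes) ** 2):  # enough sweeps for finite values
        changed = False
        for p in nodes:
            for e in nodes:
                if p == e:
                    continue
                best = INF
                for sp in graph[p] | {p}:
                    if sp == e:
                        cand = 1.0  # pursuer steps onto the evader
                    else:
                        cand = 1 + max(D[(sp, se)] for se in graph[e] | {e})
                    best = min(best, cand)
                if best < D[(p, e)]:
                    D[(p, e)] = best
                    changed = True
        if not changed:
            break
    return D
```

On a 4-cycle, for instance, the table correctly reports infinite time-to-capture for a lone pursuer starting opposite the evader, since the evader can maintain its distance forever.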

For partial observability, the pursuers’ policy is extended by replacing the max over $s_e$ with a belief-weighted average; i.e., the action minimizes the expected future time-to-capture over $b$. The DP oracles are used both for behavioral cloning (in the RL objective) and as adversarial agents during training (Lu et al., 21 Nov 2025).
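The belief-weighted selection rule is then a one-liner. This is a minimal sketch assuming a precomputed table `D` mapping (pursuer node, evader node) pairs to time-to-capture values (a hypothetical interface; the paper's oracle may be organized differently):

```python
def belief_weighted_move(D, graph, p, belief):
    """Choose the pursuer move minimizing the belief-weighted expected
    time-to-capture, replacing the worst-case max over evader positions
    with an average under the posterior b."""
    def expected_cost(sp):
        return sum(prob * D[(sp, e)] for e, prob in belief.items())
    # Candidate moves: all neighbors, plus staying in place.
    return min(graph[p] | {p}, key=expected_cost)
```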

5. Empirical Performance and Real-Time Scalability

Extensive experiments across hundreds of small-to-medium real-world graphs (100–231 nodes, average degree 2.3–3.9) demonstrate that cross-graph RL policies trained via the Equilibrium Policy Generalization (EPG) scheme achieve robust capture rates against DP-optimal evaders, significantly outperforming per-graph RL baselines such as PSRO (Lu et al., 21 Nov 2025). Inference on large graphs (~2,000 nodes) remains under 0.01 s per step on commodity hardware. Zero-shot generalization is evidenced by consistent capture rates on previously unseen graph topologies with no post-training adaptation.

The policy's ability to embed partial belief support, real-time node features, and structural context into actionable strategies accounts for observed robustness in unseen environments. The approach also formally closes the gap between theoretically optimal but infeasible DP strategies and brittle, overfit RL policies.

6. Extensions, Limitations, and Open Challenges

Research has begun extending cross-graph RL to settings with more general POMDP observation models, decentralized information, and varying numbers or types of agents. One limitation is the scalability of the DP oracle, which becomes prohibitive beyond several thousand nodes. There is also open work on incorporating richer history-dependent belief updates, model uncertainty, and coordination in the presence of communication or computation constraints.

Potential advances include end-to-end learned RL architectures that replace tabular belief and DP oracles with differentiable planning modules; more sophisticated graph-to-sequence decoders; and scaling to non-discrete, dynamic, or time-varying graph environments. Further, integrating model-based and model-free RL methods may enhance robustness against novel and adaptive evaders.

7. Relation to Adjacent Domains

Cross-graph RL extends classical pursuit-evasion and search strategies (e.g., sensor-based pursuit (Krishnamoorthy et al., 2014)) and advances over earlier single-graph or heuristic methods by addressing partial observability and heterogeneity of input graphs. It connects deeply with recent GNN-based RL for general combinatorial problems and adversarial planning, and contrasts with approaches for geometric or continuous domains (e.g., area-minimization or visibility-constraint PEGs) that operate in Euclidean or hybrid spaces (Mammadov et al., 19 Nov 2025, Zhou et al., 2024).

In summary, cross-graph reinforcement learning constitutes the current state-of-the-art for scalable, real-time, worst-case robust policy synthesis in graph-based pursuit-evasion games under partial observability, and lays the foundation for further generalization to multi-agent dynamic environments (Lu et al., 21 Nov 2025).
