- The paper introduces Ataraxos, a novel AI system that uses self-play reinforcement learning and test-time search to excel in Stratego despite its complex hidden information.
- It features dual transformer networks for setup and move selection, dynamic update regularization, and GPU-accelerated simulation achieving high throughput.
- Ataraxos achieves an 85% effective win rate against a four-time world champion at a fraction of the cost of prior systems, setting a new benchmark for imperfect-information game AI.
Introduction
This paper presents Ataraxos, an AI system that achieves superhuman performance in Stratego—a complex imperfect-information board game—using tabula rasa self-play reinforcement learning (RL) and novel test-time search. Unlike previous efforts, which required industrial-scale expenditure (e.g., DeepNash [10]), Ataraxos attains dominance at a fraction of the cost (∼8,000 USD). This achievement is technically significant given Stratego’s immense complexity and its unprecedented level of hidden information, which thwarts adaptations of Go/Chess/Poker paradigms. The work details the architectural, algorithmic, and empirical innovations needed to handle the imperfect-information RL setting, including separation of learning processes, transformer-based policy/value modeling, dynamically damped RL, belief modeling for opponent hidden states, and specialized GPU-accelerated simulation.
Stratego’s state space of possible piece setups (more than 10^33) and play trajectories renders exhaustive enumeration infeasible, and the dependencies among contemporaneous/counterfactual actions (e.g., sandbagging, bluffing) induce highly nonstationary learning dynamics. Successfully training agents in such settings requires mitigating learning instability, cycling, and collapse—tasks aggravated by imperfect information. The paper demonstrates that previously successful methodologies (e.g., DeepStack [11], Pluribus [8], AlphaZero [2], DeepNash [10]) are either inapplicable or prohibitively expensive for Stratego.
Ataraxos System Overview
Self-Play RL Architecture
Ataraxos decomposes training into two interdependent modules handled by separate self-play processes:
- Setup Network: Decoder-only transformer autoregressively generates piece placements with dense positional encoding, trained via Monte Carlo returns and maximum entropy regularization.
- Move Network: Encoder-only transformer infers legal moves from tokenized board/game history, using a query-key matrix product for efficient move selection and a myopic reverse KL penalty for regularization.
These modules learn separately but in concert: setups determine the initial states of move-phase games, and game outcomes feed training signal back to both.
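The move network's query-key move selection can be illustrated with a minimal NumPy sketch. The function name `move_logits`, the projection matrices, and the toy legality mask below are illustrative assumptions, not the paper's actual implementation; the point is that one matrix product scores every (from, to) pair at once.

```python
import numpy as np

def move_logits(square_emb, legal_mask, w_q, w_k):
    """Score every (from, to) move as a query-key dot product.

    square_emb: (S, d) per-square embeddings from the encoder
    legal_mask: (S, S) boolean, True where from->to is legal
    w_q, w_k:   (d, d) learned projections (hypothetical names)
    """
    q = square_emb @ w_q              # (S, d) "moving piece" queries
    k = square_emb @ w_k              # (S, d) "destination" keys
    logits = q @ k.T                  # (S, S) score for each from->to pair
    return np.where(legal_mask, logits, -np.inf)  # mask out illegal moves

rng = np.random.default_rng(0)
S, d = 100, 32                        # 10x10 Stratego board, toy embedding dim
emb = rng.standard_normal((S, d))
mask = rng.random((S, S)) < 0.1       # toy legality mask
logits = move_logits(emb, mask, rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))
z = np.exp(logits - logits.max())     # softmax over all legal moves
probs = z / z.sum()
```

Because illegal moves receive a logit of negative infinity, they get exactly zero probability after the softmax, so sampling only ever produces legal moves.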
Dynamic Damping and Update Regularization
A key contribution is tying learning update size and regularization strength to policy maturity. Early training employs large updates and strong regularization, annealed to small updates and weak regularization as the policy strengthens. Mechanisms include reverse KL penalties (to prior policies and a “magnet” random-move policy), advantage filtering, gradient norm clipping, and learning rate schedules, all dynamically adapted.
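The interplay of these mechanisms can be sketched as a toy objective. The linear annealing schedule, the coefficient values, and the advantage-filtering threshold below are illustrative guesses, not the paper's hyperparameters; the sketch only shows how a reverse-KL penalty can be weakened as training progresses.

```python
import numpy as np

def reverse_kl(p, q):
    """KL(p || q) for discrete distributions; 'reverse' KL when p is the learner."""
    return float(np.sum(p * np.log(p / q)))

def damped_loss(policy, prior, magnet, advantages, step, total_steps):
    """Toy objective: a policy-weighted advantage term plus annealed KL penalties.

    Schedule and coefficients are illustrative, not the paper's values.
    """
    progress = step / total_steps
    kl_coef = 1.0 * (1.0 - progress) + 0.01 * progress       # strong -> weak
    adv = np.where(np.abs(advantages) > 0.1, advantages, 0.0)  # advantage filtering
    pg = -float(np.sum(policy * adv))                         # toy "improvement" term
    return pg + kl_coef * (reverse_kl(policy, prior) + reverse_kl(policy, magnet))

uniform = np.ones(4) / 4
prior = np.array([0.7, 0.1, 0.1, 0.1])
adv = np.zeros(4)
early = damped_loss(uniform, prior, uniform, adv, step=0, total_steps=100)
late = damped_loss(uniform, prior, uniform, adv, step=100, total_steps=100)
```

With the same disagreement between policy and prior, the penalty dominates early in training (`early`) and nearly vanishes late (`late`), matching the annealing described above.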
Efficient Simulation and Data Generation
The authors implement a Stratego simulator in GPU-resident CUDA C++ capable of ∼10M state updates/sec on an H100, with the replay buffer integrated on-device for near-linear scaling. Direct policy sampling proves more effective than search-based variants for self-play data generation, owing to higher throughput and sufficient support coverage. Use of bfloat16 and exponential moving parameter averages further accelerates training while maintaining performance stability.
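The throughput gains of a GPU-resident simulator come from advancing many independent games in lockstep rather than one at a time. The following NumPy sketch stands in for that pattern with a trivially simple "game" (a counter advanced to a terminal value); the real simulator applies Stratego-specific transitions in CUDA C++, so everything here is a simplified assumption.

```python
import numpy as np

def batched_step(positions, moves, done):
    """Advance many independent games with one vectorized operation.

    A toy stand-in for a GPU-resident simulator: each "game" is just a
    counter that terminates once it reaches 100.
    """
    active = ~done
    positions[active] += moves[active]   # only unfinished games advance
    done |= positions >= 100             # toy terminal condition
    return positions, done

rng = np.random.default_rng(1)
n_games = 4096                           # one lane per game, as on a GPU
positions = np.zeros(n_games, dtype=np.int64)
done = np.zeros(n_games, dtype=bool)
steps = 0
while not done.all():
    moves = rng.integers(1, 4, size=n_games)   # toy actions in {1, 2, 3}
    positions, done = batched_step(positions, moves, done)
    steps += 1
```

The per-step cost is amortized over thousands of games, which is the source of the near-linear scaling claimed for the real simulator.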
Belief Modeling and Test-Time Search
To reason about hidden opponent pieces, Ataraxos combines two components:
- Belief Network: Processes the known state and decodes hidden configurations using a transformer encoder-decoder architecture with dropout for generalization.
- Search Procedure: Prior to each move, Ataraxos samples ∼1,000 possible hidden states, runs 40-ply depth-limited rollouts from each candidate move using the move network, and averages the resulting value predictions to perform a magnetic mirror descent update, regularized by KL coefficients and more aggressive than training updates because its effects are local to the current decision.
This approach ensures robust move selection against arbitrary opponents and leverages substantial computational investment at decision time.
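The search step can be sketched as averaging rollout values over sampled beliefs and then applying a closed-form magnetic mirror descent update over candidate moves. The step sizes (`lr`, `alpha`) and the stand-in rollout values below are illustrative assumptions; the closed form shown is the standard discrete-action MMD update, not necessarily the paper's exact formulation.

```python
import numpy as np

def mmd_update(policy, magnet, q_values, lr, alpha):
    """One magnetic mirror descent step (closed form for discrete actions).

    Maximizes <q, pi> - alpha*KL(pi, magnet) - (1/lr)*KL(pi, policy).
    lr and alpha are illustrative, not the paper's values.
    """
    t = 1.0 / (1.0 + lr * alpha)
    logits = t * (np.log(policy) + lr * q_values) + (1.0 - t) * np.log(magnet)
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy search: average the value of each candidate move over sampled hidden states.
rng = np.random.default_rng(2)
n_moves, n_samples = 5, 1000
values = rng.standard_normal((n_samples, n_moves))  # stand-in rollout values
q = values.mean(axis=0)                             # average over belief samples
prior = np.ones(n_moves) / n_moves
magnet = np.ones(n_moves) / n_moves
post = mmd_update(prior, magnet, q, lr=10.0, alpha=0.05)
```

A larger `lr` (feasible at test time because the update affects only the current decision) concentrates the resulting policy more sharply on the highest-value move, while the magnet term keeps it from collapsing entirely.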
Training and Evaluation Metrics
Ataraxos was trained for one week (setup/move networks on 16 H100s; belief network on 4 H100s for 4 days), generating ∼200M finished games in total. The evaluation against four-time world champion Pim Niemeijer yielded 15 wins, 1 loss, and 4 draws across 20 games, an 85% effective win rate (draws scored as half wins); the margin is statistically highly significant (p < 0.00026 under an i.i.d. assumption). Ataraxos further achieved a 95% win rate against a diverse field in live play at the World Championship, and played faster than elite humans (1.26 s/move with full search).
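The quoted significance figure is consistent with a one-sided sign test over the 16 decisive games; that reading is an assumption here, since the paper's exact test is not stated in this summary. A short stdlib check:

```python
from math import comb

wins, losses, draws = 15, 1, 4
games = wins + losses + draws
effective_win_rate = (wins + 0.5 * draws) / games   # draws scored as half wins

# One-sided sign test on the 16 decisive games (assumed reading of the claim):
# probability of >= 15 wins out of 16 under a fair coin.
n = wins + losses
p = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

This yields an 85% effective win rate and p ≈ 0.00026, matching the figures above.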
Empirical analyses show Ataraxos draws more frequently and plays significantly longer games than humans, indicating profound differences in risk management and positional inference.
Play Style and Strategic Differences
Expert feedback reveals marked divergences in both setup and play style:
- Frequent use of aggressive, less-predictable setups (e.g., high-value pieces forward, bombed-in Flags in back corners).
- Less reliance on human-standard bluffs, more deliberate preservation of information utility.
- Superior endgame play, defending, adaptation from information deficit, and long-term positional punishment.
- Willingness to draw in openings if progression is counterproductive.
- Outlier tactics perceived by humans as "arrogant" or unnaturally lucky, underlining a fundamental difference in information processing and exploitation.
Comparative Analysis: DeepNash vs. Ataraxos
DeepNash required over 3 million USD in compute (1024 TPU nodes, 2-3 months), was not evaluated against elite contemporaries, and did not attain top rankings on competitive sites. Ataraxos, trained for ∼8,000 USD on commodity hardware and tested against the best human player and a championship field, clearly surpasses previous efforts in both cost efficiency and performance, establishing a new state of the art for AI in imperfect-information games.
Future Directions
- Belief Model: Stronger compute-normalized performance is likely achievable via architectures incorporating temporal or recurrent features, possibly at tradeoffs in runtime/memory for RL but with promising gains for belief updates.
- Search Algorithms: The current procedure is limited by its single-step approximation; multi-step search, subgame solving, or knowledge-limited approaches could further exploit available compute.
- Generalization: The demonstrated practicality of strategically superhuman AI at non-industrial cost presages broad application to settings with fast, accurate simulators and extensive hidden information (e.g., finance, security, negotiation).
Conclusion
The combination of modern RL and principled search, when married to transformers and GPU-centric simulation, is now sufficient to master strategic decision making in domains with massive hidden-information state spaces. Ataraxos establishes a new benchmark in Stratego and signals that cost-effective, superhuman AI is viable for many real-world imperfect-information problems with reasonable simulation throughput. The work opens research avenues for further architectural, algorithmic, and hardware optimizations, and has material implications for the future of AI agents in strategic settings.