
GuanDan AI Agents: Multi-Agent Deep RL Methods

Updated 7 February 2026
  • GuanDan AI agents are computational systems designed to autonomously play the complex Chinese climbing card game with high combinatorial action spaces and imperfect information.
  • They integrate deep Monte Carlo reinforcement learning, policy optimization with PPO, and behavior regularization to efficiently handle sparse rewards and large state spaces.
  • LLM-based and theory-of-mind agents enhance strategic decision-making through structured prompts and external RL recommenders, improving partner modeling in multi-agent setups.

GuanDan AI agents are computational systems designed to autonomously play the Chinese climbing game Guandan, a four-player partnership card game characterized by imperfect information, high combinatorial action and state complexity, and emergent cooperative–competitive dynamics. Research into Guandan AI has yielded not only domain-specialized agents leveraging deep reinforcement learning and search, but also benchmarking platforms for multi-agent sequential reasoning, partner modeling, and LLM-based planning (Lu et al., 2022, Zhao et al., 2023, Yanggong et al., 2024, Yim et al., 2024, Li et al., 31 Jan 2026).

1. Formalization and Game-Theoretic Structure

Guandan is formally modeled as a partially observable stochastic game (POSG) involving $N=\{0,1,2,3\}$ agents organized in opposing partnerships. The full game state $s$ includes the public shuffling of two 54-card decks, private 27-card hands per player, the cumulative play history $H$, and team level progressions ($L \in \{2,3,\dots,A\}$). At each turn $t$, the current player selects an action $a_t$ from the legal action set $A_t$, which covers all play-legal card combinations (single, pair, bomb, etc.), passes, and tribute/back-tribute actions during dedicated phases (Li et al., 31 Jan 2026, Zhao et al., 2023).

Agent observations $O_i(s_t)$ expose only the local hand, publicly played cards, and trick history, inducing imperfect information. The reward function $r^i$ is nonzero only at the end of a round, depending on finishing positions; for the winning team, rewards are distributed as $+3, +2, +1$, mirrored negatively for opponents. Credit assignment therefore requires propagating sparse terminal payoffs.
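The terminal payoff scheme above can be sketched as a small function. The seating convention (partners sit opposite, teams $\{0,2\}$ and $\{1,3\}$) and the function name are illustrative assumptions; the $+3/+2/+1$ magnitudes follow the description above:

```python
def terminal_rewards(finish_order):
    """Map a finishing order (player ids 0-3, assumed teams {0,2} vs {1,3})
    to per-player terminal rewards: +3/+2/+1 for the winning team depending
    on the partner's finishing position, mirrored negatively for opponents."""
    first = finish_order[0]
    partner = (first + 2) % 4                    # assumed: teammate sits opposite
    partner_rank = finish_order.index(partner)   # 1, 2, or 3
    magnitude = {1: 3, 2: 2, 3: 1}[partner_rank]
    winners = {first, partner}
    return [magnitude if p in winners else -magnitude for p in range(4)]
```

For example, if both partners finish first and second, each winner receives $+3$ and each opponent $-3$.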

The complexity of Guandan is evidenced by its action-space size (up to $10^4$ legal moves per turn), enormous information-set cardinality ($\sim 10^{36}$), and long-horizon dependencies across multi-round level progression (Yanggong et al., 2024).

2. Agent Architectures and Learning Paradigms

2.1 Deep Monte Carlo Value Learning

The foundational AI approach for Guandan is deep Monte Carlo (DMC) RL, exemplified by the DanZero and SDMC agents (Lu et al., 2022, Zhao et al., 2023, Li et al., 31 Jan 2026). State features are extensively engineered, concatenating hand encodings, partner/opponent history, public card distributions, level status, and wild-card flags. Inputs are structured into high-dimensional vectors—e.g., 513-dim for state (DanZero), or concatenations of 14 observation segments (OpenGuanDan).

For each legal action $a$ at state $s$, the pair $(s,a)$ is encoded and passed to a deep neural regressor $Q_\theta(s,a)$, typically an MLP (e.g., 4–6 hidden layers of 512–1024 units), possibly with an LSTM over historical features (Yanggong et al., 2024). Action selection uses $\varepsilon$-greedy or soft top-$k$ sampling. Learning minimizes the mean-squared error between Q-estimates and Monte Carlo returns:

$$L(\theta) = \mathbb{E}_{(s,a)}\left[\left(Q_\theta(s,a) - G_t\right)^2\right]$$

where $G_t$ is the sum of future rewards to the terminal state (Lu et al., 2022, Zhao et al., 2023, Yanggong et al., 2024).
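The return target and loss can be written compactly in NumPy. This is a minimal sketch assuming undiscounted returns, as described above; function names are illustrative, not the agents' actual training code:

```python
import numpy as np

def mc_returns(rewards):
    """Undiscounted Monte Carlo return G_t: sum of rewards from step t to terminal,
    computed via a reversed cumulative sum."""
    return np.cumsum(np.asarray(rewards, dtype=float)[::-1])[::-1]

def dmc_loss(q_values, rewards):
    """Mean-squared error between Q_theta(s_t, a_t) estimates and MC returns G_t."""
    g = mc_returns(rewards)
    q = np.asarray(q_values, dtype=float)
    return float(np.mean((q - g) ** 2))
```

With a sparse terminal reward, every step of the episode receives the same target, which is how the terminal payoff propagates to early decisions.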

2.2 Policy Optimization and Subgame-Solving

Policy-based RL, notably PPO, augments DMC by training a policy/value MLP over a reduced candidate set of top-$k$ actions scored by a frozen DMC model. The PPO objective

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$

supports stable improvement in policy quality (Zhao et al., 2023).
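The clipped surrogate can be transcribed directly from the formula. This NumPy sketch assumes log-probabilities as inputs and is illustrative, not the published agents' training code:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: mean over samples of
    min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    where r_t = exp(logp_new - logp_old) is the importance ratio."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; large policy steps are capped by the clip term, which is what stabilizes improvement.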

Subgame refinement, as in the GS2 agent of OpenGuanDan, constructs local subgames leveraging pre-learned blueprints and applies depth-limited CFR within reachable infosets, boosting robustness at the cost of inference speed (Li et al., 31 Jan 2026).

2.3 Behavior Regulation via Input Flags

The GuanZero architecture introduces explicit behavior regularization: neural network inputs are augmented with flags encoding whether the agent can and does cooperate (e.g., pass to help a partner), dwarf (play to restrict an opponent holding a minimal hand), or assist (choose a lead that benefits a teammate). These 9-dimensional one-hot flags serve as "behavioral regularizers," inducing learning of human-like partnership tactics (Yanggong et al., 2024).
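One way such flag augmentation could look in code; the 3-behaviors-by-3-statuses layout below is an illustrative assumption consistent with the 9-dimensional flags mentioned above, not GuanZero's exact encoding:

```python
import numpy as np

BEHAVIORS = ("cooperate", "dwarf", "assist")
# Assumed 3-way status per behavior (an illustrative layout, not the paper's):
STATUSES = ("not_applicable", "can_but_does_not", "does")

def behavior_flags(status_by_behavior):
    """Concatenate a one-hot status per behavior into a 9-dim flag vector
    suitable for appending to the network's state encoding."""
    vec = np.zeros(9, dtype=np.float32)
    for i, behavior in enumerate(BEHAVIORS):
        vec[3 * i + STATUSES.index(status_by_behavior[behavior])] = 1.0
    return vec
```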

2.4 LLM-Based and Theory of Mind Agents

Recent work examines LLMs as action planners and partners in Guandan (Yim et al., 2024). LLM agents process textified state and history into action plans via structured prompts. Explicit "Theory of Mind" (ToM) planning enables LLMs to infer both first-order (opponents' and partners' hidden hands) and second-order (others' beliefs about the agent's own intentions) beliefs, leading to textual strategic justification for candidate actions filtered by an RL-based action recommender (Yim et al., 2024). External RL modules provide essential action ranking, overcoming LLM capacity limits in large action spaces.
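The recommender's role reduces, in essence, to filtering the legal action set down to a top-$k$ shortlist before the LLM plans over it. A hypothetical interface (names and signature are assumptions):

```python
def filter_actions(legal_actions, rl_scores, k=5):
    """Keep only the top-k legal actions ranked by an external RL
    recommender's scores, so the LLM planner reasons over a short list
    instead of the full combinatorial action space."""
    ranked = sorted(zip(legal_actions, rl_scores), key=lambda p: p[1], reverse=True)
    return [action for action, _ in ranked[:k]]
```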

3. Distributed Training and System Implementation

Distributed self-play is central for sample efficiency in high-complexity environments. DMC-based agents are typically trained with tens to hundreds of parallel actor processes, each running simulated episodes under the current policy parameters and sharing sampled trajectories with a central learner on GPU. Actor–learner parameter sync ensures the online policy tracks global network updates. Replay buffers are managed at large scale (e.g., 65,536 samples, $10^6$ for DanZero), and parameter update intervals are tuned for stability (Lu et al., 2022, Zhao et al., 2023, Li et al., 31 Jan 2026).
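A minimal fixed-capacity FIFO buffer of the kind described, sketched under the assumption of uniform sampling (the cited systems' exact buffer implementations may differ):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer into which actor processes push
    transitions and from which the learner draws training batches."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at current size.
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```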

Specialized submodules handle card-play, tribute, and back-tribute phases, with most agents relying on heuristic handovers for non-standard round phases. LLM/ToM agents modularize via tool-call interfaces and prompt templates, with integration for external action recommenders and per-agent API endpoints (as in OpenGuanDan) (Yim et al., 2024, Li et al., 31 Jan 2026).

4. Benchmarks, Evaluation, and Empirical Results

The OpenGuanDan framework (Li et al., 31 Jan 2026) standardizes environment APIs, action/observation wrappers, and GUI, supporting both RL agent and LLM agent integration. Evaluation protocols span:

  • Pairwise competition: Each agent team plays 1000+ games versus opponents, reporting fraction of rounds with team finishes of 1st & 2nd.
  • Human–AI matchups: Multiple human teams (stratified by expertise) play hundreds of rounds against agents.
  • Metrics: Win-rate, average team rank, and inference overhead.

Key findings:

  • Learning-based agents (SDMC, DanZero) consistently surpass rule-based agents (win rates $>80\%$), with GS2 yielding further 5–10% improvements via subgame solving.
  • Human-level or near-human-level performance is achieved, but not superhuman; GS2/SDMC secure $\approx 40\%$ against advanced human teams.
  • ToM-augmented LLMs trail SOTA RL: GPT-4 with second-order ToM achieves a team point average of –0.88 against the Danzero+ SOTA baseline (Yim et al., 2024).
  • Language and prompt engineering matter: Chinese LLM prompts for Guandan outperform English due to training-set alignment.
  • Action filtering via external RL recommenders is essential in large action spaces—removal drops LLM agent performance by over two points on average.

5. Agent Design Tradeoffs, Limitations, and Open Challenges

Effective Guandan AI necessitates:

  • Rich structured state encodings aggregating private, public, and temporal information to address partial observability.
  • Comprehensive legal action enumeration at each step, mitigating policy head combinatorial explosion.
  • Careful feature engineering and hyperparameter tuning, particularly wild-card/level progression tracking.
  • Monte Carlo credit assignment over long episodes; reward shaping is by design minimal to avoid bias.
  • Modular APIs enabling hybrid RL–LLM planning and extensibility to tribute-phase learning.
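The legal-action-enumeration requirement above can be illustrated on a simplified hand. This sketch covers singles and pairs only; the real enumerator also handles triples, straights, bombs, tribute actions, and wild cards:

```python
from collections import Counter

def enumerate_simple_actions(hand):
    """Enumerate a subset of legal combinations (singles and pairs only)
    from a hand given as a list of rank strings. Illustrative only: the
    full Guandan action space reaches up to ~10^4 moves per turn."""
    counts = Counter(hand)
    actions = [("single", rank) for rank in counts]
    actions += [("pair", rank) for rank, c in counts.items() if c >= 2]
    return actions
```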

Principal challenges include tribute phases (still usually heuristic), sample inefficiency due to sparse rewards, the absence of robust superhuman play, and high computational requirements. Proposed directions—suggested in benchmark and case studies—include incorporation of explicit belief modeling (e.g., particle filtering for opponent hands), co-training of LLMs with self-play RL data, meta-learned behavioral regularization, hybrid CFR/RL lookahead, and adaptive curriculum learning (Zhao et al., 2023, Yanggong et al., 2024, Li et al., 31 Jan 2026, Yim et al., 2024).

6. Benchmarks and Reproducibility

OpenGuanDan provides an open-source, standardized environment with per-player APIs, reproducible JSON-based observation/action interfaces, and built-in support for both neural and heuristic agents (Li et al., 31 Jan 2026). Public codebases for DMC/PPO agents (DanZero, Danzero+), behavior-regularized networks (GuanZero), ToM-enabled LLM agents, and all supporting scripts are cited to encourage further experimental rigor and extensibility to related imperfect-information, cooperative–competitive card games (Yim et al., 2024, Yanggong et al., 2024, Zhao et al., 2023).

7. Conclusions and Outlook

GuanDan AI research surfaces algorithmic advances in multi-agent RL, partner modeling, credit assignment, and imperfect-information reasoning under massive state/action spaces. Despite notable progress—especially from distributed deep RL architectures and recent modular ToM-augmented LLM agents—no agent yet exhibits universally superhuman play. Benchmarking efforts and open environments facilitate systematic evaluation and cross-comparison, while behavioral regularization and hybrid inference techniques remain promising avenues for bridging remaining gaps in cooperative and adversarial intelligence (Yanggong et al., 2024, Li et al., 31 Jan 2026, Yim et al., 2024).
