
GuanDan AI Agents: Multi-Agent Deep RL Methods

Updated 7 February 2026
  • GuanDan AI agents are computational systems designed to autonomously play the complex Chinese climbing card game with high combinatorial action spaces and imperfect information.
  • They integrate deep Monte Carlo reinforcement learning, policy optimization with PPO, and behavior regularization to efficiently handle sparse rewards and large state spaces.
  • LLM-based and theory-of-mind agents enhance strategic decision-making through structured prompts and external RL recommenders, improving partner modeling in multi-agent setups.

GuanDan AI agents are computational systems designed to autonomously play the Chinese climbing game Guandan, a four-player partnership card game characterized by imperfect information, high combinatorial action and state complexity, and emergent cooperative–competitive dynamics. Research into Guandan AI has yielded not only domain-specialized agents leveraging deep reinforcement learning and search, but also benchmarking platforms for multi-agent sequential reasoning, partner modeling, and LLM-based planning (Lu et al., 2022, Zhao et al., 2023, Yanggong et al., 2024, Yim et al., 2024, Li et al., 31 Jan 2026).

1. Formalization and Game-Theoretic Structure

Guandan is formally modeled as a partially observable stochastic game (POSG) involving $N=\{0,1,2,3\}$ agents organized in opposing partnerships. The full game state $s$ includes the public shuffling of two 54-card decks, private 27-card hands per player, the cumulative play history $H$, and team level progressions ($L \in \{2,3,\dots,A\}$). At each turn $t$, the current player selects an action $a_t$ from the legal action set $A_t$, which covers all play-legal card combinations (single, pair, bomb, etc.), passes, and tribute/back-tribute actions during dedicated phases (Li et al., 31 Jan 2026, Zhao et al., 2023).

Agent observations $O_i(s_t)$ expose only the local hand, publicly played cards, and trick history, inducing imperfect information. The reward function $r^i$ is nonzero only at the end of a round, depending on finishing positions; for the winning team, rewards are distributed as $+3, +2, +1$, mirrored negatively for opponents. Credit assignment therefore requires propagating sparse terminal payoffs.
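The terminal payoff scheme above can be sketched as a small function. The seating convention (partners sit opposite, teams $\{0,2\}$ and $\{1,3\}$) and the function name are illustrative assumptions; the $+3/+2/+1$ magnitudes follow the description above:

```python
def terminal_rewards(finish_order):
    """Map a finishing order (player ids 0-3, assumed teams {0,2} vs {1,3})
    to per-player terminal rewards: +3/+2/+1 for the winning team depending
    on the partner's finishing position, mirrored negatively for opponents."""
    first = finish_order[0]
    partner = (first + 2) % 4                    # assumed: teammate sits opposite
    partner_rank = finish_order.index(partner)   # 1, 2, or 3
    magnitude = {1: 3, 2: 2, 3: 1}[partner_rank]
    winners = {first, partner}
    return [magnitude if p in winners else -magnitude for p in range(4)]
```

For example, if both partners finish first and second, each winner receives $+3$ and each opponent $-3$.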

The complexity of Guandan is evidenced by its action-space size (up to $10^4$ legal moves per turn), enormous information-set cardinality ($\sim 10^{36}$), and long-horizon dependencies across multi-round level progression (Yanggong et al., 2024).

2. Agent Architectures and Learning Paradigms

2.1 Deep Monte Carlo Value Learning

The foundational AI approach for Guandan is deep Monte Carlo (DMC) RL, exemplified by the DanZero and SDMC agents (Lu et al., 2022, Zhao et al., 2023, Li et al., 31 Jan 2026). State features are extensively engineered, concatenating hand encodings, partner/opponent history, public card distributions, level status, and wild-card flags. Inputs are structured into high-dimensional vectors—e.g., 513-dim for state (DanZero), or concatenations of 14 observation segments (OpenGuanDan).

For each legal action $a$ at state $s$, the pair $(s,a)$ is encoded and passed to a deep neural regressor $Q_\theta(s,a)$, typically an MLP (e.g., 4–6 hidden layers of 512–1024 units), possibly with an LSTM over historical features (Yanggong et al., 2024). Action selection uses $\varepsilon$-greedy or soft top-$k$ sampling. Learning minimizes the mean-squared error between Q-estimates and Monte Carlo returns:

$$L(\theta) = \mathbb{E}_{(s,a)}\left[\left(Q_\theta(s,a) - G_t\right)^2\right]$$

where $G_t$ is the sum of future rewards to the terminal state (Lu et al., 2022, Zhao et al., 2023, Yanggong et al., 2024).
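The return target and loss can be written compactly in NumPy. This is a minimal sketch assuming undiscounted returns, as described above; function names are illustrative, not the agents' actual training code:

```python
import numpy as np

def mc_returns(rewards):
    """Undiscounted Monte Carlo return G_t: sum of rewards from step t to terminal,
    computed via a reversed cumulative sum."""
    return np.cumsum(np.asarray(rewards, dtype=float)[::-1])[::-1]

def dmc_loss(q_values, rewards):
    """Mean-squared error between Q_theta(s_t, a_t) estimates and MC returns G_t."""
    g = mc_returns(rewards)
    q = np.asarray(q_values, dtype=float)
    return float(np.mean((q - g) ** 2))
```

With a sparse terminal reward, every step of the episode receives the same target, which is how the terminal payoff propagates to early decisions.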

2.2 Policy Optimization and Subgame-Solving

Policy-based RL, notably PPO, augments DMC by training a policy/value MLP over a reduced candidate set of top-$k$ actions scored by a frozen DMC model. The PPO objective

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$

supports stable improvement in policy quality (Zhao et al., 2023).
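The clipped surrogate can be transcribed directly from the formula. This NumPy sketch assumes log-probabilities as inputs and is illustrative, not the published agents' training code:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate: mean over samples of
    min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    where r_t = exp(logp_new - logp_old) is the importance ratio."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * adv, clipped * adv)))
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; large policy steps are capped by the clip term, which is what stabilizes improvement.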

Subgame refinement, as in the GS2 agent of OpenGuanDan, constructs local subgames leveraging pre-learned blueprints and applies depth-limited CFR within reachable infosets, boosting robustness at the cost of inference speed (Li et al., 31 Jan 2026).

2.3 Behavior Regulation via Input Flags

The GuanZero architecture introduces explicit behavior regularization: neural network inputs are augmented with flags encoding whether the agent can and does cooperate (e.g., pass to help a partner), dwarf (play to restrict an opponent holding a minimal hand), or assist (choose a lead that benefits a teammate). These 9-dimensional one-hot flags serve as "behavioral regularizers," inducing learning of human-like partnership tactics (Yanggong et al., 2024).
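One way such flag augmentation could look in code; the 3-behaviors-by-3-statuses layout below is an illustrative assumption consistent with the 9-dimensional flags mentioned above, not GuanZero's exact encoding:

```python
import numpy as np

BEHAVIORS = ("cooperate", "dwarf", "assist")
# Assumed 3-way status per behavior (an illustrative layout, not the paper's):
STATUSES = ("not_applicable", "can_but_does_not", "does")

def behavior_flags(status_by_behavior):
    """Concatenate a one-hot status per behavior into a 9-dim flag vector
    suitable for appending to the network's state encoding."""
    vec = np.zeros(9, dtype=np.float32)
    for i, behavior in enumerate(BEHAVIORS):
        vec[3 * i + STATUSES.index(status_by_behavior[behavior])] = 1.0
    return vec
```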

2.4 LLM-Based and Theory of Mind Agents

Recent work examines LLMs as action planners and partners in Guandan (Yim et al., 2024). LLM agents process textified state and history into action plans via structured prompts. Explicit "Theory of Mind" (ToM) planning enables LLMs to infer both first-order (opponents' and partners' hidden hands) and second-order (others' beliefs about the agent's own intentions) beliefs, leading to textual strategic justification for candidate actions filtered by an RL-based action recommender (Yim et al., 2024). External RL modules provide essential action ranking, overcoming LLM capacity limits in large action spaces.
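The recommender's role reduces, in essence, to filtering the legal action set down to a top-$k$ shortlist before the LLM plans over it. A hypothetical interface (names and signature are assumptions):

```python
def filter_actions(legal_actions, rl_scores, k=5):
    """Keep only the top-k legal actions ranked by an external RL
    recommender's scores, so the LLM planner reasons over a short list
    instead of the full combinatorial action space."""
    ranked = sorted(zip(legal_actions, rl_scores), key=lambda p: p[1], reverse=True)
    return [action for action, _ in ranked[:k]]
```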

3. Distributed Training and System Implementation

Distributed self-play is central for sample efficiency in high-complexity environments. DMC-based agents are typically trained with tens to hundreds of parallel actor processes, each running simulated episodes under the current policy parameters and sharing sampled trajectories with a central learner on GPU. Actor–learner parameter sync ensures the online policy tracks global network updates. Replay buffers are managed at large scale (e.g., 65,536 samples, $10^6$ for DanZero), and parameter update intervals are tuned for stability (Lu et al., 2022, Zhao et al., 2023, Li et al., 31 Jan 2026).
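A minimal fixed-capacity FIFO buffer of the kind described, sketched under the assumption of uniform sampling (the cited systems' exact buffer implementations may differ):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer into which actor processes push
    transitions and from which the learner draws training batches."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at current size.
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

    def __len__(self):
        return len(self.buf)
```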

Specialized submodules handle card-play, tribute, and back-tribute phases, with most agents relying on heuristic handovers for non-standard round phases. LLM/ToM agents modularize via tool-call interfaces and prompt templates, with integration for external action recommenders and per-agent API endpoints (as in OpenGuanDan) (Yim et al., 2024, Li et al., 31 Jan 2026).

4. Benchmarks, Evaluation, and Empirical Results

The OpenGuanDan framework (Li et al., 31 Jan 2026) standardizes environment APIs, action/observation wrappers, and GUI, supporting both RL agent and LLM agent integration. Evaluation protocols span:

  • Pairwise competition: Each agent team plays 1000+ games versus opponents, reporting fraction of rounds with team finishes of 1st & 2nd.
  • Human–AI matchups: Multiple human teams (stratified by expertise) play hundreds of rounds against agents.
  • Metrics: Win-rate, average team rank, and inference overhead.

Key findings:

  • Learning-based agents (SDMC, DanZero) consistently surpass rule-based agents (win rates $>80\%$), with GS2 yielding further 5–10% improvements via subgame solving.
  • Human-level or near-human-level performance is achieved, but not superhuman; GS2/SDMC secure $\approx 40\%$ against advanced human teams.
  • ToM-augmented LLMs trail SOTA RL: GPT-4 with second-order ToM achieves a team point average of –0.88 against the Danzero+ SOTA baseline (Yim et al., 2024).
  • Language and prompt engineering matter: Chinese LLM prompts for Guandan outperform English due to training-set alignment.
  • Action filtering via external RL recommenders is essential in large action spaces—removal drops LLM agent performance by over two points on average.

5. Agent Design Tradeoffs, Limitations, and Open Challenges

Effective Guandan AI necessitates:

  • Rich structured state encodings aggregating private, public, and temporal information to address partial observability.
  • Comprehensive legal action enumeration at each step, mitigating policy head combinatorial explosion.
  • Careful feature engineering and hyperparameter tuning, particularly wild-card/level progression tracking.
  • Monte Carlo credit assignment over long episodes; reward shaping is by design minimal to avoid bias.
  • Modular APIs enabling hybrid RL–LLM planning and extensibility to tribute-phase learning.
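The legal-action-enumeration requirement above can be illustrated on a simplified hand. This sketch covers singles and pairs only; the real enumerator also handles triples, straights, bombs, tribute actions, and wild cards:

```python
from collections import Counter

def enumerate_simple_actions(hand):
    """Enumerate a subset of legal combinations (singles and pairs only)
    from a hand given as a list of rank strings. Illustrative only: the
    full Guandan action space reaches up to ~10^4 moves per turn."""
    counts = Counter(hand)
    actions = [("single", rank) for rank in counts]
    actions += [("pair", rank) for rank, c in counts.items() if c >= 2]
    return actions
```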

Principal challenges include tribute phases (still usually heuristic), sample inefficiency due to sparse rewards, the absence of robust superhuman play, and high computational requirements. Proposed directions—suggested in benchmark and case studies—include incorporation of explicit belief modeling (e.g., particle filtering for opponent hands), co-training of LLMs with self-play RL data, meta-learned behavioral regularization, hybrid CFR/RL lookahead, and adaptive curriculum learning (Zhao et al., 2023, Yanggong et al., 2024, Li et al., 31 Jan 2026, Yim et al., 2024).

6. Benchmarks and Reproducibility

OpenGuanDan provides an open-source, standardized environment with per-player APIs, reproducible JSON-based observation/action interfaces, and built-in support for both neural and heuristic agents (Li et al., 31 Jan 2026). Public codebases for DMC/PPO agents (DanZero, Danzero+), behavior-regularized networks (GuanZero), ToM-enabled LLM agents, and all supporting scripts are cited to encourage further experimental rigor and extensibility to related imperfect-information, cooperative–competitive card games (Yim et al., 2024, Yanggong et al., 2024, Zhao et al., 2023).

7. Conclusions and Outlook

GuanDan AI research surfaces algorithmic advances in multi-agent RL, partner modeling, credit assignment, and imperfect-information reasoning under massive state/action spaces. Despite notable progress—especially from distributed deep RL architectures and recent modular ToM-augmented LLM agents—no agent yet exhibits universally superhuman play. Benchmarking efforts and open environments facilitate systematic evaluation and cross-comparison, while behavioral regularization and hybrid inference techniques remain promising avenues for bridging remaining gaps in cooperative and adversarial intelligence (Yanggong et al., 2024, Li et al., 31 Jan 2026, Yim et al., 2024).
