
Game of Thought (GoT): Structured Reasoning

Updated 9 February 2026
  • Game of Thought (GoT) is a framework that employs graph-based, recursive, and game-theoretic reasoning to enhance problem-solving in intelligent systems.
  • It integrates partial solutions through thought transformations like generation, aggregation, and refinement, surpassing traditional chain- and tree-based methods.
  • Empirical evaluations show that GoT reduces errors and query costs in tasks such as sorting, adversarial search, and recursive reasoning across multi-agent settings.

Game of Thought (GoT) refers to a family of frameworks, algorithms, and formal models that leverage graph-based, recursive, or game-theoretic reasoning structures to advance information processing and structured problem solving in intelligent agents—especially LLMs and multi-agent systems. GoT spans several closely related lines of research, including graph-based LLM prompting, strategic information seeking, deception-aware language game play, and formal computational Theory of Mind. Across these instantiations, GoT enables the decomposition of complex problems into interdependent subcomponents represented as graph structures or belief hierarchies, supporting bidirectional reasoning, multi-agent perspective taking, and principled optimization. This entry surveys the principal formalisms, algorithmic techniques, and empirical demonstrations of GoT, drawing from major lines of work (Besta et al., 2023, Lei et al., 2023, Wang et al., 2023, Cui et al., 2 Feb 2026, Zhu et al., 27 Nov 2025).

1. Graph-Based Prompting and Reasoning with LLMs

GoT as a graph-based reasoning framework was introduced to supersede Chain-of-Thought (CoT) and Tree-of-Thought (ToT) approaches by structuring LLM-generated "thoughts" as vertices in an arbitrary directed graph (Besta et al., 2023). In this model, each vertex encodes a partial solution, intermediate result, or hypothesis, while directed edges represent conditional dependencies—often the result of explicit prompt-based conditioning.

A GoT session is formally a 4-tuple (G, 𝒯, ℰ, ℛ) where:

  • G = (V, E, c) is a directed graph of LLM thoughts; V is the set of vertices, E ⊆ V × V the directed edges, and c: V → C an optional class labeling.
  • 𝒯 is the set of permitted thought transformations (e.g., Generation, Aggregation, Refinement), each τ mapping (G, p_θ) ↦ G′.
  • ℰ: V × G × p_θ → ℝ is an evaluator assigning quality scores to thoughts.
  • ℛ is a ranker that returns the top-h thoughts.

This framework subsumes CoT (a chain) and ToT (a tree) by supporting internal aggregation (merging of partial solutions), loops (feedback/refinement), and arbitrary cross-connections. The volume of a thought—the number of preceding thoughts from which a path leads to it—quantifies how much prior reasoning it integrates. Key thought transformations include:

  • Generation: Expanding thoughts from parent nodes.
  • Aggregation: Merging multiple thoughts into composite nodes.
  • Refinement: Iterative improvement via loops.
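The transformations above can be sketched as operations on a small thought graph. This is a minimal, self-contained illustration; the class and method names are our own invention, not the API of the GoT implementation from Besta et al. (2023), and a stub splitter/merger stands in for LLM calls.

```python
# Minimal sketch of a Graph-of-Thoughts structure (hypothetical API).
# Thoughts are vertices; transformations add vertices and dependency edges.

from dataclasses import dataclass, field

@dataclass
class Thought:
    content: str
    score: float = 0.0

@dataclass
class ThoughtGraph:
    vertices: list = field(default_factory=list)   # Thought objects
    edges: set = field(default_factory=set)        # (parent_idx, child_idx)

    def add(self, thought, parents=()):
        self.vertices.append(thought)
        idx = len(self.vertices) - 1
        for p in parents:
            self.edges.add((p, idx))
        return idx

    def generate(self, parent, expand):
        """Generation: expand a parent thought into several new thoughts."""
        return [self.add(Thought(c), parents=[parent])
                for c in expand(self.vertices[parent].content)]

    def aggregate(self, parents, merge):
        """Aggregation: merge several thoughts into one composite thought."""
        merged = merge([self.vertices[p].content for p in parents])
        return self.add(Thought(merged), parents=parents)

    def volume(self, idx):
        """Number of vertices from which a path reaches `idx`."""
        preds = {idx}
        changed = True
        while changed:
            changed = False
            for (a, b) in self.edges:
                if b in preds and a not in preds:
                    preds.add(a)
                    changed = True
        return len(preds) - 1

# Toy use: sort a list by splitting (generation) and merging (aggregation).
g = ThoughtGraph()
root = g.add(Thought("8 3 5 1"))
halves = g.generate(root, lambda s: [" ".join(s.split()[:2]), " ".join(s.split()[2:])])
sorted_halves = [g.aggregate([h], lambda cs: " ".join(sorted(cs[0].split(), key=int)))
                 for h in halves]
final = g.aggregate(sorted_halves, lambda cs: " ".join(sorted(" ".join(cs).split(), key=int)))
print(g.vertices[final].content)   # "1 3 5 8"
print(g.volume(final))             # 5 predecessors feed the final thought
```

The merge-sort-style toy mirrors the sorting benchmark: partial solutions created by generation are later fused by aggregation, and the final thought's volume records how many earlier thoughts it integrates.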

GoT's structure enables combining partial solutions into synergistic outcomes—akin to networked cognition or neurological integration. Empirical benchmarking on sorting, set intersection, and document-merging tasks demonstrates GoT yielding substantial quality gains (a 62% reduction in median sorting error at 31% lower inference cost for P = 128 digits) over ToT and CoT baselines (Besta et al., 2023).

2. Inspection, Verification, and Logical Rigour

GoT frameworks introduce rigorous multi-pass verification mechanisms that go beyond simple sampling or heuristic scoring. In (Lei et al., 2023), GoT includes explicit "inspection" steps in which candidate reasoning transitions are repeatedly checked by independent LLM instantiations ("inspectors"); only those transitions on which all inspectors agree are accepted into the solution graph.

This inspection-based selection differs fundamentally from ToT's scoring approach: where ToT selects paths based on the probability P_LLM(s | C) exceeding a threshold, GoT's acceptance probability is (P_LLM(s_max | C))^n for n inspectors, enforcing stricter path validity and robustly pruning error-prone inferences.
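The effect of the unanimity rule is easy to see numerically. The snippet below tabulates the acceptance probability P^n for a few illustrative step probabilities (the values are not measured model probabilities):

```python
# Acceptance rule from the inspection scheme: a step survives only if all
# n independent inspectors accept it, so acceptance probability is p**n.

def got_acceptance(p_step: float, n_inspectors: int) -> float:
    """Probability that all n independent inspectors accept a step."""
    return p_step ** n_inspectors

for p in (0.95, 0.80, 0.60):
    accepts = [round(got_acceptance(p, n), 3) for n in (1, 3, 5)]
    print(f"P_LLM = {p}: n=1,3,5 -> {accepts}")
```

High-confidence steps (p ≈ 0.95) survive multiple inspectors nearly intact, while marginal steps (p ≈ 0.6) are suppressed exponentially—consistent with the observed diminishing returns beyond roughly five inspectors.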

Benchmarking on tasks such as the 24-point game, high-degree polynomial solving, and closed-form sequence derivation demonstrates that GoT:

  • Raises raw reasoning accuracy by 89.7%, 86%, and 56% over direct LLM prompting;
  • Outperforms ToT by 23%, 24%, and 15% absolute accuracy on these tasks;
  • Yields monotonic accuracy gains with additional inspectors, though with diminishing returns beyond n ≈ 5.

These experimental results provide empirical support for the hypothesis that graph-based, recursively inspected reasoning significantly enhances LLM robustness in multi-step inferential domains (Lei et al., 2023).

3. Game-Theoretic Reasoning: Information Seeking and Adversarial Robustness

The GoT paradigm also encompasses game-theoretic frameworks for adversarial information-seeking and multi-agent strategy. In (Cui et al., 2 Feb 2026), Game of Thought is formulated as a robust information-seeking procedure for LLMs via a game-theoretic abstraction of the game "Twenty Questions," introducing the Strategic Language Search (SLS) problem.

The SLS problem is defined as a two-player, zero-sum extensive-form game with imperfect information:

  • The "Chooser" selects a hidden item s* ∈ S.
  • The "Questioner" sequentially asks yes/no questions, aiming to single out s* with minimal queries.
  • The search is formalized as an extensive-form game, with the Questioner’s choices forming the information set and the Chooser’s selection acting as an initial adversarial move.

GoT uses depth-limited subgame solving—building local subtrees, attaching heuristic leaf payoffs, and running counterfactual regret minimization (CFR) to approximate Nash equilibrium strategies—which provably yields "safe" (i.e., minimax-optimal up to the horizon) behavior. Empirical results show that GoT reduces worst-case cost (e.g., number of queries) by 10–40% over prior methods such as Uncertainty-of-Thought (UoT) and direct prompting across domains including object identification, medical diagnosis, and troubleshooting (Cui et al., 2 Feb 2026).
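A depth-1 analogue of this worst-case-safe behavior is the minimax halving rule: against an adversarial Chooser, a yes/no question splits the candidate set into (S_yes, S_no), the adversary steers play to the larger side, so a safe Questioner picks the question minimizing max(|S_yes|, |S_no|). This is a deliberate simplification of the paper's method (which uses depth-limited CFR, not plain enumeration), and the item/question lists are invented for illustration:

```python
# Depth-1 worst-case-safe question selection for a Twenty-Questions-style
# search: minimize the size of the larger side of the induced split.

def safest_question(candidates, questions):
    """Return the question whose worst-case remaining set is smallest.

    `questions` maps a question label to a yes/no predicate over candidates.
    """
    def worst_case(pred):
        yes = sum(1 for c in candidates if pred(c))
        return max(yes, len(candidates) - yes)
    return min(questions, key=lambda q: worst_case(questions[q]))

items = ["cat", "dog", "sparrow", "eagle", "salmon", "trout"]
qs = {
    "is it a mammal?":        lambda x: x in {"cat", "dog"},
    "does it fly?":           lambda x: x in {"sparrow", "eagle"},
    "is it a fish or a cat?": lambda x: x in {"salmon", "trout", "cat"},
}
print(safest_question(items, qs))  # the 3-vs-3 split: "is it a fish or a cat?"
```

The 2-vs-4 splits leave four candidates in the worst case, while the balanced 3-vs-3 split guarantees at most three remain—the minimax choice regardless of how the Chooser answers.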

4. Recursive Contemplation and Deception in Language Games

A variant of GoT arises in the context of interactive multi-agent reasoning under deception, as demonstrated in the Avalon game setting (Wang et al., 2023). Here, the "Game-of-Thoughts" is characterized by alternating internal contemplation (private chain-of-thought) and public speech, both of which are explicitly modeled.

The Recursive Contemplation (ReCon) framework structures each turn into:

  • Formulation contemplation: Generates internal thought and speech drafts using first-order perspective transitions—agents infer others' mental states.
  • Refinement contemplation: Refines drafts by simulating second-order perspectives (how others will interpret one’s speech), thereby safeguarding against privacy leaks or rhetorical traps.
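The two stages above can be sketched as a prompting pipeline. This is a schematic only: the `llm` function is a stand-in for a real model call, and the prompt wording and helper names are illustrative, not the exact templates from Wang et al. (2023):

```python
# Schematic of ReCon's two contemplation stages with a placeholder LLM.

def llm(prompt: str) -> str:
    # Placeholder: a real system would call GPT-3.5/GPT-4 here.
    return f"<response to: {prompt[:40]}...>"

def recon_turn(observation: str) -> str:
    # Stage 1 -- formulation contemplation: first-order perspective taking
    # produces a private thought and a draft utterance.
    thought = llm(f"Privately infer the other players' mental states given: {observation}")
    draft = llm(f"Draft a public statement consistent with this private thought: {thought}")

    # Stage 2 -- refinement contemplation: second-order perspective taking
    # simulates how others will interpret the draft, then revises it to
    # avoid leaking private information or walking into rhetorical traps.
    critique = llm(f"How would an adversary interpret and exploit this statement? {draft}")
    return llm(f"Revise the statement to address the critique. Statement: {draft}; Critique: {critique}")

print(recon_turn("Player 3 accused Player 1 of being evil."))
```

The key design point is that the draft utterance never reaches the other players directly: it is always filtered through a simulated adversarial reading first.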

In experiments with LLMs (GPT-3.5, GPT-4) playing Avalon, ReCon dramatically improves deception detection and privacy preservation:

  • Good-side win rates increase from 15% (CoT baseline) to 83.3% (ReCon).
  • GPT-4 prefers ReCon-generated speech over CoT in all dimensions (logic, concealment, persuasiveness, etc.) by substantial margins.

Ablations confirm the additive value of formulation, refinement, and recursive perspective-taking. However, no formal equilibrium guarantees are provided; all findings are empirical (Wang et al., 2023).

5. Game-Theoretic Theory of Mind and Recursive Reasoning

The Game of Thought also underpins a computational Theory of Mind (ToM) framework for multi-agent systems, leveraging boundedly rational, recursively nested beliefs (Zhu et al., 27 Nov 2025). Each agent maintains Bayesian (Poisson-Gamma) beliefs over the reasoning depth of others, instantiates level-k strategies via recursive best-response, and acts according to the induced mixture.

Formally:

  • The system is modeled as a stochastic game (N, S, A, T, R) in which agents optimize discounted rewards.
  • Agents ascribe to others a hierarchy of reasoning levels, from level-0 (naïve) through level-k best-responders.
  • Recursive policies π_j|k are constructed either as pure-level policies (best-response at each depth) or as mixtures weighted by posterior beliefs over others' reasoning levels.
  • Each agent iteratively updates beliefs about others' levels, solves the induced MDP or QMDP, and best-responds.
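The recipe above can be made concrete in a toy 2×2 game. The sketch below uses matching pennies, a truncated Poisson prior over the opponent's depth, and a deterministic tie-breaking rule; the game, prior parameter, and tie-breaking are illustrative choices of ours, not specifics from Zhu et al.:

```python
# Toy level-k Theory-of-Mind reasoner: Poisson belief over the opponent's
# reasoning depth, recursive best response, best response to the mixture.

import math

# Payoffs indexed [own_action][opponent_action].
# Row player wants to match; column player wants to mismatch.
ROW = [[1, -1], [-1, 1]]
COL = [[-1, 1], [1, -1]]

def best_response(payoff, opp_policy):
    """Deterministic best response to a mixed opponent policy (ties -> action 0)."""
    evs = [sum(payoff[a][b] * opp_policy[b] for b in range(2)) for a in range(2)]
    a = max(range(2), key=lambda i: evs[i])
    return [1.0 if i == a else 0.0 for i in range(2)]

def level_policy(k, payoff, opp_payoff):
    """Level-0 is uniform; level-k best-responds to the opponent at level k-1."""
    if k == 0:
        return [0.5, 0.5]
    return best_response(payoff, level_policy(k - 1, opp_payoff, payoff))

def poisson_weights(lam, kmax):
    """Truncated, renormalized Poisson(lam) prior over levels 0..kmax."""
    w = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax + 1)]
    z = sum(w)
    return [x / z for x in w]

# Row player believes column's depth ~ truncated Poisson(1.5), then
# best-responds to the induced mixture over column policies.
weights = poisson_weights(1.5, 4)
col_mixture = [sum(w * level_policy(k, COL, ROW)[b] for k, w in enumerate(weights))
               for b in range(2)]
print("belief over column's action:", [round(p, 3) for p in col_mixture])
print("row best response:", best_response(ROW, col_mixture))
```

Level-k policies in matching pennies cycle between the two actions as depth increases, so the belief-weighted mixture stays near uniform; the agent's action is then the best response to that induced mixture rather than to any single assumed depth.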

This approach yields a fully computable, statistically principled ToM process—capable of dynamically adapting to observed opponent behaviors and supporting strategic anticipation. The framework admits both theoretical analysis and toy demonstrations (e.g., gridworld navigation) (Zhu et al., 27 Nov 2025).

6. Limitations, Extensions, and Future Directions

GoT frameworks, despite demonstrated empirical and theoretical benefits, face several limitations:

  • Scalability: Subgame solving and recursive simulation scale reasonably for |S| ≲ 100, but not to thousands of items without further abstraction (Cui et al., 2 Feb 2026).
  • Formality vs. Human-Likeness: Over-formality in LLM outputs can degrade play in human-facing settings; balancing formality and strategic depth remains open (Wang et al., 2023).
  • Binary and Restricted Queries: Most frameworks currently support only binary questions or fixed formats; extending to open-ended or multimodal queries is non-trivial (Cui et al., 2 Feb 2026).
  • Empirical Guarantees: Most performance claims are empirical; formal convergence and equilibrium guarantees are often absent or limited to bounded horizons.
  • Oracle Fidelity and Robustness: Assumptions such as noiseless oracle answers can be unrealistic; future work must incorporate error-handling and human-in-the-loop variations.

Extension directions include development of new thought transformations (e.g., subgraph distillation), richer multi-agent collaboration (possibly integrating explicit Tool Use as special thought nodes), and reinforcement learning meta-controllers to optimize transformation sequences under cost constraints (Besta et al., 2023).

7. Synthesis and Comparative Summary

GoT, across instantiations, brings two principal advances: (i) graph-based, recursively verifiable reasoning architectures enabling robust problem decomposition and solution aggregation, and (ii) explicit embedding of game-theoretic and Theory-of-Mind principles in reasoning under uncertainty and interaction. These underpinnings enable LLMs and agents to approach large, decomposable, or adversarial problems with heightened robustness, sample efficiency, and interpretability relative to chain- or tree-based methods.

A comparative summary of key GoT paradigms appears below:

Paradigm | Core Mechanism | Principal Domain | Strengths
Graph of Thoughts (GoT) | Directed graph of thoughts | LLM prompting | Arbitrary interdependence, merges, feedback loops, extensibility (Besta et al., 2023, Lei et al., 2023)
Game-theoretic GoT | Subgame equilibrium search | Information seeking (20Q) | Worst-case optimality, adversarial robustness (Cui et al., 2 Feb 2026)
ReCon in GoT | Recursive contemplation | Deceptive language games | Deception resistance, perspective-taking (Wang et al., 2023)
ToM GoT | Recursive level-k beliefs | Multi-agent ToM | Bounded rationality, online belief updates (Zhu et al., 27 Nov 2025)

GoT thus provides a generalizable, theoretically principled substrate for structured reasoning in LLMs and agents, supporting robust performance on elaborate computational and interactive tasks.
