LLM Go Evaluation Benchmark
- LLM Go Evaluation Benchmark is a comprehensive framework that standardizes next-move prediction and full-game play assessments for LLMs in the game of Go.
- It utilizes large-scale datasets from millions of human-played games, enriched with detailed annotations and top-10 move candidates from KataGo for robust evaluation.
- The benchmark highlights the performance gap between general LLMs and specialized engines, guiding improvements through tailored reinforcement learning and strategic reasoning methodologies.
LLM Go Evaluation Benchmarks are standardized protocols and datasets designed to rigorously assess the capabilities of LLMs in performing Go-related tasks, including both static next-move prediction and dynamic full-game play. These benchmarks emerged in response to the marked performance gap between general-purpose LLMs and Go-specific AI agents, serving as a platform for quantifiable comparison and targeted advancement of LLM strategic reasoning and gameplay proficiency (Ma et al., 23 Jan 2026). The introduction of holistic evaluation frameworks—such as the KataGo-Bench-1K—has enabled clear separation of reasoning skills, empirical calibration against professional engines and human expertise, and methodical progress tracking in specialized domains.
1. Benchmark Structure and Task Formulation
LLM Go Evaluation Benchmarks typically feature two complementary components, each probing a different facet of model competence. The static “KataGo-Bench-1K” next-move prediction task presents a sequence of historical moves from human-played 19×19 games and requires the LLM to predict the coordinate of the next stone. Predictions are scored against KataGo, an open-source Monte Carlo Tree Search (MCTS) Go engine: a prediction counts as correct if it appears within KataGo’s top-10 candidates for the given board state.
The dynamic full-game match protocol evaluates an LLM by direct competitive play against baseline models and engines under standard Chinese/area scoring rules. The LLM must select moves in an autoregressive fashion, alternating with its opponent, with victory determined by final territory counts. This dual-axis approach distinguishes static recall accuracy (the ability to match professional-level intuition) from dynamic performance (actual win rate against strong adversaries) (Ma et al., 23 Jan 2026).
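The static success criterion described above reduces to a membership check against KataGo's candidate list. The sketch below illustrates it; the function names and coordinate strings are illustrative assumptions, not the benchmark's published harness.

```python
# Illustrative sketch of the static "top-10 hit" criterion: an LLM prediction
# is scored as correct if it matches any of KataGo's top-10 candidate moves.
# Function names and coordinates (e.g. "Q16") are assumptions for this sketch.

def is_hit(predicted_move: str, top10_moves: list[str]) -> bool:
    """True if the LLM's predicted coordinate is among KataGo's top-10."""
    return predicted_move in top10_moves

def next_move_accuracy(predictions: list[str],
                       references: list[list[str]]) -> float:
    """Fraction of positions where the prediction falls in the top-10 set."""
    hits = sum(is_hit(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)

# Example: two positions, one hit.
preds = ["Q16", "C3"]
refs = [["Q16", "D4", "R4"], ["D16", "Q4"]]
print(next_move_accuracy(preds, refs))  # 0.5
```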
2. Dataset Composition and Annotation Procedures
The development of the Go evaluation benchmarks involves curation of multiple large-scale datasets specifically tailored to both training and evaluation:
- Next-Move Prediction Corpus: Sourced from 5 million human-played Go records, providing 10 million uniformly sampled mid-game positions. Each position is annotated by KataGo with top-10 candidate moves, per-move win-rate estimates, and one-step rollouts to create robust ground-truth references.
- Natural Language Commentary Dataset: Comprises 100,000 human-written commentaries linked to individual board states, emphasizing both strategic context and proper Go terminology.
- Chain-of-Thought Reasoning Data: The Go-specific datasets are mixed with general reasoning corpora (including Openthoughts-114K and NuminaMath-QwQ-CoT-5M), yielding a comprehensive set of over 6 million CoT examples in code, math, and broader inference domains.
Each data point is embedded in a heuristic reasoning template, requiring models to (1) identify the next player to act, (2) analyze move candidates with brief strategic variations, (3) summarize the optimal decision, and (4) output a structured move prediction with associated win-rate. This annotation procedure enforces both semantic precision and strategic depth (Ma et al., 23 Jan 2026).
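The four-step heuristic template above can be sketched as a prompt builder. The exact wording and output schema used by the benchmark are not reproduced here, so the strings below are assumptions.

```python
# Sketch of the four-step heuristic reasoning template: (1) next player,
# (2) candidate analysis, (3) decision summary, (4) structured move output.
# The wording and JSON-like output schema are assumptions for illustration.

def build_template(moves: list[str]) -> str:
    # Black plays first, so an even number of stones means Black to act.
    next_player = "Black" if len(moves) % 2 == 0 else "White"
    return (
        f"Moves so far: {', '.join(moves)}\n"
        f"1. Next to act: {next_player}\n"
        "2. Analyze candidate moves with brief strategic variations.\n"
        "3. Summarize the optimal decision.\n"
        "4. Output: {\"move\": <coordinate>, \"win_rate\": <estimate>}\n"
    )

print(build_template(["Q16", "D4", "Q4"]))  # three stones placed: White to act
```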
3. Evaluation Protocols and Metrics
LLM Go benchmarks utilize a standardized suite of metrics and experimental controls:
- Next-Move Accuracy: For a test set of size $N$, the metric is $\text{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\left[\hat{m}_i \in \mathcal{T}_i\right]$, where $\hat{m}_i$ is the model’s predicted move for position $i$ and $\mathcal{T}_i$ the set of top-10 KataGo moves.
- Win Rate: In dynamic matches, $\text{WinRate} = \frac{\text{games won}}{\text{games played}}$.
- Elo Rating: Tournament outcomes are mapped to Elo ratings: a player’s expected score is computed as $E = \frac{1}{1 + 10^{(R_{\text{opp}} - R)/400}}$, with post-match updates via $R' = R + K(S - E)$, where $S$ is the realized match score and $K$ a fixed update factor.
- Correlation: Elo ratings are empirically validated by their strong correlation with static move recall, ensuring that next-move accuracy is a fair proxy for overall playing strength.
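The Elo computation follows the standard expected-score and update rule; a minimal sketch is below. The benchmark's K-factor is not stated here, so K=32 is purely an illustrative default.

```python
# Minimal Elo rating sketch: expected score from the rating gap, then a
# post-match update. K=32 is a common default chosen for illustration only;
# the benchmark's actual K-factor is not specified in this summary.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> float:
    """Post-match rating for A; score_a is 1 for a win, 0.5 draw, 0 loss."""
    return r_a + k * (score_a - expected_score(r_a, r_b))

# Two equally rated players: the winner gains K/2 points.
print(update(1500, 1500, 1.0))  # 1516.0
```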
All competitive matches use industry-standard board representations, with carefully sampled queries and strict separation between training and test splits to prevent information leakage (Ma et al., 23 Jan 2026).
4. Baseline Models and Experimental Results
Benchmarks include both general-purpose LLMs and Go-specialized engines as baselines. Table 1 below summarizes static next-move prediction accuracy (KataGo-Bench-1K):
| Model | KataGo-Bench-1K (%) |
|---|---|
| DeepSeek-R1 | 17.6 |
| o1-mini | 27.3 |
| Claude3.7-Sonnet | 34.3 |
| Qwen2.5-7B-Instruct | 8.0 |
| LoGos(7B) | 88.1 |
| LoGos(32B) | 88.6 |
| KataGo-HumanSL-9d | 87.8 |
General LLMs (DeepSeek-R1, Claude-3.7-Sonnet, and the Qwen variants) consistently underperform (<35% accuracy) relative to Go-specific engines and LoGos. The LoGos model, trained on mixed Go expertise and broad CoT data followed by Group Relative Policy Optimization, surpasses even top-amateur engines (KataGo-HumanSL-9d), achieves ≈50% win rate against professional-level artificial opponents, and wins >95% of games against all LLM baselines (Ma et al., 23 Jan 2026).
5. Training Methodologies and Integration of Expert Knowledge
Model training leverages a two-stage pipeline:
- Mixed Cold-Start Supervised Fine-Tuning: LoGos is first trained on the 10 million next-move samples, 100K commentaries, and the 6M-example general CoT reasoning corpus. Sequence lengths reach 32,768 tokens, batch sizes reach 512, and training runs on arrays of A800 GPUs.
- Self-Exploration via Reinforcement Learning: The Group Relative Policy Optimization (GRPO) algorithm replaces a simple top-1 reward with tiered feedback based on both candidate rank and win-rate proximity. The reward-shaping parameters enforce fine-grained matching with KataGo’s outputs, penalizing deviation from optimal moves while still rewarding strategic exploration.
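A tiered reward of this shape, decaying with KataGo rank and win-rate gap, can be sketched as follows. The specific tier values and decay rule are illustrative assumptions, not the paper's actual parameters.

```python
# Illustrative tiered reward: instead of a binary top-1 reward, the signal
# decays with the move's rank in KataGo's top-10 and with the win-rate gap
# to KataGo's best move. Decay rules and values are assumptions.

def tiered_reward(move: str,
                  top10: list[str],
                  win_rates: dict[str, float]) -> float:
    if move not in top10:
        return 0.0                       # outside KataGo's candidate list
    rank = top10.index(move)             # 0 = KataGo's best move
    rank_term = 1.0 - rank / len(top10)  # higher reward for better rank
    gap = win_rates[top10[0]] - win_rates[move]  # win-rate proximity term
    return rank_term * max(0.0, 1.0 - gap)

top10 = ["Q16", "D4", "R4"]
wr = {"Q16": 0.55, "D4": 0.53, "R4": 0.48}
print(tiered_reward("Q16", top10, wr))  # 1.0 (best move, zero gap)
```

The key design point is that a near-optimal move still earns a graded reward, which gives the policy a denser learning signal than a sparse exact-match criterion.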
Ablation studies reveal that omitting Go-specific cold-start, heuristic templates, or using sparse rewards sharply limits achievable accuracy (ceilings ≤67%), indicating the necessity of domain-specific scaffolding for high-fidelity Go reasoning. Rendering board states as 2-D arrays further mitigates long-range context degradation, maintaining accuracy even on deep (>200 move) sequences (Ma et al., 23 Jan 2026).
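A 2-D board rendering of the kind the ablation points to can be sketched as below. The textual encoding (".", "X", "O") and the capture-free stone placement are simplifying assumptions for illustration.

```python
# Sketch of rendering a move sequence as a 2-D board array, so the model
# sees the full position rather than a long linear move list. Captures are
# ignored for brevity; the ".", "X", "O" encoding is an assumption.

def render_board(moves: list[tuple[int, int]], size: int = 19) -> str:
    """Place alternating Black ("X") / White ("O") stones on an empty grid."""
    grid = [["." for _ in range(size)] for _ in range(size)]
    for i, (row, col) in enumerate(moves):
        grid[row][col] = "X" if i % 2 == 0 else "O"  # Black plays first
    return "\n".join(" ".join(r) for r in grid)

board = render_board([(3, 3), (15, 15)], size=19)
print(board.splitlines()[3])  # row 3 shows the Black stone at column 3
```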
6. Interpretative Insights and Forward Directions
The LLM Go evaluation benchmark sets a robust foundation for systematic quantification of reasoning and gameplay in a domain where search-based AI previously set the performance ceiling. Most strikingly, the gap between LoGos and general LLMs highlights the inadequacy of naïve next-move fitting and the importance of mixed data regimes combined with tailored RL. The high correlation between static recall and dynamic Elo substantiates the use of KataGo-Bench-1K as a meaningful proxy for true playing strength.
A plausible implication is that rigorous benchmarking in Go—encompassing reasoning templates, diverse human-played positions, and fine-grained evaluation—can inform development of LLMs in other combinatorial games and decision sciences. The benchmark’s release opens avenues for further research on natural-language reasoning, transfer learning, and game-specific self-optimization protocols for LLMs (Ma et al., 23 Jan 2026).