Agents of Change: Self-Evolving LLM Agents for Strategic Planning

Published 5 Jun 2025 in cs.AI | (2506.04651v1)

Abstract: Recent advances in LLMs have enabled their use as autonomous agents across a range of tasks, yet they continue to struggle with formulating and adhering to coherent long-term strategies. In this paper, we investigate whether LLM agents can self-improve when placed in environments that explicitly challenge their strategic planning abilities. Using the board game Settlers of Catan, accessed through the open-source Catanatron framework, we benchmark a progression of LLM-based agents, from a simple game-playing agent to systems capable of autonomously rewriting their own prompts and their player agent's code. We introduce a multi-agent architecture in which specialized roles (Analyzer, Researcher, Coder, and Player) collaborate to iteratively analyze gameplay, research new strategies, and modify the agent's logic or prompt. By comparing manually crafted agents to those evolved entirely by LLMs, we evaluate how effectively these systems can diagnose failure and adapt over time. Our results show that self-evolving agents, particularly when powered by models like Claude 3.7 and GPT-4o, outperform static baselines by autonomously adopting their strategies, passing along sample behavior to game-playing agents, and demonstrating adaptive reasoning over multiple iterations.


Summary

The paper introduces a self-evolving framework for LLM agents, achieving a 95% improvement in average victory points through the PromptEvolver architecture.
It compares multiple agent architectures from basic action mapping to autonomous code rewriting, revealing challenges in strategic coherence with advanced models.
The study highlights the potential and limitations of self-improving LLMs in partially observable environments, guiding future research in hybrid AI reasoning.

Self-Evolving LLM Agents for Strategic Planning in Settlers of Catan

The paper "Agents of Change: Self-Evolving LLM Agents for Strategic Planning" explores the enhancement of strategic planning in LLM agents through a novel self-improvement framework. By leveraging the open-source Catanatron framework, the researchers benchmarked various LLM-based agent architectures capable of autonomously refining their gaming strategies in the complex environment of Settlers of Catan. This analysis aims to assess how effectively LLMs can self-evolve and improve long-term strategic gameplay.

Agent Architectures and Methodology

The study introduces four agent architectures, each with progressively more advanced self-improvement capabilities: direct LLM-to-action mapping (BaseAgent), a human-engineered prompt (StructuredAgent), iterative prompt rewriting (PromptEvolver), and autonomous code rewriting (AgentEvolver). The research investigates whether multi-agent collaborative frameworks, drawing inspiration from agent systems such as AutoGPT and Reflexion, can autonomously enhance strategic planning in partially observable environments like Catan.
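The prompt-iteration idea behind PromptEvolver can be illustrated with a minimal sketch. The function names (`evolve_prompt`, `toy_propose`, `toy_evaluate`), the greedy "keep the candidate only if it scores higher" rule, and the toy scoring are all assumptions made for illustration; the summary does not state how the paper accepts or rejects a rewritten prompt, and no real LLM or Catanatron call is made here.

```python
from typing import Callable, List, Tuple


def evolve_prompt(
    initial_prompt: str,
    propose: Callable[[str, float], str],
    evaluate: Callable[[str], float],
    iterations: int = 5,
) -> Tuple[str, float, List[float]]:
    """Greedy prompt evolution (assumed rule): keep a candidate only if it scores higher."""
    best_prompt = initial_prompt
    best_score = evaluate(best_prompt)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best_prompt, best_score)  # stand-in for an LLM rewriting the prompt
        score = evaluate(candidate)                   # stand-in for playing a batch of games
        if score > best_score:
            best_prompt, best_score = candidate, score
        history.append(best_score)
    return best_prompt, best_score, history


# Hypothetical stand-ins so the sketch runs without any model or game engine.
HINTS = ["prioritize ore and wheat", "build toward longest road", "trade surplus early"]


def toy_propose(prompt: str, score: float) -> str:
    # Append the first hint that the prompt does not contain yet.
    for hint in HINTS:
        if hint not in prompt:
            return prompt + " " + hint + "."
    return prompt


def toy_evaluate(prompt: str) -> float:
    # Toy score: 2.0 plus one point per hint present (not real victory points).
    return 2.0 + sum(hint in prompt for hint in HINTS)


best, score, history = evolve_prompt("Play Catan.", toy_propose, toy_evaluate, iterations=5)
print(score, history)
```

In a real setup, `evaluate` would average victory points over many games against a fixed opponent, since a single game is too noisy to compare prompts.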

Figure 1: Overview of Catan gameplay and LLM-agent interaction.

These LLM-driven agents use structured game state representations to inform turn-by-turn decisions. The evaluation pits the agents against AlphaBeta, a heuristic-based bot, and reports metrics such as average victory points and strategic milestones like longest road and largest army.
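As a rough illustration of how such metrics might be aggregated over a batch of games, the sketch below averages victory points and computes how often each milestone was achieved. The `GameRecord` type, its field names, and the sample values are hypothetical and are not taken from the paper or from Catanatron.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class GameRecord:
    """Hypothetical per-game result for the evaluated agent."""
    victory_points: int
    longest_road: bool
    largest_army: bool


def summarize(records: List[GameRecord]) -> Dict[str, float]:
    """Average victory points and milestone rates over a batch of games."""
    if not records:
        raise ValueError("need at least one game record")
    n = len(records)
    return {
        "avg_victory_points": mean(r.victory_points for r in records),
        "longest_road_rate": sum(r.longest_road for r in records) / n,
        "largest_army_rate": sum(r.largest_army for r in records) / n,
    }


# Made-up records, used only to show the call.
sample = [
    GameRecord(10, True, False),
    GameRecord(4, False, False),
    GameRecord(6, True, True),
    GameRecord(8, False, False),
]
print(summarize(sample))
```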

Key Findings

The experiments highlight several significant findings regarding the capabilities and limitations of self-evolving LLM agents:

PromptEvolver Performance: Using the Claude 3.7 model, the PromptEvolver architecture yielded a 95% improvement over the BaseAgent in terms of average victory points, showing notable effectiveness in adapting strategic prompts to dynamic game contexts.
Figure 2: Diagrams of the LLM-based Agent Architectures.
AgentEvolver Results: The AgentEvolver, employing a multi-agent setup with roles like Analyzer, Coder, and Strategizer, demonstrated the ability to autonomously create and iteratively improve game-playing code. However, despite improvements, these agents did not consistently surpass the performance of simpler LLM frameworks or the highly optimized AlphaBeta bot.
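The 95% figure reported for PromptEvolver is most naturally read as a relative improvement in average victory points. Assuming that reading (the summary does not spell out the formula), and writing \bar{V} for average victory points per game:

```latex
\[
\text{improvement}
  = \frac{\bar{V}_{\text{PromptEvolver}} - \bar{V}_{\text{BaseAgent}}}
         {\bar{V}_{\text{BaseAgent}}} \times 100\%
\]
```

Under this definition, a 95% improvement means the evolved agent averages 1.95 times the BaseAgent's victory points, that is, just under double.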

Figure 3: Average Game Length.

Discussion of Evolutionary Processes

PromptEvolver Architecture

The PromptEvolver's success varied significantly across models. While Claude 3.7 showed marked strategic improvement, Mistral Large struggled to make effective use of the self-improvement loop, which suggests that agent performance depends heavily on the underlying LLM's strategic reasoning ability.

AgentEvolver Challenges

The AgentEvolver architecture demonstrated the potential of autonomous code generation but struggled with strategic coherence and with more complex gameplay dynamics. Even so, the models showed promising results in self-directed learning and in interfacing with the Catanatron framework, which suggests that similar evolutionary architectures are worth exploring in other strategic contexts.
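One way to picture the division of labor among the roles named in the abstract (Analyzer, Researcher, Coder, Player) is a single evolution step that passes artifacts from one role to the next. This is a minimal sketch under stated assumptions: the ordering of roles, the `EvolutionState` type, and the toy role functions are hypothetical, and each role would be an LLM call or a game run in practice.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvolutionState:
    """Hypothetical container: current agent source plus accumulated diagnoses."""
    agent_code: str
    notes: List[str] = field(default_factory=list)


def evolution_step(
    state: EvolutionState,
    play: Callable[[str], str],
    analyze: Callable[[str], str],
    research: Callable[[str], str],
    code: Callable[[str, str], str],
) -> EvolutionState:
    """One assumed iteration: play, diagnose, research a fix, rewrite the agent."""
    game_log = play(state.agent_code)        # Player: run games with the current agent
    diagnosis = analyze(game_log)            # Analyzer: identify what went wrong
    idea = research(diagnosis)               # Researcher: look up a strategy to try
    new_code = code(state.agent_code, idea)  # Coder: modify the agent accordingly
    return EvolutionState(new_code, state.notes + [diagnosis])


# Toy role implementations so the sketch runs without a model or game engine.
def toy_play(agent_code: str) -> str:
    return f"log: agent with {agent_code.count('rule')} rules finished the game"


def toy_analyze(game_log: str) -> str:
    return "diagnosis of " + game_log


def toy_research(diagnosis: str) -> str:
    return "rule: expand toward ports"


def toy_code(agent_code: str, idea: str) -> str:
    return agent_code + "\n# " + idea


state = EvolutionState("# agent v0")
for _ in range(2):
    state = evolution_step(state, toy_play, toy_analyze, toy_research, toy_code)
print(state.agent_code)
```

Returning a new state rather than mutating the old one keeps every earlier version of the agent available for comparison or rollback.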

Figure 4: Mistral Large.

Limitations and Further Research

The research has several limitations, including high computational cost and limited generalization to other environments. Results are also bounded by the base model's strategic reasoning ability. Further work on reinforcement learning baselines and on hybrid architectures that combine symbolic and neural reasoning could advance autonomous strategic planning.

Conclusion

This study demonstrates the growing potential of LLMs to develop strategic capabilities autonomously, expanding their role from passive task execution to active self-improvement and strategic design. The findings show that LLM agents can refine long-term strategies in a complex decision-making task with minimal human intervention, although the gains depend on the underlying model. Future research should explore better generalization, tighter integration of symbolic reasoning, and expansion to broader strategic environments.