AgentSquare: Modular LLM Agent Search
- AgentSquare is an automatic LLM agent search framework that decomposes agent design into four core modules: Planning, Reasoning, Tool Use, and Memory.
- It leverages LLM-driven module evolution, recombination, and in-context performance prediction to optimize architectures for diverse tasks.
- Experiments across web, embodied, tool use, and game domains show significant performance gains over human-engineered baselines with interpretable designs.
AgentSquare is an automatic LLM agent search framework that formalizes and operationalizes the Modularized LLM Agent Search (MoLAS) problem. It defines a systematic and extensible design space for LLM-based agents, decomposing them into four core, swappable modules—Planning, Reasoning, Tool Use, and Memory—with uniform IO interfaces. AgentSquare leverages LLM-driven module evolution, module recombination, and in-context performance prediction to search efficiently for agent architectures that optimize task-specific evaluation functions. Experiments across web, embodied, tool, and game application domains demonstrate that AgentSquare outperforms human-engineered baselines and yields interpretable design patterns for agentic systems (Shang et al., 2024).
1. Modularized LLM Agent Search: Formalization and Motivation
The MoLAS problem is defined by introducing a fixed, standardized module pool containing four module types: Planning, Reasoning, Tool Use, and Memory. Each agent is a tuple $A = (P, R, T, M)$, where $P \in \mathcal{P}$, $R \in \mathcal{R}$, $T \in \mathcal{T}$, and $M \in \mathcal{M}$ range over the respective module pools. Given a task description $d$ and a task-level evaluation function $\text{Eval}_d$, the optimization objective is:
$\underset{P \in \mathcal{P}, R \in \mathcal{R}, T \in \mathcal{T}, M \in \mathcal{M}}{\arg\max} \ \text{Eval}_d(P, R, T, M)$ (Modularized LLM Agent Search Objective; Eq. 1 in (Shang et al., 2024))
The rationale for this modular abstraction is threefold:
- Reusability: Existing agent designs can be decomposed into these modules (Chain-of-Thought ↔ Reasoning, WebGPT’s browser advisor ↔ Tool Use, Voyager’s skill memory ↔ Memory).
- Extensibility: The design space expands as new modules are published and added to any of the four pools.
- Searchability: The uniform IO interface (both for code and LLM prompting) enables automatic swapping, facilitating AutoML-style architecture search instead of manual, task-specific engineering.
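To make the searchability point concrete, the MoLAS objective can be sketched as a discrete argmax over the four module pools. This is an illustrative toy, not AgentSquare's code: the pool contents and the stand-in `eval_d` scores are invented for the example.

```python
# Toy sketch of the MoLAS objective (Eq. 1): exhaustive search over the
# four module pools for the (P, R, T, M) tuple maximizing Eval_d.
# Pool contents and scores below are illustrative assumptions.
from itertools import product

planning_pool = ["TD", "none"]        # placeholder module identifiers
reasoning_pool = ["CoT", "ToT"]
tooluse_pool = ["browser", "none"]
memory_pool = ["skill_memory", "none"]

def eval_d(agent):
    """Stand-in task evaluation; a real Eval_d would run full agent rollouts."""
    p, r, t, m = agent
    score = {"CoT": 0.4, "ToT": 0.5}.get(r, 0.0)
    score += 0.2 if m == "skill_memory" else 0.0
    return score

best = max(product(planning_pool, reasoning_pool, tooluse_pool, memory_pool),
           key=eval_d)
print(best)  # → ('TD', 'ToT', 'browser', 'skill_memory')
```

Exhaustive enumeration only works for tiny pools; the point of AgentSquare is to replace this brute-force argmax with LLM-guided search when pools are large and evaluation is expensive.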
2. Modular Design Space: Module Definitions and Interfaces
Each of the four fundamental modules has a well-specified IO contract:
| Module | Input(s) and Output(s) | Functionality Example |
|---|---|---|
| Planning | Task $d$ (plus optional feedback $f$) $\to$ sub-task sequence $s_1, \dots, s_n$ | Task decomposition |
| Reasoning | Sub-task $s_i$ (plus context) $\to$ solution or reasoning trace | Chain-of-Thought, Tree-of-Thought |
| Tool Use | Sub-task $s_i$ and tool pool $\to$ selected tool and invocation | WebGPT-style browsing |
| Memory | Write: observation/action $\to$ updated memory; Retrieve: query $\to$ relevant memory content | Voyager-style skill memory |
All modules operate over textual input, with optional feedback or context, and output type-specific responses (sub-tasks, solutions, tool choices, or memory states).
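The uniform IO contract above can be expressed as a set of abstract interfaces that concrete modules implement. This is a minimal sketch under assumed names (the class and method signatures are illustrative, not AgentSquare's actual code):

```python
# Hypothetical uniform IO interfaces for the four module types.
# Any module implementing the matching interface can be swapped in.
from abc import ABC, abstractmethod

class Planning(ABC):
    @abstractmethod
    def __call__(self, task: str, feedback: str = "") -> list[str]:
        """Decompose a task (plus optional feedback) into sub-tasks."""

class Reasoning(ABC):
    @abstractmethod
    def __call__(self, subtask: str, context: str = "") -> str:
        """Produce a solution or reasoning trace for one sub-task."""

class ToolUse(ABC):
    @abstractmethod
    def __call__(self, subtask: str, tools: list[str]) -> str:
        """Select and invoke a tool from the pool for the sub-task."""

class Memory(ABC):
    @abstractmethod
    def write(self, record: str) -> None:
        """Persist an observation or action."""
    @abstractmethod
    def retrieve(self, query: str) -> str:
        """Return memory content relevant to the query."""

class NaivePlanner(Planning):
    def __call__(self, task, feedback=""):
        return [task]  # trivial plan: the task itself is the only sub-task

print(NaivePlanner()("buy a red mug"))  # → ['buy a red mug']
```

Because every Planning module exposes the same signature, swapping `NaivePlanner` for any other planner requires no changes elsewhere in the agent, which is exactly what makes the space searchable.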
3. AgentSquare Search Framework: Evolution, Recombination, and Surrogate Prediction
The AgentSquare framework employs an iterative, population-based search guided by two LLM-driven processes—module evolution ($\pi_\xi$) and module recombination ($\pi_\theta$)—and a surrogate performance predictor ($\pi_p$) to accelerate selection.
3.1 High-Level Search Algorithm
- Initialization: Start with a randomly-sampled agent $A_0 = (P_0, R_0, T_0, M_0)$.
- Module Evolution ($\pi_\xi$): Generate new module code variants for any of $P, R, T, M$ by prompting $\pi_\xi$ to mutate, recombine, or extend modules, yielding candidate agents $A_e$, each evaluated and recorded in the experience pool $\mathbb{E}$.
- $A_e = \pi_\xi((P'_0, R'_0, T'_0, M'_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$ (Eq. 3)
- Module Recombination ($\pi_\theta$): Swap published modules in and out of $(P_0, R_0, T_0, M_0)$ via argmax selection over the module pools, generating new candidates $A_r$.
- $A_r = \pi_\theta((P_0, R_0, T_0, M_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$ (Eq. 2)
- Performance Predictor ($\pi_p$): For a candidate agent $A'$, predict its performance $v'$ using in-context learning over a small set of past (agent, performance) pairs:
- $v' = \pi_p(A', d, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$ (Eq. 4)
- Selection and Repeat: Retain the best candidates; repeat until convergence or the maximum number of search episodes is reached.
Algorithm 1 in (Shang et al., 2024) details the episode-based alternation, the per-episode population size $N$, and pool/experience management.
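The alternation described above can be sketched as a small search loop. Everything here is a mocked stand-in under stated assumptions: `evolve`, `recombine`, and `surrogate` replace the paper's LLM calls ($\pi_\xi$, $\pi_\theta$, $\pi_p$) with toy functions, and `true_eval` fakes a full rollout.

```python
# Hedged sketch of the AgentSquare search loop (Algorithm 1-style):
# alternate evolution and recombination, screen candidates with a cheap
# surrogate, fully evaluate only the winner. Helpers are toy stand-ins.
import random

random.seed(0)
pools = {"P": ["p1", "p2"], "R": ["CoT", "ToT"], "T": ["t1"], "M": ["m1", "m2"]}

def true_eval(agent):            # expensive full rollout (mocked)
    return sum(len(m) for m in agent.values()) / 10.0

def surrogate(agent, experience):  # stand-in for pi_p's in-context prediction
    return true_eval(agent) + random.uniform(-0.05, 0.05)

def evolve(agent):               # stand-in for pi_xi: mutate one module
    key = random.choice(list(pools))
    return {**agent, key: random.choice(pools[key])}

def recombine(agent, experience):  # stand-in for pi_theta: swap a pooled module
    key = random.choice(list(pools))
    return {**agent, key: random.choice(pools[key])}

agent = {k: random.choice(v) for k, v in pools.items()}
experience = [(agent, true_eval(agent))]    # experience pool E
for episode in range(5):
    candidates = [evolve(agent), recombine(agent, experience)]
    best = max(candidates, key=lambda a: surrogate(a, experience))
    score = true_eval(best)      # only the screened winner is fully evaluated
    experience.append((best, score))
    if score >= true_eval(agent):
        agent = best             # greedy selection
print(agent)
```

The key structural point survives the mocking: full evaluations happen once per episode, while the surrogate absorbs the cost of screening the rest of the population.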
3.2 LLM Roles
- $\pi_\theta$ (proposer): Selects and replaces modules from the pool, informed by real-world performance history.
- $\pi_\xi$ (programmer): Mutates/generates module code, exploring beyond previously published designs.
- $\pi_p$ (predictor): Quickly estimates task performance to reduce expensive full-agent rollouts (about 0.025% of the cost of real evaluation in ALFWorld).
Empirically, the surrogate predictor's scores correlate strongly with real task reward (high Pearson correlation across benchmarks).
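One plausible way to realize the in-context predictor is to serialize past (agent, score) pairs into a prompt and ask the LLM to extrapolate a score for the new candidate. The prompt format and the `llm()` call mentioned in the closing comment are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of an in-context surrogate predictor prompt (pi_p-style):
# past (agent, score) pairs become few-shot examples; the LLM is asked
# to extrapolate a score for the unseen candidate. Format is assumed.
def build_predictor_prompt(candidate, experience, task="ALFWorld"):
    lines = [f"Task: {task}. Predict the score (0-1) of the new agent.",
             "Past agents and measured scores:"]
    for agent, score in experience:
        lines.append(f"- {agent}: {score:.2f}")
    lines.append(f"New agent: {candidate}")
    lines.append("Predicted score:")
    return "\n".join(lines)

experience = [(("TD", "CoT", "browser", "skill_memory"), 0.61),
              (("TD", "ToT", "browser", "none"), 0.54)]
prompt = build_predictor_prompt(("TD", "ToT", "browser", "skill_memory"),
                                experience)
print(prompt)
# The prompt would then go to an LLM, e.g. v = float(llm(prompt)),
# replacing a full-agent rollout with a single completion.
```

Because a completion is orders of magnitude cheaper than a rollout, this kind of screening is what makes the large per-episode candidate populations affordable.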
4. Empirical Evaluation: Benchmarks and Comparative Results
Comprehensive experiments span six benchmarks across four domains:
- Web: WebShop (e-commerce)
- Embodied: ALFWorld (navigation), ScienceWorld (simulated science tasks)
- Tool Use: TravelPlanner (external data search/planning), M3ToolEval (multi-turn tool selection)
- Game: Classical planning via PDDL tasks
Metrics include success rate, progress rate, task score, and micro-pass rate, as appropriate per environment.
Baseline methods:
- 12 prominent human-crafted agents (including Chain-of-Thought, Self-refine, ToT, Step-back, Voyager, HuggingGPT, Generative Agents, DEPS, OPENAGI, DiLu).
- Module-level search: random search and Bayesian optimization over $(P, R, T, M)$ tuples.
- Prompt-level search: OPRO (iterative prompt search).
Main results under GPT-4o (see Table 2 in (Shang et al., 2024)):
- An average performance gain of 17.2% over the best-known human-designed baselines.
- Individual task improvements include +26.1% on ALFWorld, +30.6% on M3Tool, +20.5% on ScienceWorld, +14.1% on WebShop.
- The search trajectory (Figure 1) is smooth and monotonic, unlike plateauing trends in random, Bayesian, or module-only search.
- Per-iteration cost and efficiency: the surrogate predictor $\pi_p$ enables cheap, high-throughput candidate screening (e.g., on ALFWorld).
- High-performing discovered modules persist in the shared module pools for reuse on future tasks (a "catalogue effect").

For example, most tasks converge within 10–20 iterations, and code-level innovations discovered during search are immediately transferable by virtue of the standardized module interface. This one-time search expense contrasts with the repeated per-task engineering of prior work.
7. Key Equations and Formalization Summary
The framework’s main equations and algorithmic operations:
- MoLAS objective (Eq. 1): $\underset{P \in \mathcal{P}, R \in \mathcal{R}, T \in \mathcal{T}, M \in \mathcal{M}}{\arg\max} \ \text{Eval}_d(P, R, T, M)$
- Module recombination (Eq. 2): $A_r = \pi_\theta((P_0, R_0, T_0, M_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$
- Module evolution (Eq. 3): $A_e = \pi_\xi((P'_0, R'_0, T'_0, M'_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$
- Performance prediction (Eq. 4): $v' = \pi_p(A', d, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$
Algorithm 1 describes the detailed alternation, pool management, and experience updating.
In summary, AgentSquare operationalizes LLM agent design as a discrete, standardized, modular search problem, leveraging LLMs both as generative operators (module programming and recombination) and as performance surrogates, resulting in empirically superior and interpretable agent architectures for diverse reasoning and interaction environments (Shang et al., 2024).