AgentSquare: Modular LLM Agent Search
- AgentSquare is an automatic LLM agent search framework that decomposes agent design into four core modules: Planning, Reasoning, Tool Use, and Memory.
- It leverages LLM-driven module evolution, recombination, and in-context performance prediction to optimize architectures for diverse tasks.
- Experiments across web, embodied, tool use, and game domains show significant performance gains over human-engineered baselines with interpretable designs.
AgentSquare is an automatic LLM agent search framework that formalizes and operationalizes the Modularized LLM Agent Search (MoLAS) problem. It defines a systematic and extensible design space for LLM-based agents, decomposing them into four core, swappable modules—Planning, Reasoning, Tool Use, and Memory—with uniform IO interfaces. AgentSquare leverages LLM-driven module evolution, module recombination, and in-context performance prediction to search efficiently for agent architectures that optimize task-specific evaluation functions. Experiments across web, embodied, tool, and game application domains demonstrate that AgentSquare outperforms human-engineered baselines and yields interpretable design patterns for agentic systems (Shang et al., 2024).
1. Modularized LLM Agent Search: Formalization and Motivation
The MoLAS problem is defined by introducing a fixed, standardized module pool containing four module types: Planning, Reasoning, Tool Use, and Memory. Each agent is a tuple $A = (P, R, T, M)$, where $P \in \mathcal{P}$, $R \in \mathcal{R}$, $T \in \mathcal{T}$, and $M \in \mathcal{M}$ range over the respective module pools. Given a task description $d$ and a task-level evaluation function $\text{Eval}_d$, the optimization objective is:
$\underset{P \in \mathcal{P}, R \in \mathcal{R}, T \in \mathcal{T}, M \in \mathcal{M}}{\arg\max} \ \text{Eval}_d(P, R, T, M)$ (Modularized LLM Agent Search Objective; Eq. 1 in (Shang et al., 2024))
The rationale for this modular abstraction is threefold:
- Reusability: Existing agent designs can be decomposed into these modules (Chain-of-Thought ↔ Reasoning, WebGPT’s browser advisor ↔ Tool Use, Voyager’s skill memory ↔ Memory).
- Extensibility: The design space expands as new modules are published and added to any of the four pools.
- Searchability: The uniform IO interface (both for code and LLM prompting) enables automatic swapping, facilitating AutoML-style architecture search instead of manual, task-specific engineering.
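To make the searchability point concrete, the MoLAS objective can be sketched as a discrete argmax over the four module pools. This is an illustrative toy, not AgentSquare's code: the pool contents and the stand-in `eval_d` scores are invented for the example.

```python
# Toy sketch of the MoLAS objective (Eq. 1): exhaustive search over the
# four module pools for the (P, R, T, M) tuple maximizing Eval_d.
# Pool contents and scores below are illustrative assumptions.
from itertools import product

planning_pool = ["TD", "none"]        # placeholder module identifiers
reasoning_pool = ["CoT", "ToT"]
tooluse_pool = ["browser", "none"]
memory_pool = ["skill_memory", "none"]

def eval_d(agent):
    """Stand-in task evaluation; a real Eval_d would run full agent rollouts."""
    p, r, t, m = agent
    score = {"CoT": 0.4, "ToT": 0.5}.get(r, 0.0)
    score += 0.2 if m == "skill_memory" else 0.0
    return score

best = max(product(planning_pool, reasoning_pool, tooluse_pool, memory_pool),
           key=eval_d)
print(best)  # → ('TD', 'ToT', 'browser', 'skill_memory')
```

Exhaustive enumeration only works for tiny pools; the point of AgentSquare is to replace this brute-force argmax with LLM-guided search when pools are large and evaluation is expensive.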
2. Modular Design Space: Module Definitions and Interfaces
Each of the four fundamental modules has a well-specified IO contract:
| Module | Input(s) and Output(s) | Functionality Example |
|---|---|---|
| Planning | Task $d$ (plus optional feedback $f$) $\to$ sub-task sequence $s_1, \dots, s_n$ | Task decomposition |
| Reasoning | Sub-task $s_i$ (plus context) $\to$ solution or reasoning trace | Chain-of-Thought, Tree-of-Thought |
| Tool Use | Sub-task $s_i$ and tool pool $\to$ selected tool and invocation | WebGPT-style browsing |
| Memory | Write: observation/action $\to$ updated memory; Retrieve: query $\to$ relevant memory content | Voyager-style skill memory |
All modules operate over textual input, with optional feedback or context, and output type-specific responses (sub-tasks, solutions, tool choices, or memory states).
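The uniform IO contract above can be expressed as a set of abstract interfaces that concrete modules implement. This is a minimal sketch under assumed names (the class and method signatures are illustrative, not AgentSquare's actual code):

```python
# Hypothetical uniform IO interfaces for the four module types.
# Any module implementing the matching interface can be swapped in.
from abc import ABC, abstractmethod

class Planning(ABC):
    @abstractmethod
    def __call__(self, task: str, feedback: str = "") -> list[str]:
        """Decompose a task (plus optional feedback) into sub-tasks."""

class Reasoning(ABC):
    @abstractmethod
    def __call__(self, subtask: str, context: str = "") -> str:
        """Produce a solution or reasoning trace for one sub-task."""

class ToolUse(ABC):
    @abstractmethod
    def __call__(self, subtask: str, tools: list[str]) -> str:
        """Select and invoke a tool from the pool for the sub-task."""

class Memory(ABC):
    @abstractmethod
    def write(self, record: str) -> None:
        """Persist an observation or action."""
    @abstractmethod
    def retrieve(self, query: str) -> str:
        """Return memory content relevant to the query."""

class NaivePlanner(Planning):
    def __call__(self, task, feedback=""):
        return [task]  # trivial plan: the task itself is the only sub-task

print(NaivePlanner()("buy a red mug"))  # → ['buy a red mug']
```

Because every Planning module exposes the same signature, swapping `NaivePlanner` for any other planner requires no changes elsewhere in the agent, which is exactly what makes the space searchable.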
3. AgentSquare Search Framework: Evolution, Recombination, and Surrogate Prediction
The AgentSquare framework employs an iterative, population-based search guided by two LLM-driven processes—module evolution ($\pi_\xi$) and module recombination ($\pi_\theta$)—and a surrogate performance predictor ($\pi_p$) to accelerate selection.
3.1 High-Level Search Algorithm
- Initialization: Start with a randomly-sampled agent $A_0 = (P_0, R_0, T_0, M_0)$.
- Module Evolution ($\pi_\xi$): Generate new module code variants for any of $P, R, T, M$ by prompting $\pi_\xi$ to mutate, recombine, or extend modules, yielding candidate agents $A_e$, each evaluated and recorded in the experience pool $\mathbb{E}$.
- $A_e = \pi_\xi((P'_0, R'_0, T'_0, M'_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$ (Eq. 3)
- Module Recombination ($\pi_\theta$): Swap published modules in and out of $(P_0, R_0, T_0, M_0)$ via argmax selection over the module pools, generating new candidates $A_r$.
- $A_r = \pi_\theta((P_0, R_0, T_0, M_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$ (Eq. 2)
- Performance Predictor ($\pi_p$): For a candidate agent $A'$, predict its performance $v'$ using in-context learning over a small set of past (agent, performance) pairs:
- $v' = \pi_p(A', d, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$ (Eq. 4)
- Selection and Repeat: Retain the best candidates; repeat until convergence or the maximum number of search episodes is reached.
Algorithm 1 in (Shang et al., 2024) details the episode-based alternation, the per-episode population size $N$, and pool/experience management.
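The alternation described above can be sketched as a small search loop. Everything here is a mocked stand-in under stated assumptions: `evolve`, `recombine`, and `surrogate` replace the paper's LLM calls ($\pi_\xi$, $\pi_\theta$, $\pi_p$) with toy functions, and `true_eval` fakes a full rollout.

```python
# Hedged sketch of the AgentSquare search loop (Algorithm 1-style):
# alternate evolution and recombination, screen candidates with a cheap
# surrogate, fully evaluate only the winner. Helpers are toy stand-ins.
import random

random.seed(0)
pools = {"P": ["p1", "p2"], "R": ["CoT", "ToT"], "T": ["t1"], "M": ["m1", "m2"]}

def true_eval(agent):            # expensive full rollout (mocked)
    return sum(len(m) for m in agent.values()) / 10.0

def surrogate(agent, experience):  # stand-in for pi_p's in-context prediction
    return true_eval(agent) + random.uniform(-0.05, 0.05)

def evolve(agent):               # stand-in for pi_xi: mutate one module
    key = random.choice(list(pools))
    return {**agent, key: random.choice(pools[key])}

def recombine(agent, experience):  # stand-in for pi_theta: swap a pooled module
    key = random.choice(list(pools))
    return {**agent, key: random.choice(pools[key])}

agent = {k: random.choice(v) for k, v in pools.items()}
experience = [(agent, true_eval(agent))]    # experience pool E
for episode in range(5):
    candidates = [evolve(agent), recombine(agent, experience)]
    best = max(candidates, key=lambda a: surrogate(a, experience))
    score = true_eval(best)      # only the screened winner is fully evaluated
    experience.append((best, score))
    if score >= true_eval(agent):
        agent = best             # greedy selection
print(agent)
```

The key structural point survives the mocking: full evaluations happen once per episode, while the surrogate absorbs the cost of screening the rest of the population.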
3.2 LLM Roles
- $\pi_\theta$ (proposer): Selects and replaces modules from the pool, informed by real-world performance history.
- $\pi_\xi$ (programmer): Mutates/generates module code, exploring beyond previously published designs.
- $\pi_p$ (predictor): Quickly estimates task performance to reduce expensive full-agent rollouts (about 0.025% of the cost of real evaluation in ALFWorld).
Empirically, the surrogate predictor's scores correlate strongly with real task reward (high Pearson correlation across benchmarks).
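One plausible way to realize the in-context predictor is to serialize past (agent, score) pairs into a prompt and ask the LLM to extrapolate a score for the new candidate. The prompt format and the `llm()` call mentioned in the closing comment are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of an in-context surrogate predictor prompt (pi_p-style):
# past (agent, score) pairs become few-shot examples; the LLM is asked
# to extrapolate a score for the unseen candidate. Format is assumed.
def build_predictor_prompt(candidate, experience, task="ALFWorld"):
    lines = [f"Task: {task}. Predict the score (0-1) of the new agent.",
             "Past agents and measured scores:"]
    for agent, score in experience:
        lines.append(f"- {agent}: {score:.2f}")
    lines.append(f"New agent: {candidate}")
    lines.append("Predicted score:")
    return "\n".join(lines)

experience = [(("TD", "CoT", "browser", "skill_memory"), 0.61),
              (("TD", "ToT", "browser", "none"), 0.54)]
prompt = build_predictor_prompt(("TD", "ToT", "browser", "skill_memory"),
                                experience)
print(prompt)
# The prompt would then go to an LLM, e.g. v = float(llm(prompt)),
# replacing a full-agent rollout with a single completion.
```

Because a completion is orders of magnitude cheaper than a rollout, this kind of screening is what makes the large per-episode candidate populations affordable.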
4. Empirical Evaluation: Benchmarks and Comparative Results
Comprehensive experiments span six benchmarks across four domains:
- Web: WebShop (e-commerce)
- Embodied: ALFWorld (navigation), ScienceWorld (simulated science tasks)
- Tool Use: TravelPlanner (external data search/planning), M3ToolEval (multi-turn tool selection)
- Game: Classical planning via PDDL tasks
Metrics include success rate, progress rate, task score, and micro-pass rate, as appropriate per environment.
Baseline methods:
- 12 prominent human-crafted agents (including Chain-of-Thought, Self-refine, ToT, Step-back, Voyager, HuggingGPT, Generative Agents, DEPS, OPENAGI, DiLu).
- Module-level search: random search and Bayesian optimization over $(P, R, T, M)$ tuples.
- Prompt-level search: OPRO (iterative prompt search).
Main results under GPT-4o (see Table 2 in (Shang et al., 2024)):
- An average performance gain of 17.2% over the best-known human-designed baselines.
- Individual task improvements include +26.1% on ALFWorld, +30.6% on M3Tool, +20.5% on ScienceWorld, +14.1% on WebShop.
- The search trajectory (Figure 1) is smooth and monotonic, unlike plateauing trends in random, Bayesian, or module-only search.
- Per-iteration cost and efficiency: the surrogate predictor $\pi_p$ enables cheap, high-throughput candidate screening (e.g., on ALFWorld).
- High-performing discovered modules persist in the shared module pools for reuse on future tasks (a "catalogue effect").

For example, most tasks converge within 10–20 iterations, and code-level innovations discovered during search are immediately transferable by virtue of the standardized module interface. This one-time search expense contrasts with the repeated per-task engineering of prior work.
7. Key Equations and Formalization Summary
The framework’s main equations and algorithmic operations:
- MoLAS objective (Eq. 1): $\underset{P \in \mathcal{P}, R \in \mathcal{R}, T \in \mathcal{T}, M \in \mathcal{M}}{\arg\max} \ \text{Eval}_d(P, R, T, M)$
- Module recombination (Eq. 2): $A_r = \pi_\theta((P_0, R_0, T_0, M_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$
- Module evolution (Eq. 3): $A_e = \pi_\xi((P'_0, R'_0, T'_0, M'_0), d, N, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$
- Performance prediction (Eq. 4): $v' = \pi_p(A', d, \mathcal{P}, \mathcal{R}, \mathcal{T}, \mathcal{M}, \mathbb{E})$
Algorithm 1 describes the detailed alternation, pool management, and experience updating.
In summary, AgentSquare operationalizes LLM agent design as a discrete, standardized, modular search problem, leveraging LLMs both as generative operators (module programming and recombination) and as performance surrogates, resulting in empirically superior and interpretable agent architectures for diverse reasoning and interaction environments (Shang et al., 2024).