- The paper introduces Maestro, a joint graph and configuration optimizer that tackles reliability by simultaneously tuning agent structures and parameters.
- It employs alternating configuration and graph update steps, integrating numeric and reflective feedback to efficiently refine agent modules.
- Empirical results on benchmarks reveal significant performance gains and sample efficiency, outperforming fixed-graph and traditional tuning methods.
Maestro: Joint Graph & Config Optimization for Reliable AI Agents
Motivation and Problem Statement
The paper addresses the reliability and robustness challenges in LLM-based agentic systems, emphasizing that agent quality depends on both the agent's computation graph (modules and information flow) and the configuration of each node (models, prompts, tools, hyperparameters). Existing optimization approaches predominantly focus on configuration tuning with a fixed graph, which leaves structural failure modes unaddressed. These include brittle pipelines, inadequate error handling, poor state management, and limited extensibility. The search space for reliable agents is high-dimensional, strongly coupled, and comprises mixed discrete and continuous variables, making joint optimization nontrivial.
Maestro: Holistic Agent Optimization
Maestro is introduced as a framework-agnostic optimizer that jointly searches over agent graphs and configurations. Agents are formalized as typed stochastic computation graphs, where nodes encapsulate capabilities (LLM calls, retrieval, tools, memory, validation) and edges define information/control flow. The search space supports non-sequential DAGs, persistent state, context gating, multi-model/multi-tool choices, and tunable hyperparameters.
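A typed computation graph of this kind can be sketched with plain Python dataclasses; the class and field names below are illustrative assumptions, not Maestro's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                    # e.g. "llm", "retrieval", "tool", "validator"
    config: dict = field(default_factory=dict)   # prompts, model choice, hyperparameters

@dataclass
class AgentGraph:
    nodes: dict = field(default_factory=dict)    # name -> Node
    edges: list = field(default_factory=list)    # (src, dst) pairs; any DAG, not just chains

    def add_node(self, node: Node):
        self.nodes[node.name] = node

    def add_edge(self, src: str, dst: str):
        self.edges.append((src, dst))

    def successors(self, name: str):
        return [dst for src, dst in self.edges if src == name]

# A minimal retrieve-then-answer pipeline
g = AgentGraph()
g.add_node(Node("retrieve", "retrieval", {"top_k": 5}))
g.add_node(Node("answer", "llm", {"model": "small-llm", "temperature": 0.2}))
g.add_edge("retrieve", "answer")
print(g.successors("retrieve"))  # ['answer']
```

Because edges are arbitrary pairs rather than a fixed chain, the same representation accommodates non-sequential DAGs, validators, and conditional routing nodes.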
Maestro alternates between two optimization steps:
- C-step (Configuration Update): Fixes the graph and optimizes node configurations (prompts, models, tools, hyperparameters) under rollout and cost budgets.
- G-step (Graph Update): Fixes the configuration and proposes local graph edits (add/remove/rewire nodes/edges, attach tools, change module types) within a trust region, leveraging observed utilities and reflective feedback.
Both numeric (scalar scores) and non-numeric (textual critiques, evaluator rubrics) feedback are ingested to guide targeted edits, improving sample efficiency and addressing failure modes that scalar-only objectives cannot correct.
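Schematically, the alternation looks like the loop below; the toy deterministic `evaluate` utility and the candidate/edit pools are stand-ins for Maestro's rollout-based scoring and its actual proposal mechanisms:

```python
def evaluate(graph, config):
    # Toy deterministic utility: rewards a validator module and a temperature near 0.3.
    base = 0.5 + (0.2 if "validator" in graph else 0.0)
    return base - abs(config["temperature"] - 0.3)

def c_step(graph, config, candidates):
    # C-step: graph fixed, keep the best configuration among sampled candidates.
    return max([config] + candidates, key=lambda c: evaluate(graph, c))

def g_step(graph, config, edits):
    # G-step: config fixed, keep the best local graph edit (or the current graph).
    return max([graph] + [edit(graph) for edit in edits],
               key=lambda g: evaluate(g, config))

def alternate(graph, config, candidates, edits, rounds=2):
    for _ in range(rounds):
        config = c_step(graph, config, candidates)
        graph = g_step(graph, config, edits)
    return graph, config

graph, config = ("retrieve", "answer"), {"temperature": 0.7}
candidates = [{"temperature": t} for t in (0.1, 0.3, 0.9)]
edits = [lambda g: g + ("validator",)]
print(alternate(graph, config, candidates, edits))
# (('retrieve', 'answer', 'validator'), {'temperature': 0.3})
```

The toy run shows the intended division of labor: the C-step finds the best temperature for the current graph, then the G-step accepts the validator edit because it raises utility under that configuration.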
Let G=(V,E) be the agent graph, with each node v parameterized by a configuration c_v and each edge e by adapter parameters α_e. The agent induces a randomized map Φ_{G,C} from inputs to outputs; write Y_O = Φ_{G,C}(x) for the output on input x. The joint optimization objective is:

$$\max_{G \in \mathcal{G},\; C \in \mathcal{C}} \; \mathbb{E}_{(x,\,m) \sim \mathcal{T}} \big[ \mu(Y_O, m) \big]$$

subject to resource, structure, and rollout budgets, where 𝒯 is the task distribution over inputs x paired with references m, and μ is the task metric scoring the output Y_O against m.
Decision variables are both discrete (graph structure, tool choices) and continuous (model weights, prompt embeddings, hyperparameters). The optimization is performed under explicit constraints on cost, structure complexity, and training rollouts.
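A mixed search space of this kind might be declared and sampled as follows; the variable names and the budget accounting are illustrative assumptions, not Maestro's actual interface:

```python
import random

# Illustrative mixed search space: discrete structure/tool choices plus a
# continuous hyperparameter, all searched under an explicit rollout budget.
search_space = {
    "structure":   {"type": "discrete",   "choices": ["chain", "dag_with_validator"]},
    "tool":        {"type": "discrete",   "choices": ["none", "calculator", "search"]},
    "temperature": {"type": "continuous", "low": 0.0, "high": 1.0},
}

def sample(space, rng):
    # Draw one candidate: choice() for discrete variables, uniform() for continuous.
    cand = {}
    for name, spec in space.items():
        if spec["type"] == "discrete":
            cand[name] = rng.choice(spec["choices"])
        else:
            cand[name] = rng.uniform(spec["low"], spec["high"])
    return cand

def within_budget(spent, trial_cost, budget):
    # Only evaluate a candidate if it fits in the remaining rollout budget.
    return spent + trial_cost <= budget

rng = random.Random(0)
cand = sample(search_space, rng)
print(sorted(cand))  # ['structure', 'temperature', 'tool']
```

The point of the sketch is the coupling: a discrete structure choice (e.g. adding a validator) changes which continuous knobs matter, which is why Maestro searches both jointly rather than in isolation.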
Comparison to Prior Work
- MIPROv2, GEPA, GRPO: These methods optimize only the configuration C for a fixed graph G, using sequential model-based optimization, evolutionary, or policy-gradient approaches. They do not address structural deficiencies or enable graph mutations.
- MAAS: Couples prompt and topology search but restricts the search space to predefined block types and sequential connectivity, lacking mechanisms for persistent state or flexible routing.
- Reflection-based controllers (Reflexion): Attach verbal self-critique to fixed architectures but do not perform global, cross-component optimization.
- Tooling frameworks (LangChain, Haystack, AutoGPT): Provide modular components but leave optimization to practitioners.
Maestro closes three key gaps: (i) incomplete design space coverage, (ii) rigid search spaces, and (iii) under-utilization of non-numeric feedback.
Empirical Results
HotpotQA
- Config-only optimization: Maestro achieves 70.33% evaluation score with 240 rollouts, outperforming GEPA (69.00% with 6,438 rollouts).
- Graph+config optimization: Maestro reaches 72.00% with 420 rollouts and 72.33% with 2,220 rollouts, a >3% gain over GEPA/MIPROv2 achieved with far fewer rollouts.
- Key graph edit: Introduction of an entity extraction module to provide concrete handles for second-hop query reformulation.
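The entity-extraction edit can be illustrated with a deliberately naive sketch: a regex stands in for a real extraction module, and the query reformulation simply appends new entity handles for the second hop. All names and the example question are invented for illustration:

```python
import re

def extract_entities(text):
    # Naive stand-in for an entity-extraction module: capitalized spans.
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)

def second_hop_query(question, first_hop_passage):
    # Reformulate the second-hop query around entities surfaced by the first hop,
    # keeping only entities not already present in the question.
    entities = [e for e in extract_entities(first_hop_passage)
                if e not in question and len(e) > 3]
    return f"{question} {' '.join(entities)}".strip() if entities else question

print(second_hop_query("In what year was the director of Titanic born?",
                       "Titanic was directed by James Cameron."))
# In what year was the director of Titanic born? James Cameron
```

The first-hop passage surfaces "James Cameron" as a concrete handle, which is exactly the kind of anchor the second-hop retrieval step otherwise lacks.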
IFBench
- Config-only optimization: Maestro achieves 56.12% with 700 rollouts, exceeding GEPA (52.72%) and GEPA+Merge (55.95%).
- Graph+config optimization: Maestro reaches 59.18% after 900 rollouts, a >3% gain over baselines that used more than 3,000 rollouts.
- Key graph edit: Addition of a constraint validator module, enabling iterative rewrites until all constraints are satisfied.
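Schematically, the validator edit turns answering into a generate-validate-rewrite loop. The constraints and the rewrite function below are toy stand-ins for an LLM-based generator and rewriter:

```python
def validate(response, constraints):
    # Return the subset of constraints that the response violates.
    return [c for c in constraints if not c(response)]

def answer_with_validator(generate, rewrite, constraints, max_rounds=3):
    # Generate an answer, then iteratively rewrite it until all constraints pass
    # (or the round budget runs out).
    response = generate()
    for _ in range(max_rounds):
        violations = validate(response, constraints)
        if not violations:
            break
        response = rewrite(response, violations)
    return response

# Toy usage: the answer must be lowercase and under six words.
constraints = [
    lambda r: r == r.lower(),
    lambda r: len(r.split()) < 6,
]
out = answer_with_validator(
    generate=lambda: "THE Capital of France is Paris",
    rewrite=lambda r, v: "paris",
    constraints=constraints,
)
print(out)  # paris
```

Framing constraint satisfaction as an explicit loop is what lets the agent recover from a violating first draft instead of emitting it unchecked.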
Application Settings
- Interviewer Agent: The initial agent achieves a 2% completion rate. Maestro's config-only optimization boosts it to 66%; graph+config optimization lifts it further to 92%. Key graph edit: an explicit state variable tracking completed interview branches.
- RAG Agent: Initial agent scores 39.1%; Maestro config-only optimization reaches 58.9%; graph+config optimization achieves 80.4%. Key graph edit: addition of a numeric computation tool for statistical queries.
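The interviewer's state-variable edit amounts to tracking which branches have been covered so no branch is skipped or repeated. A minimal sketch, in which the branch names and the `ask` callback are invented for illustration:

```python
def interviewer_step(state, branches, ask):
    # Explicit state: the set of interview branches already completed.
    remaining = [b for b in branches if b not in state["completed"]]
    if not remaining:
        return state, None          # all branches covered -> interview complete
    topic = remaining[0]
    answer = ask(topic)             # stand-in for an LLM question/answer turn
    state["completed"].add(topic)
    return state, answer

state = {"completed": set()}
branches = ["background", "skills", "availability"]
while True:
    state, answer = interviewer_step(state, branches,
                                     ask=lambda t: f"answer about {t}")
    if answer is None:
        break
print(sorted(state["completed"]))  # ['availability', 'background', 'skills']
```

Making completion explicit in state, rather than implicit in the prompt history, is what allows the termination check to be exact.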
Implementation Considerations
- Framework Agnosticism: Maestro operates independently of agent implementation frameworks, supporting integration with DSPy, OpenAI Agents SDK, LangChain, etc.
- Reflective Feedback: Textual feedback from execution traces is distilled into actionable graph/config edits, enabling sample-efficient search and targeted failure mode correction.
- Budgeted Search: Explicit rollout/token budgets are enforced, with monotonic improvement criteria and trust-region graph edits to balance exploration and exploitation.
- Scalability: Maestro demonstrates strong sample efficiency, achieving superior results with orders-of-magnitude fewer rollouts than prior methods.
- Resource Requirements: The approach is compatible with both small and large LLMs, and can optimize over discrete/continuous configuration spaces and complex graph topologies.
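The budgeted, monotonic acceptance rule described above can be sketched as a simple hill-climb; `propose` stands in for a trust-region edit proposal and `evaluate` for rollout-based scoring:

```python
def budgeted_search(initial, propose, evaluate, rollout_budget):
    # Accept a candidate only if it improves on the incumbent (monotonic criterion),
    # and stop once the rollout budget is exhausted.
    best, best_u = initial, evaluate(initial)
    spent = 1                         # the initial evaluation costs one rollout
    while spent < rollout_budget:
        cand = propose(best)          # trust region: propose near the incumbent
        u = evaluate(cand)
        spent += 1
        if u > best_u:                # monotonic improvement: never regress
            best, best_u = cand, u
    return best, best_u, spent

# Toy usage: hill-climb toward x = 3 under a 20-rollout budget.
best, util, used = budgeted_search(
    initial=0.0,
    propose=lambda x: x + 0.5,
    evaluate=lambda x: 9.0 - (x - 3.0) ** 2,
    rollout_budget=20,
)
print(best, util, used)  # 3.0 9.0 20
```

Proposing only near the incumbent bounds how far a single edit can degrade the agent, while the improvement check guarantees the reported score never regresses during search.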
Implications and Future Directions
The results substantiate that joint graph and configuration optimization is essential for robust, efficient, and reliable AI agents. Structural changes (validators, memory, conditional routing) eliminate entire classes of errors, while configuration tuning refines execution. Monolithic, prompt-centric agents are unlikely to generalize or recover from failure modes. Maestro provides a disciplined, budget-aware path to task-specific, controllable, and cost-efficient agents.
Future work may extend Maestro to:
- Multi-agent systems with inter-agent communication and coordination.
- Automated discovery of reusable agent subgraphs and modules.
- Integration with online learning and continual adaptation.
- Richer forms of reflective feedback, including human-in-the-loop critique and adversarial testing.
Conclusion
Maestro operationalizes holistic agent optimization by jointly searching over graph and configuration spaces, leveraging both numeric and reflective feedback. Empirical results across benchmarks and applications demonstrate consistent, substantial gains in reliability and efficiency, with strong sample efficiency. The framework provides a practical blueprint for building robust, controllable, and cost-aware AI agents, supporting the thesis that holistic graph+config optimization is necessary for reliable agentic AI.