AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol

Published 14 Jun 2025 in cs.AI | (2506.12508v4)

Abstract: Recent advances in LLM-based agent systems have demonstrated remarkable capabilities in solving complex tasks. Nevertheless, current protocols (e.g., A2A and MCP) suffer from insufficient capabilities in context management, limited adaptability to diverse environments, and the absence of dynamic agent architectures. To address these limitations, we propose the Tool-Environment-Agent (TEA) Protocol, which establishes a principled basis for integrating environments, agents, and tools into a unified system. The TEA protocol treats environments and agents as first-class resources, enabling comprehensive context management and adaptive environment integration. Based on this protocol, we introduce AgentOrchestra, a hierarchical multi-agent framework with a central planning agent that decomposes complex objectives and coordinates specialized agents. Each sub-agent is dedicated to specific functions, providing capabilities for data analysis, file operations, web navigation, and interactive reasoning. Notably, AgentOrchestra introduces a tool manager agent that supports intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms. Experiments on three widely used benchmarks show that AgentOrchestra consistently outperforms existing baselines, achieving state-of-the-art performance of 83.39% on GAIA and ranking among the top general-purpose LLM-based agents. These results highlight the effectiveness of the TEA Protocol and hierarchical organization in building general-purpose multi-agent systems.

Summary

  • The paper introduces the TEA Protocol, integrating tools, environments, and agents into a cohesive, scalable multi-agent framework.
  • Empirical results on benchmarks like GAIA demonstrate superior performance, achieving an accuracy of 83.39% through dynamic agent coordination.
  • The framework enables modular resource management and lets tools, environments, and agents take on different roles through protocol transformations, supporting a wide range of complex real-world tasks.

AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the TEA Protocol

The paper "AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol" (2506.12508) targets three limitations of existing LLM-based agent systems and their protocols: weak context management, limited adaptability to diverse environments, and the absence of dynamic agent architectures. Its Tool-Environment-Agent (TEA) Protocol integrates environments, agents, and tools into a single framework, intended as a foundation for scalable multi-agent systems that solve complex tasks across diverse domains.

Hierarchical Multi-Agent Framework

The proposed TEA Protocol establishes principles for context management, adaptive environment integration, and dynamic agent architectures, targeting the constraints of existing protocols such as A2A and MCP. On top of this protocol, the authors build a hierarchical multi-agent framework in which a central planning agent decomposes complex objectives and coordinates a team of specialized sub-agents. These sub-agents handle specific functions such as data analysis, file operations, web navigation, and interactive reasoning (Figure 1).

Figure 1: Planning Agent Workflow.

The central architecture facilitates modular collaboration among agents while allowing flexible composition and adaptation. Each sub-agent benefits from the TEA Protocol's ability to treat environments and agents as first-class resources, enabling efficient coordination and resource management across the multi-agent system.
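The planner-and-specialists pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SubAgent` and `PlanningAgent` classes and the fixed three-step plan are hypothetical stand-ins for the LLM-driven decomposition and routing the paper describes; only the sub-agent roles are taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SubAgent:
    """A specialized sub-agent (hypothetical interface): a role name plus a handler."""
    name: str
    handle: Callable[[str], str]


@dataclass
class PlanningAgent:
    """Central orchestrator: decomposes a task and routes each step to a sub-agent."""
    agents: Dict[str, SubAgent]
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def decompose(self, task: str) -> List[Tuple[str, str]]:
        # Hypothetical fixed plan; in the paper an LLM produces and revises the plan.
        return [
            ("deep_researcher", f"find sources for: {task}"),
            ("browser_use", f"extract details for: {task}"),
            ("deep_analyzer", f"synthesize answer for: {task}"),
        ]

    def run(self, task: str) -> str:
        result = ""
        for agent_name, subtask in self.decompose(task):
            result = self.agents[agent_name].handle(subtask)
            # Record progress so the planner (or a replanning step) can inspect it.
            self.history.append((agent_name, subtask, result))
        return result  # here the last sub-agent's output is treated as the final answer


# Usage with echo handlers standing in for real sub-agents.
agents = {
    name: SubAgent(name, lambda s, n=name: f"[{n}] {s}")
    for name in ("deep_researcher", "browser_use", "deep_analyzer")
}
planner = PlanningAgent(agents)
print(planner.run("GAIA question"))  # [deep_analyzer] synthesize answer for: GAIA question
```

Keeping the plan separate from execution mirrors the division of labor in the paper: the planner only decides who does what, while each specialist owns its own tools.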

TEA Protocol: Core Components and Transformations

The TEA Protocol comprises three main components, the Tool Context Protocol (TCP), the Environment Context Protocol (ECP), and the Agent Context Protocol (ACP), each managing the context of one resource type through a modular interface. The protocol additionally defines Protocol Transformations between these entity types: agent-to-tool, tool-to-agent, environment-to-tool, tool-to-environment, agent-to-environment, and environment-to-agent. These transformations support dynamic resource orchestration and let computational entities adapt their functional scope as task demands evolve (Figure 2).

Figure 2: Architecture of the TEA Protocol.

By giving each resource type a consistent interface, the TEA Protocol aims to make agents interoperable across task environments, so that they can operate in new computational domains without environment-specific redesign.

Empirical Validation and Performance

The hierarchical multi-agent framework built on the TEA Protocol is evaluated on GAIA (Figure 3), SimpleQA, and HLE. AgentOrchestra consistently outperforms the baselines and achieves state-of-the-art results on GAIA, with an overall accuracy of 83.39%.

Figure 3: GAIA Test Results.

These results rely on the system's ability to coordinate resource allocation, dynamically manage sub-agent execution, and integrate purpose-built tools.

Limitations and Future Directions

Despite the framework's scalability and general-purpose design, limitations remain: agents cannot yet be reassigned to new roles dynamically at runtime, and fine-grained multimodal tasks remain challenging. The authors plan to explore agent self-evolution through the optimization of prompts, tools, and organizational structures, which could broaden the framework's applicability.

Conclusion

The "AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the TEA Protocol" paper lays a foundation for developing adaptable AI agents. Through the TEA Protocol and its hierarchical architecture, the work outlines a scalable path toward versatile multi-agent systems that can handle complex, real-world tasks across diverse domains.

Explain it Like I'm 14

Overview: What is this paper about?

This paper introduces a new way to build smart AI teams that can work together to solve complex tasks. The authors present the TEA Protocol (short for Tool–Environment–Agent) and a system called AgentOrchestra. Think of it like an orchestra: the TEA Protocol is the music sheet and rules, and AgentOrchestra is the band with a conductor and different musicians (specialized AI agents). Together, they help AI handle a wide variety of real-world jobs—like researching online, analyzing data, and using computer tools—more reliably and efficiently.

Objectives: What were the researchers trying to do?

The researchers wanted to answer a few simple questions:

  • How can we make AI agents work better across many different situations (websites, files, apps) without redesigning everything each time?
  • Can we treat tools (like a calculator), environments (like a web browser), and agents (like a researcher) in a unified way so they can easily share information and cooperate?
  • Does organizing agents in a hierarchy—with a planner that assigns tasks to specialists—actually improve performance on tough tests?

Methods: How did they approach the problem?

They designed two main things:

1) The TEA Protocol

A “protocol” is just a set of rules for how things interact. The TEA Protocol treats three types of resources as equals:

  • Tools: single functions with clear inputs and outputs (like “search the web,” “read a PDF,” or “run code”).
  • Environments: places where actions happen (like a browser or a computer screen) with rules and actions (click, type, open tab).
  • Agents: decision-makers that plan, reason, and use tools or environments to reach goals.

The TEA Protocol has three “context” parts—each handles one resource type:

  • Tool Context Protocol (TCP): registers tools, describes their parameters clearly, and keeps track of how tools relate to each other.
  • Environment Context Protocol (ECP): gives environments a consistent interface so agents can interact with any browser or computer in the same way.
  • Agent Context Protocol (ACP): standardizes how agents are described (their roles, skills, goals) and how they collaborate.
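The paper describes TCP as storing each tool with an embedding and retrieving candidates by query–embedding similarity. A minimal sketch of that idea, assuming a toy bag-of-words vector in place of a real embedding model; `ToolRegistry`, `embed`, `cosine`, and the three example tools are hypothetical, not the paper's API.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class ToolEntry:
    name: str
    description: str
    params: Dict[str, str]  # parameter name -> type description
    fn: Callable[..., object]
    embedding: Counter


class ToolRegistry:
    """Hypothetical TCP-style registry: register tools with metadata, retrieve by similarity."""

    def __init__(self) -> None:
        self.tools: Dict[str, ToolEntry] = {}

    def register(self, name: str, description: str, params: Dict[str, str], fn: Callable[..., object]) -> None:
        self.tools[name] = ToolEntry(name, description, params, fn, embed(description))

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self.tools.values(), key=lambda t: cosine(q, t.embedding), reverse=True)
        return [t.name for t in ranked[:k]]


registry = ToolRegistry()
registry.register("web_search", "search the web for pages matching a query", {"query": "str"}, lambda query: [])
registry.register("read_pdf", "read text from a pdf file", {"path": "str"}, lambda path: "")
registry.register("run_code", "run python code and return output", {"code": "str"}, lambda code: "")
print(registry.retrieve("read the pdf file"))  # ['read_pdf']
```

Storing the parameter schema next to the embedding means a retrieved tool arrives with everything an agent needs to call it.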

It also defines six “transformations,” which let something play a different role when needed. For example:

  • Agent-to-Tool (A2T): wrap an agent’s behavior into a simple tool call.
  • Tool-to-Agent (T2A): let tools serve as an agent’s “hands,” turning the agent’s goals into specific tool calls with the right parameters.
  • Environment-to-Tool (E2T): turn environment actions into tool-like functions (e.g., “Click,” “Navigate”).
  • Tool-to-Environment (T2E): group related tools into a full environment (e.g., coding tools become a programming workspace).
  • Agent-to-Environment (A2E): expose an agent’s behavior so others can interact with it like a simulated world.
  • Environment-to-Agent (E2A): make an environment act more like an intelligent agent that can make decisions.
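Three of these transformations (E2T, T2E, and A2T) can be sketched as plain function wrappers. This is a minimal sketch under stated assumptions: the `Environment` class and the `env_to_tools`, `tools_to_env`, and `agent_to_tool` helpers are hypothetical, and the paper's formal transformations carry richer metadata and context binding than shown here.

```python
import functools
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Tool = Callable[..., str]


@dataclass
class Environment:
    """Minimal environment: mutable state plus a named action space."""
    state: List[str] = field(default_factory=list)
    actions: Dict[str, Tool] = field(default_factory=dict)

    def step(self, action: str, **kwargs: str) -> str:
        return self.actions[action](**kwargs)


def env_to_tools(env: Environment) -> Dict[str, Tool]:
    """E2T: expose each environment action as a standalone tool call."""
    return {name: functools.partial(env.step, name) for name in env.actions}


def tools_to_env(tools: Dict[str, Tool]) -> Environment:
    """T2E: treat a set of tools as the action space of a new environment."""
    return Environment(actions=dict(tools))


def agent_to_tool(agent_run: Callable[[str], str], name: str) -> Tool:
    """A2T: hide an agent's entry point behind a plain, named tool interface."""
    def tool(task: str) -> str:
        return agent_run(task)
    tool.__name__ = name
    return tool


# E2T: a toy browser whose "navigate" action becomes a tool.
browser = Environment()

def navigate(url: str) -> str:
    browser.state.append(url)
    return f"opened {url}"

browser.actions["navigate"] = navigate
browser_tools = env_to_tools(browser)
print(browser_tools["navigate"](url="example.org"))  # opened example.org

# T2E: coding tools grouped into a workspace environment.
workspace = tools_to_env({"run_code": lambda code: f"ran {code}"})
print(workspace.step("run_code", code="1+1"))  # ran 1+1

# A2T: a (stubbed) research agent exposed as a tool.
researcher_tool = agent_to_tool(lambda task: f"report on {task}", "deep_researcher")
print(researcher_tool("GAIA"))  # report on GAIA
```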

2) AgentOrchestra (the AI team)

AgentOrchestra is a hierarchical multi-agent system with a central planner and specialized helpers. The planner breaks big jobs into smaller steps and delegates them to the right specialists. The main team members are:

  • Planning Agent (the conductor): understands the user’s goal, creates a step-by-step plan, assigns tasks, tracks progress, and decides when the job is done.
  • Deep Researcher Agent (the scout): creates smart search queries, scans the web for useful sources, and builds a research summary with citations.
  • Browser Use Agent (the hands): controls the web browser precisely—navigates pages, clicks buttons, fills forms, handles PDFs and videos, and even does pixel-level actions when needed.
  • Deep Analyzer Agent (the thinker): reads and analyzes text, code, images, audio, and video; performs multi-step reasoning and writes clear reports.
  • Tool Manager Agent (the mechanic): finds the right tools, builds new ones when needed, tests them, and reuses them to make future tasks faster.
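The Tool Manager's find, build, test, and reuse cycle can be sketched as a retrieve-or-create loop. This is a minimal sketch, assuming exact-name lookup in place of the paper's similarity-based retrieval and a hypothetical `fake_builder` in place of an LLM that writes new tools; `ToolManager` and its counters are illustrative names only.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[int], int]


class ToolManager:
    """Hypothetical retrieve-or-create loop; `build` stands in for LLM tool generation."""

    def __init__(self, build: Callable[[str], Tool]) -> None:
        self.build = build
        self.library: Dict[str, Tool] = {}
        self.created = 0
        self.reused = 0

    def get(self, need: str, tests: List[Tuple[int, int]]) -> Tool:
        if need in self.library:  # retrieval: reuse a tool built earlier
            self.reused += 1
            return self.library[need]
        tool = self.build(need)  # creation: synthesize a new tool
        if not all(tool(x) == y for x, y in tests):  # validate before registering
            raise ValueError(f"generated tool for {need!r} failed its tests")
        self.library[need] = tool
        self.created += 1
        return tool


def fake_builder(need: str) -> Tool:
    # Stand-in for an LLM writing code; it only knows two recipes.
    recipes: Dict[str, Tool] = {"double": lambda x: 2 * x, "square": lambda x: x * x}
    return recipes[need]


manager = ToolManager(fake_builder)
manager.get("double", tests=[(3, 6)])
manager.get("square", tests=[(4, 16)])
manager.get("double", tests=[(3, 6)])
print(manager.created, manager.reused)  # 2 1
```

Testing before registration keeps a broken generated tool out of the library, so later reuse only ever returns tools that passed their checks.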

To test their system, the authors ran it on well-known benchmarks:

  • GAIA: real-world problems that often require web use and reasoning.
  • SimpleQA: thousands of short factual questions.
  • HLE (Humanity’s Last Exam): a tough, multimodal reasoning test.

“Pass@1” means the system’s first answer must be correct to count.
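Pass@1 reduces to a simple ratio: the number of questions whose first answer is correct, divided by the total number of questions. The sketch below assumes a lenient exact-match check (trimmed, case-insensitive); benchmarks such as GAIA define their own answer-normalization rules, which this does not reproduce.

```python
from typing import List


def pass_at_1(predictions: List[str], answers: List[str]) -> float:
    """Fraction of questions whose single (first) prediction matches the reference answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must align one-to-one")
    if not answers:
        raise ValueError("need at least one question")
    # Assumed normalization: strip whitespace and ignore case.
    correct = sum(p.strip().lower() == a.strip().lower() for p, a in zip(predictions, answers))
    return correct / len(answers)


print(pass_at_1(["Paris", "42", "blue"], ["paris", "41", "Blue"]))  # 0.6666666666666666
```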

Findings: What did they discover?

The results show the approach works very well:

  • On GAIA, AgentOrchestra achieved 83.39% accuracy overall, a state-of-the-art score. It did especially well on easier and medium questions (92.45% and 83.72%), and performed strongly even on hard ones (57.69%). Adding the Tool Manager gave a notable boost.
  • On SimpleQA, it reached 95.3% accuracy—far better than single models without tools—by checking multiple sources and verifying facts to avoid mistakes.
  • On HLE, it scored 25.9%, beating several strong baselines.

They also ran “ablation” tests (turning parts on and off) to see what mattered most. Key takeaways:

  • Using both the Deep Researcher (coarse search) and Browser Use Agent (fine interaction) almost doubled performance compared to the planner alone.
  • The Deep Analyzer added further improvements for complex reasoning.
  • The Tool Manager created more than 50 new tools during testing and reused about 30% of them, showing it can grow a useful tool library over time.

Why this is important: It shows that carefully coordinated teams of AI agents—guided by a clear protocol—can solve more complex, real-world tasks than single models or loosely connected tools.

Implications: Why does this matter and what’s next?

This research suggests a practical path toward more general, dependable AI assistants:

  • A unified protocol (TEA) makes it easier to plug in new tools and environments without messy custom code for each case.
  • A hierarchical team (AgentOrchestra) lets different specialists handle what they’re best at, while the planner keeps everything organized.
  • Dynamic transformations mean the system can adapt: an agent can become a tool if needed, or a bunch of tools can act like a full environment.

Limitations still exist: the system doesn’t yet automatically change agent roles during a task or “self-evolve” its prompts and structures on the fly. It also finds some fine-grained vision and video tasks tricky. The authors plan to add self-improvement features and new specialized agents (like advanced visualization) to handle even more complex jobs.

In short, this work lays a strong foundation for building flexible, trustworthy AI teams that can tackle a wide range of everyday and expert tasks by coordinating planning, web interaction, deep analysis, and smart tool management.

Glossary

  • Action space: The set of actions available to an agent within an environment. "then incorporates the entire action space into a toolkit"
  • Agent Communication Protocol (ACP): A protocol for messaging and coordination between agents in prior literature. "the Agent Communication Protocol (ACP)~\citep{ehtesham2025survey}"
  • Agent Context Protocol (ACP): The paper’s schema for registering, representing, and orchestrating agents with states, metadata, and interactions. "we propose the Agent Context Protocol (ACP)."
  • Agent Network Protocol (ANP): A protocol aimed at enhancing interoperability and discovery in multi-agent systems. "the Agent Network Protocol (ANP)~\citep{ehtesham2025survey}"
  • Agent-to-Environment (A2E): A transformation that exposes an agent as an environment with observable decision dynamics. "Agent-to-Environment (A2E). Encapsulates an existing agent as an interactive environment, exposing its decision rules and behavioral dynamics for other agents to explore, learn, or be evaluated."
  • Agent-to-Tool (A2T): A transformation that packages an agent’s capabilities behind a standardized tool interface. "Agent-to-Tool (A2T). Encapsulates an agent's capabilities and reasoning into a standardized tool interface, enabling seamless integration with existing tool ecosystems."
  • Agent2Agent (A2A): A protocol for direct communication between agents. "A2A protocol~\citep{google2024a2a} enables agent-to-agent messaging and coordination."
  • Browser use agent: A specialized agent for precise, automated web interaction and extraction. "A browser use agent enables fine-grained interaction with web content, directly engaging with videos, pdfs, and html elements to extract precise information."
  • Context binder: A registry component that binds and maintains contextual information across entities in the protocol. "where $\Sigma$ is a metadata/relations registry, $\mathcal{C}$ a context binder, and $\mathcal{P}$ is the family of cross-domain transformations."
  • Cross-domain transformations: Formal mappings enabling entities (tools, environments, agents) to assume alternative roles. "where $\Sigma$ is a metadata/relations registry, $\mathcal{C}$ a context binder, and $\mathcal{P}$ is the family of cross-domain transformations."
  • Deep analyzer agent: A workflow-oriented agent for multi-step, multimodal analytical reasoning. "A deep analyzer agent performs advanced reasoning and integrative analysis"
  • Deep researcher agent: An agent for large-scale information retrieval via query generation and iterative exploration. "A deep researcher agent conducts large-scale information retrieval by efficiently scanning and filtering web pages to identify promising sources."
  • DOM-level control: Interaction with web page elements via the Document Object Model rather than pixel-level manipulation. "cannot be effectively handled through DOM-level control alone"
  • Environment Context Protocol (ECP): A unified protocol for standardizing inputs, outputs, and rules across heterogeneous environments. "we introduce the Environment Context Protocol (ECP)"
  • Environment encapsulation: The method of wrapping or interfacing environments for agent interaction. "due to the dramatic differences in environment encapsulation methods"
  • Environment-to-Agent (E2A): A transformation that turns an environment into an autonomous agent. "Environment-to-Agent (E2A). Infuses reasoning and adaptive decision-making into an environment’s state dynamics, transforming it into an autonomous agent capable of pursuing goals and interacting strategically."
  • Environment-to-Tool (E2T): A transformation that exposes environment actions as standardized tool calls. "Environment-to-Tool (E2T). Converts environment-specific actions into standardized interfaces, allowing agents to interact via consistent tool calls."
  • Function Calling: A standardized interface for LLMs to invoke tools with structured parameters. "Standardized tool interfaces, such as OpenAI's Function Calling and Anthropic's MCP, have further streamlined tool integration"
  • GAIA: A benchmark assessing real-world reasoning, multimodal processing, and tool use. "achieving state-of-the-art performance of 83.39% on GAIA"
  • Hierarchical multi-agent framework: An architecture where a planning agent coordinates specialized sub-agents for complex tasks. "a hierarchical multi-agent framework for general-purpose task solving"
  • Humanity’s Last Exam (HLE): A multimodal benchmark targeting human-level reasoning and general intelligence. "HLE. Our system achieves 25.9% on the HLE benchmark"
  • Memory system: A component for persistent storage and retrieval of context across sessions. "an integrated memory system for persistent contextual storage and knowledge management across sessions."
  • Metadata/relations registry: A registry of entity metadata and their relationships within the protocol. "where $\Sigma$ is a metadata/relations registry, $\mathcal{C}$ a context binder, and $\mathcal{P}$ is the family of cross-domain transformations."
  • Model Context Protocol (MCP): A widely adopted tool protocol defining tool, prompt, and resource abstractions. "MCP~\citep{anthropic2024mcp} is the most widely adopted tool protocol, defined by three components: tools, prompts, and resources."
  • Multimodal research workflow: A research process integrating both text and visual inputs iteratively. "implemented as a multi-round, multimodal research workflow."
  • Observation: The agent’s aggregated view of task descriptions, history, environment state, and tool availability. "An observation captures task descriptions, execution histories, environment states, and tool availability"
  • Observation–action spaces: The formal pairing of what an agent perceives (observations) and what it can do (actions). "observation-action spaces largely rely on manual design"
  • Pass@1: An evaluation metric measuring the fraction of fully correct top predictions. "We report score (pass@1), which measures the proportion of questions for which the top prediction is fully correct."
  • Planning agent: The central orchestrator for task decomposition, routing, and adaptive planning. "The planning agent serves as the central orchestrator in our hierarchical framework, dedicated to high-level reasoning, task decomposition, and adaptive planning."
  • Protocol Transformations: The component defining interconversions among TCP/ECP/ACP for dynamic orchestration. "Protocol Transformations that define the interconversion relationships between TCP, ECP, and ACP"
  • Query–embedding similarity: A retrieval method ranking tools by similarity between a query and tool embeddings. "TCP stores each tool with an embedding and uses query–embedding similarity for candidate retrieval"
  • ReAct-based agent: An agent following the ReAct paradigm, which interleaves reasoning steps with tool-calling actions. "The planning agent is implemented as a React-based~\citep{yao2023react} tool-calling agent"
  • Sandbox: An isolated execution environment to safely run actions and record results. "The action is executed in a sandbox, with results recorded back to memory"
  • State-of-the-art (SOTA): The best-known performance level at the time of writing. "GAIA. Our [system] achieves SOTA results with 83.39% overall accuracy"
  • Tool Context Protocol (TCP): A protocol extending MCP for richer tool registration, relationships, and context. "we propose the Tool Context Protocol (TCP), which extends MCP by supporting local and remote tool loading, detailed tool registration, and the novel ability to register agents as tools"
  • Tool manager agent: A specialized agent that creates, retrieves, and reuses tools to evolve the system’s capabilities. "introduces a tool manager agent that supports intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms."
  • Tool-to-Agent (T2A): A transformation enabling tools to act as agent actuators for goal-driven invocations. "Tool-to-Agent (T2A). Designates tools as an agent’s actuators, translating goals into parameterized invocations."
  • Tool-to-Environment (T2E): A transformation turning a set of tools into an environment abstraction with a unified action space. "Tool-to-Environment (T2E). Elevates a tool set into an environment abstraction, treating individual functions as actions within a unified action space."
  • Unified interface: An abstraction layer that standardizes interactions with diverse LLMs. "a unified interface for diverse LLMs (e.g., gpt-5) that abstracts model heterogeneity"
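Several glossary entries (observation, ReAct-based agent, sandbox, memory system) describe parts of one execution loop. A minimal sketch of how they might fit together, assuming a scripted policy in place of the LLM and a try/except wrapper in place of a real sandbox; `Observation`, `react_loop`, `run_sandboxed`, and the `final_answer` sentinel are hypothetical names, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Observation:
    """What the agent sees each step: task, history, environment state, available tools."""
    task: str
    history: List[Tuple[str, str]]
    env_state: str
    tools: List[str]


Policy = Callable[[Observation], Tuple[str, str]]  # returns (tool_name, argument)


def run_sandboxed(fn: Callable[[str], str], arg: str) -> str:
    """Stand-in for a sandbox: only catches exceptions so a failing action cannot crash the loop."""
    try:
        return fn(arg)
    except Exception as exc:
        return f"error: {exc}"


def react_loop(task: str, policy: Policy, tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Optional[str]:
    memory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        obs = Observation(task, list(memory), env_state=f"step {len(memory)}", tools=sorted(tools))
        tool_name, arg = policy(obs)  # reason: choose the next action
        if tool_name == "final_answer":
            return arg
        result = run_sandboxed(tools[tool_name], arg)  # act
        memory.append((f"{tool_name}({arg})", result))  # record the result back to memory
    return None  # step budget exhausted without a final answer


def scripted_policy(obs: Observation) -> Tuple[str, str]:
    # Stand-in for an LLM: call one tool, then answer with its result.
    if not obs.history:
        return ("word_count", obs.task)
    return ("final_answer", obs.history[-1][1])


tools = {"word_count": lambda text: str(len(text.split()))}
print(react_loop("how many words are here", scripted_policy, tools))  # 5
```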
