- The paper introduces an orchestrator agent that replaces static workflows with adaptive, natural-language communication for managing complex tasks.
- It evaluates the approach on the GAIA benchmark, reporting an accuracy gain of up to 8.49 percentage points over the rule-based OWL baseline in the heterogeneous, fault-tolerant setting.
- The study demonstrates enhanced scalability and edge-case handling through emergent coordination patterns and real-time semantic auditing.
Motivation and Limitations of Rule-Based Workflows
Prevalent LLM-based Multi-Agent Systems (MAS) for complex, open-domain tasks typically rely on rule-based, workflow-driven architectures in which human engineers predefine discrete task states and hard-code the routing and context-injection logic. OWL and MetaGPT are representative examples. This paradigm is effective for tractable, well-scoped scenarios but suffers from two fundamental bottlenecks: (a) the combinatorial explosion of possible task states in realistic, dynamic environments makes exhaustive state enumeration infeasible, and (b) manually encoding handlers for diverse edge cases quickly becomes unmanageable and incomplete.
The canonical workflow-driven OWL system is structured as a decision tree (Figure 1) in which each branch pre-specifies not only subtasking and agent routing but also success/failure criteria. When agents return ambiguous or incomplete outputs (for example, missing subfields for some entries in an extraction task), the decision logic, which depends on brittle, coarse-grained state labels, often fails to recognize partial fulfillment. The corrupted state then propagates downstream and degrades the final task outcome.
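The partial-fulfillment failure described above can be illustrated with a minimal sketch. The function names, the record schema, and the sample data are hypothetical and are not taken from OWL; the point is only that a coarse success label hides gaps that a field-level check would surface.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]
REQUIRED_FIELDS = ("name", "year", "venue")  # hypothetical extraction schema


def coarse_label(records: List[Record]) -> str:
    """Coarse rule-based check: any non-empty result counts as success."""
    return "SUCCESS" if records else "FAILURE"


def missing_fields(records: List[Record]) -> Dict[int, List[str]]:
    """Finer-grained check: report which required fields each entry lacks."""
    gaps: Dict[int, List[str]] = {}
    for index, record in enumerate(records):
        absent = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if absent:
            gaps[index] = absent
    return gaps


# Illustrative agent output: the second entry is only partially filled in.
extracted = [
    {"name": "Paper A", "year": "2021", "venue": "ACL"},
    {"name": "Paper B", "year": None, "venue": ""},
]

print(coarse_label(extracted))    # SUCCESS
print(missing_fields(extracted))  # {1: ['year', 'venue']}
```

Under the coarse label the workflow would advance to the next branch with an incomplete record, whereas the field-level report gives a supervisor something concrete to follow up on.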
Figure 1: Decision-making tree representation of the OWL architecture.
The proposed paradigm removes the dependence on static, human-crafted workflows by introducing a dedicated information flow orchestrator agent. This agent continuously monitors task progress and coordinates adaptively, in natural language, with all other agents through an Agent-to-Agent (A2A) communication toolkit built on CORAL. The orchestrator controls global task state through explicit message passing, using primitives for (i) issuing queries and instructions, and (ii) asynchronously receiving agent responses. All inter-agent communication flows through the orchestrator, enforcing partially asymmetric topologies that maintain centralized interpretability while enabling decentralized execution.
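The two primitives and the hub-style routing can be sketched as follows. This is a minimal illustration using Python's `asyncio`, not the CORAL toolkit's actual API: the `Orchestrator` class, its `send` and `receive` methods, and the `echo_agent` stand-in are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass
class Message:
    sender: str
    recipient: str
    content: str


AgentFn = Callable[[Message], Awaitable[str]]


class Orchestrator:
    """Hub of the topology: workers never message each other directly."""

    def __init__(self, agents: Dict[str, AgentFn]) -> None:
        self.agents = agents
        self.inbox: asyncio.Queue = asyncio.Queue()
        self.history: List[Message] = []

    async def _deliver(self, msg: Message) -> None:
        # Run the recipient agent and queue its reply for the orchestrator.
        reply = await self.agents[msg.recipient](msg)
        await self.inbox.put(Message(msg.recipient, "orchestrator", reply))

    def send(self, recipient: str, content: str) -> "asyncio.Task[None]":
        """Primitive (i): issue a query or instruction without blocking."""
        msg = Message("orchestrator", recipient, content)
        self.history.append(msg)
        return asyncio.create_task(self._deliver(msg))

    async def receive(self) -> Message:
        """Primitive (ii): wait for the next agent response, in arrival order."""
        msg = await self.inbox.get()
        self.history.append(msg)
        return msg


async def echo_agent(msg: Message) -> str:
    await asyncio.sleep(0)  # stand-in for an LLM or tool call
    return f"done: {msg.content}"


async def demo() -> List[Message]:
    orch = Orchestrator({"web": echo_agent, "coder": echo_agent})
    tasks = [orch.send("web", "find the source"), orch.send("coder", "parse the table")]
    replies = [await orch.receive() for _ in tasks]
    await asyncio.gather(*tasks)
    return replies


replies = asyncio.run(demo())
print(sorted(reply.sender for reply in replies))  # ['coder', 'web']
```

Because every message passes through the orchestrator's `history`, the full exchange remains inspectable in one place even though the agents run concurrently.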
Figure 2: Overview of the proposed Information-Flow-Orchestrated Multi-Agent Paradigm via Agent-to-Agent (A2A) Communication.
Message generation, for both the orchestrator and the replying agents, conditions not only on static role prompts but also on the full communication history, the evolving tracked state, and the live task query. Rather than advancing along pre-specified routes, the orchestrator can dynamically (re-)allocate subtasks, refine instructions, request clarification, or escalate ambiguous results, tightening explicit success/failure criteria on the fly. This permits robust handling of partial results, semantic drift, tool invocation failures, and unanticipated contingencies, all expressed in natural language.
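One way to picture this conditioning is as prompt assembly from four inputs. The sketch below is an assumption about how such a prompt might be laid out; the `build_prompt` function, the section titles, and the sample values are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Tuple


def build_prompt(
    role_prompt: str,
    history: List[Tuple[str, str]],
    tracked_state: Dict[str, str],
    task_query: str,
) -> str:
    """Assemble the text an agent's LLM would see before writing a message."""
    history_lines = [f"[{sender}] {text}" for sender, text in history]
    state_lines = [f"{key}: {value}" for key, value in sorted(tracked_state.items())]
    sections = [
        ("ROLE", [role_prompt]),
        ("TASK", [task_query]),
        ("STATE", state_lines or ["(empty)"]),
        ("HISTORY", history_lines or ["(no messages yet)"]),
    ]
    return "\n\n".join(f"## {title}\n" + "\n".join(lines) for title, lines in sections)


prompt = build_prompt(
    role_prompt="You are the orchestrator. Route work and verify results.",
    history=[("orchestrator", "Extract all entries."), ("web", "Found 2 of 3 entries.")],
    tracked_state={"entries_found": "2", "entries_expected": "3"},
    task_query="List every entry in the table with its year.",
)
print(prompt)
```

Since the history and state sections change after every exchange, the next message can react to a partial result (here, two of three entries) without any branch having been written for that case in advance.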
Benchmark Evaluation and Numerical Results
Experiments are conducted on the GAIA benchmark, using 165 validation tasks spanning Levels 1-3 in generalist assistant domains that combine web search, multimodal reasoning, and tool use. The baseline is the open-source workflow-based OWL system, matched in both agent roles and LLM configurations to ensure experimental parity. The evaluation covers both homogeneous (all agents: Grok 4.1 Fast) and heterogeneous (main agents: Grok 4.1 Fast; worker agents: GPT 4.1 Mini) model deployments.
Key pass@1 accuracy results are as follows:
- Homogeneous configuration: Both the information-flow-orchestrated paradigm and workflow-based OWL achieve 64.24% overall accuracy. The proposed system exhibits marginally increased token consumption on simple tasks, attributed to explicit message passing overhead.
- Heterogeneous, fault-tolerant configuration: The paradigm achieves 63.64% accuracy, exceeding OWL by 8.49 percentage points (OWL: 55.15%), with comparable or superior efficiency for complex tasks (see Figure 3).
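The reported percentages can be sanity-checked against the 165-task validation set. In the short calculation below, the solved-task counts are inferred by rounding and are not stated in the source; the relative gain is likewise derived here and is shown only to distinguish it from the 8.49 percentage-point gap.

```python
TOTAL_TASKS = 165  # GAIA validation tasks, as stated above

reported = {
    "homogeneous (both systems)": 64.24,
    "heterogeneous, proposed": 63.64,
    "heterogeneous, OWL": 55.15,
}

# Inferred task counts: the nearest integer number of solved tasks.
solved = {name: round(pct / 100 * TOTAL_TASKS) for name, pct in reported.items()}

for name, count in solved.items():
    # Each inferred count reproduces the reported percentage to two decimals.
    assert round(100 * count / TOTAL_TASKS, 2) == reported[name]
    print(f"{name}: {count}/{TOTAL_TASKS}")

# Absolute gap in percentage points versus relative gain in percent.
gap_points = round(reported["heterogeneous, proposed"] - reported["heterogeneous, OWL"], 2)
relative_gain = round(100 * gap_points / reported["heterogeneous, OWL"], 1)
print(gap_points, relative_gain)  # 8.49 15.4
```

Under this reading, the heterogeneous gap corresponds to 105 versus 91 solved tasks, a difference of 14 tasks.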

Figure 3: Cumulative distribution functions (CDFs) of token consumption for OWL and the proposed Information-Flow-Orchestrated MAS under different model configurations.
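For readers less familiar with this kind of plot, an empirical CDF gives, for each token budget, the fraction of tasks completed within that budget, so a curve lying further to the left indicates lower consumption. The sketch below uses made-up token counts purely for illustration; they are not the paper's measurements, and `ecdf` is a hypothetical helper.

```python
from bisect import bisect_right
from typing import List, Sequence


def ecdf(samples: Sequence[float], x: float) -> float:
    """Fraction of samples that are less than or equal to x."""
    ordered: List[float] = sorted(samples)
    return bisect_right(ordered, x) / len(ordered)


# Hypothetical per-task token counts, for illustration only (not from the paper).
system_a = [12_000, 30_000, 45_000, 80_000, 150_000]
system_b = [15_000, 28_000, 40_000, 60_000, 90_000]

budget = 60_000
print(ecdf(system_a, budget), ecdf(system_b, budget))  # 0.6 0.8
```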
Crucially, as worker agents are weakened, OWL's heavy reliance on static workflows leads to a sharp drop in accuracy, while the orchestrator-driven system preserves accuracy by dynamically correcting, re-routing, and redefining subtasks in response to partial or erroneous outputs.
Analysis of Emergent Task Coordination and Edge Case Handling
Case-level analysis of execution logs uncovers several distinct coordination patterns that emerge from the information-flow orchestration paradigm. These behaviors are not encoded in any static workflow but arise adaptively as the orchestrator interrogates, refines, or reallocates tasks based on real-time agent responses.
For edge case management, three main orchestrator-initiated strategies are observed.
These adaptive capabilities are inaccessible to systems governed by static, hard-coded workflows, especially as the complexity of agent interaction or uncertainty in agent output increases.
Theoretical and Practical Implications
Transferring the responsibility for workflow supervision and adaptation from human designers to an LLM-powered orchestrator agent represents a critical step toward scalable, generalist MAS with tractable robustness guarantees. In principle, the architecture presents several theoretical advantages:
- The communication-driven design is compatible with arbitrary, open-ended agent topologies and variable agent granularity, enabling scaling beyond fixed decision trees.
- By leveraging prompt engineering for orchestration, the system is amenable to self-supervised learning, meta-reasoning, and potential integration of reinforcement signals for further improvement.
- The paradigm is inherently tool-agnostic, with all agents (including the orchestrator) abstracted as promptable LLM-tool hybrid agents.
Practically, this shift dramatically reduces the labor and fragility associated with edge case enumeration, manual routing, or rigid policy coding, accelerating deployment cycles and enabling real-world, large-scale MAS applications in unbounded environments. The system's success in preserving performance even as agent reliability degrades highlights its fault tolerance and adaptive efficiency.
Future Directions
Extending the proposed paradigm to domain-specific tasks with strong structural priors, or integrating hybrid architectures with partial domain knowledge encoding, are natural next steps. Future research could also focus on optimizing orchestrator prompts for meta-reasoning, introducing learning-driven orchestration policies, or analyzing emergent coordination behaviors as LLM capabilities and tool integration mature.
Conclusion
The information-flow-orchestrated multi-agent system via A2A communication offers an alternative to rule-based workflows for generalist LLM-based MAS. On the GAIA benchmark it matches OWL's accuracy in the homogeneous setting and exceeds it by 8.49 percentage points in the heterogeneous setting with weaker worker agents. The paradigm enables emergent, flexible task coordination and dynamic edge case handling through explicit, natural-language orchestration rather than brittle, human-coded state machines. This agent-driven supervision approach offers a scalable and interpretable alternative for next-generation, adaptable multi-agent systems.