Anemoi: A Semi-Centralized Multi-agent System Based on Agent-to-Agent Communication MCP server from Coral Protocol

Published 23 Aug 2025 in cs.MA and cs.CL | (2508.17068v2)

Abstract: Recent advances in generalist multi-agent systems (MAS) have largely followed a context-engineering plus centralized paradigm, where a planner agent coordinates multiple worker agents through unidirectional prompt passing. While effective under strong planner models, this design suffers from two critical limitations: (1) strong dependency on the planner's capability, which leads to degraded performance when a smaller LLM powers the planner; and (2) limited inter-agent communication, where collaboration relies on costly prompt concatenation and context injection, introducing redundancy and information loss. To address these challenges, we propose Anemoi, a semi-centralized MAS built on the Agent-to-Agent (A2A) communication MCP server from Coral Protocol. Unlike traditional designs, Anemoi enables structured and direct inter-agent collaboration, allowing all agents to monitor progress, assess results, identify bottlenecks, and propose refinements in real time. This paradigm reduces reliance on a single planner, supports adaptive plan updates, and minimizes redundant context passing, resulting in more scalable and cost-efficient execution. Evaluated on the GAIA benchmark, Anemoi achieved 52.73% accuracy with a small LLM (GPT-4.1-mini) as the planner, surpassing the strongest open-source baseline OWL (43.63%) by +9.09% under identical LLM settings. Our implementation is publicly available at https://github.com/Coral-Protocol/Anemoi.

Summary

  • The paper introduces a semi-centralized MAS that uses structured A2A communication via an MCP server to overcome centralized planning limitations.
  • It reports a +9.09 percentage-point accuracy improvement over the OWL system on the GAIA benchmark, attributing the gain mainly to collaborative refinement and reduced context redundancy.
  • The architecture reduces token overhead and supports scalable, modular integration of diverse agent types for real-world applications, although communication latency remains a documented source of errors.

Anemoi: A Semi-Centralized Multi-Agent System Leveraging Agent-to-Agent Communication via MCP Server

Introduction and Motivation

The Anemoi framework introduces a semi-centralized multi-agent system (MAS) architecture that departs from the prevailing context-engineering plus centralized paradigm in generalist MAS. Traditional systems rely on a strong planner agent to coordinate worker agents through unidirectional prompt passing, which results in two major limitations: (1) performance degradation when the planner is powered by a smaller LLM, and (2) inefficient inter-agent collaboration due to prompt concatenation and context injection, leading to redundancy and information loss. Anemoi addresses these issues by enabling structured, direct agent-to-agent (A2A) communication via the MCP server from Coral Protocol, allowing all agents to monitor progress, assess results, and propose refinements in real time (Figure 1).

Figure 1: Architecture of Anemoi, a semi-centralized multi-agent system based on the A2A communication MCP server from Coral Protocol.

System Architecture and Communication Protocol

Anemoi's architecture is built around a dedicated MCP server that facilitates thread-based communication among agents. Each agent connects to the MCP server, which provides primitives for agent discovery, thread management, and message exchange. Threads serve as contextual compartments, ensuring that messages remain within their respective conversations and supporting directed queries and task delegation.
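To make the thread primitives concrete, here is a minimal in-memory sketch of agent discovery, thread management, and mention-based message exchange. The method names `list_agents`, `create_thread`, `send_message`, and `wait_for_mentions` follow the MCP primitives named later on this page; their signatures, the `ThreadHub` class, and the `register` helper are assumptions made for illustration and are not the Coral Protocol API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    content: str
    mentions: tuple[str, ...] = ()


@dataclass
class Thread:
    thread_id: str
    participants: set[str]
    messages: list[Message] = field(default_factory=list)


class ThreadHub:
    """Toy in-memory stand-in for a thread-based A2A server (hypothetical)."""

    def __init__(self) -> None:
        self._agents: dict[str, str] = {}  # agent name -> description
        self._threads: dict[str, Thread] = {}
        self._inbox: dict[str, list[tuple[str, Message]]] = defaultdict(list)

    def register(self, name: str, description: str) -> None:
        """Hypothetical helper: make an agent discoverable."""
        self._agents[name] = description

    def list_agents(self) -> dict[str, str]:
        """Agent discovery: every registered agent and its description."""
        return dict(self._agents)

    def create_thread(self, thread_id: str, participants: list[str]) -> Thread:
        """Thread management: open a conversation limited to known agents."""
        unknown = set(participants) - set(self._agents)
        if unknown:
            raise ValueError(f"unknown agents: {sorted(unknown)}")
        thread = Thread(thread_id, set(participants))
        self._threads[thread_id] = thread
        return thread

    def send_message(self, thread_id: str, sender: str, content: str,
                     mentions: tuple[str, ...] = ()) -> Message:
        """Message exchange: only participants may post or be mentioned."""
        thread = self._threads[thread_id]
        if sender not in thread.participants:
            raise PermissionError(f"{sender} is not in thread {thread_id}")
        outside = set(mentions) - thread.participants
        if outside:
            raise ValueError(f"mentioned agents not in thread: {sorted(outside)}")
        message = Message(sender, content, tuple(mentions))
        thread.messages.append(message)
        for name in mentions:
            self._inbox[name].append((thread_id, message))
        return message

    def wait_for_mentions(self, agent: str) -> list[tuple[str, Message]]:
        """Non-blocking in this sketch: drain pending mentions for `agent`."""
        pending, self._inbox[agent] = self._inbox[agent], []
        return pending


hub = ThreadHub()
hub.register("planner", "drafts the initial plan")
hub.register("web", "runs web searches")
hub.register("critique", "checks contributions")
hub.create_thread("task-1", ["planner", "web"])
hub.send_message("task-1", "planner", "Find the release date.", mentions=("web",))
for thread_id, message in hub.wait_for_mentions("web"):
    print(thread_id, message.sender, message.content)
```

Because the critique agent is not a participant in `task-1`, it can neither post to the thread nor receive its mentions, which is the contextual compartmentalization described above. A real server would block in `wait_for_mentions` and persist threads; both are omitted here.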

The agent composition includes:

  • Planner Agent: Generates the initial plan and initiates coordination.
  • Critique Agent: Evaluates contributions for validity and certainty.
  • Answer-Finding Agent: Compiles and submits the final response.
  • Web Agent: Executes web searches and simulates browser actions.
  • Document Processing Agent: Handles diverse document formats.
  • Reasoning/Coding Agent: Specializes in reasoning, coding, and offline data processing.

All agents are integrated with the MCP toolkit, enabling them to monitor progress, track step completion, and propose new ideas throughout execution. The communication pattern supports dynamic plan refinement and consensus-based answer submission, reducing reliance on a single planner and minimizing redundant context passing (Figure 2).

Figure 2: Overview of Anemoi. The system includes a planner agent that makes the initial plan and a set of worker agents with different capabilities. The A2A communication MCP server lets all agents monitor progress together.
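One way to picture consensus-based answer submission is a simple vote over the answers that agents propose. The text does not specify the consensus rule, so the strict-majority threshold, the `consensus_answer` function, and the text normalization below are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def consensus_answer(proposals: dict[str, str], quorum: float = 0.5) -> Optional[str]:
    """Return the answer backed by more than `quorum` of the agents, else None.

    `proposals` maps an agent name to the answer it proposes. The quorum rule
    is a hypothetical choice, not the rule used by Anemoi.
    """
    if not proposals:
        return None
    # Normalize so trivially different spellings count as the same vote.
    votes = Counter(answer.strip().lower() for answer in proposals.values())
    answer, count = votes.most_common(1)[0]
    return answer if count / len(proposals) > quorum else None


agreed = consensus_answer({"web": "Paris", "reasoning": "paris ", "document": "Lyon"})
split = consensus_answer({"web": "Paris", "document": "Lyon"})
print(agreed)  # two of three agents agree, so an answer is returned
print(split)   # no strict majority, so the result is None
```

When the function returns `None`, a system following this pattern would keep the thread open for further critique and refinement rather than submit an answer.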

Experimental Evaluation

Benchmark and Baselines

Anemoi was evaluated on the GAIA benchmark, which comprises real-world, multi-step tasks requiring web search, multi-modal file processing, and coding capabilities. The experimental setup ensured parity with the strongest open-source baseline, OWL, by using identical worker agent configurations and toolkits. The planner agent in Anemoi was powered by GPT-4.1-mini, while worker agents used GPT-4o. This configuration was chosen to highlight the robustness of the semi-centralized paradigm under weaker planner models.

Performance Results

Anemoi achieved an accuracy of 52.73% on the GAIA validation set (pass@3), outperforming OWL (43.63%) by +9.09 percentage points under identical LLM settings. Notably, Anemoi also surpassed several proprietary and open-source frameworks that employed stronger LLMs, demonstrating the efficacy of the A2A communication paradigm in mitigating the limitations of context-engineering-based coordination.
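The gain is an absolute difference in accuracy, not a relative improvement, and the two are easy to confuse. The short calculation below uses only the two reported accuracies. Subtracting the rounded figures gives 9.10 rather than the stated 9.09, presumably because the stated gain was computed from unrounded results; that explanation is an assumption.

```python
anemoi_accuracy = 52.73  # reported, percent
owl_accuracy = 43.63     # reported, percent

# Absolute gain, in percentage points.
absolute_gain = anemoi_accuracy - owl_accuracy
# Relative gain, as a percentage of OWL's accuracy.
relative_gain = absolute_gain / owl_accuracy * 100

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.1f}% over OWL")
```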

Comparative Task Attribution and Error Analysis

Task Attribution Analysis

A detailed comparison of task attribution between Anemoi and OWL revealed that Anemoi's additional successes were primarily due to collaborative refinement (52%) and reduced context redundancy (8%), with the remainder attributed to stochastic worker behavior (40%). Conversely, OWL's successes over Anemoi were predominantly due to stochastic worker behavior (90%) and, to a lesser extent, communication latency (10%) (Figure 3).

Figure 3: Comparison of task attribution categories between Anemoi and OWL. The donut chart illustrates the distribution of reasons why Anemoi succeeded where OWL failed, and vice versa.

Error Analysis

Anemoi's remaining errors were analyzed, with the largest fraction attributed to LLM capability limitations (45.6%), followed by toolkit limitations (20.6%), incorrect plans (11.8%), communication latency (10.3%), annotation mistakes (7.4%), and LLM hallucinations (4.4%). The error profile underscores the importance of further improving agent toolkits and LLM reliability, as well as optimizing communication latency in agent orchestration (Figure 4).

Figure 4: Remaining errors of Anemoi.

Implementation Considerations and Trade-offs

The Anemoi system demonstrates that semi-centralized coordination via A2A communication can sustain high performance even with weaker planner models, provided that worker agents are sufficiently capable. The thread-based MCP server architecture offers contextual isolation and efficient message routing, reducing token overhead and improving scalability. However, the system's performance is still bounded by the capabilities of the underlying LLMs and toolkits, and communication latency can impact task completion in time-sensitive scenarios.
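A toy cost model suggests why thread-scoped messaging can reduce token overhead compared with prompt concatenation. It is an idealized sketch, not a measurement from the paper: it assumes every step produces the same number of tokens, that a concatenating orchestrator re-sends all earlier outputs at each step, and that a thread message is read once by its addressee. The function names and the example numbers are hypothetical.

```python
def concatenation_tokens(steps: int, tokens_per_output: int) -> int:
    """Step k carries the outputs of the k-1 earlier steps plus its own."""
    return sum(k * tokens_per_output for k in range(1, steps + 1))


def thread_tokens(steps: int, tokens_per_output: int) -> int:
    """Each message is posted to its thread once and read once."""
    return steps * tokens_per_output


steps, tokens_per_output = 10, 500  # illustrative values only
concatenated = concatenation_tokens(steps, tokens_per_output)
threaded = thread_tokens(steps, tokens_per_output)
print(concatenated, threaded, concatenated / threaded)
```

Under these assumptions the concatenation cost grows quadratically with the number of steps while the threaded cost grows linearly. Real agents also re-read some thread history, so the actual saving would be smaller than this model implies.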

Resource requirements are moderate, as the MCP server can be deployed on standard cloud infrastructure, and agent orchestration scales linearly with the number of agents. The modular design facilitates integration of new agent types and toolkits, supporting extensibility for domain-specific applications.

Implications and Future Directions

The Anemoi framework advances the state of MAS by demonstrating that direct, structured inter-agent communication can overcome the bottlenecks of centralized planning and context engineering. The empirical results suggest that future MAS architectures should prioritize adaptive, consensus-driven coordination and minimize reliance on prompt concatenation. Further research should focus on enhancing agent toolkits, improving LLM reliability, and optimizing communication protocols to reduce latency.

Potential future developments include:

  • Integration of more diverse agent types (e.g., multimodal reasoning, external API agents).
  • Exploration of decentralized consensus mechanisms for fully distributed MAS.
  • Application of Anemoi in real-world domains such as autonomous research, enterprise automation, and collaborative robotics.

Conclusion

Anemoi introduces a robust semi-centralized MAS architecture leveraging A2A communication via the MCP server, enabling scalable, adaptive, and cost-efficient agent coordination. The system achieves strong empirical performance on the GAIA benchmark, particularly under weaker planner models, and provides a blueprint for future MAS designs that emphasize direct inter-agent collaboration and dynamic plan refinement. The results highlight the practical and theoretical advantages of structured agent communication, marking a significant step toward scalable, generalist multi-agent AI systems.

Practical Applications

Immediate Applications

  • Cost-optimized multi-agent orchestration for knowledge work — Sector: software, enterprise IT — What: Replace centralized, prompt-concatenation orchestrators with Anemoi’s semi-centralized A2A threads to coordinate web search, document processing, and coding agents for tasks like report generation, competitive analysis, and KPI dashboards — Tools/Workflow: Coral Protocol A2A MCP server, Planner (small LLM), Web/Document/Reasoning-Coding workers, Critique + Answer-Finding gating — Dependencies/Assumptions: Access to GPT-4.1-mini (or similar) for planner and stronger worker LLMs; stable toolkits for web automation and file I/O; network reliability for thread latency
  • Engineering productivity copilot for triage, reproduction, and patching — Sector: software engineering — What: A2A-coordinated agents to reproduce bugs (web/file/state capture), propose fixes (Reasoning/Coding agent), verify with Critique agent, and package patches — Tools/Workflow: A2A threads bound to issue trackers (Jira/GitHub), CI triggers, unit/regression test scaffolding, consensus check before PR — Dependencies/Assumptions: Secure repo access; tool-use guardrails; deterministic test environments; audit needs addressed via thread logs
  • Enterprise RPA with verifiable handoffs — Sector: operations, finance back office — What: Semi-centralized agents automate invoice extraction, reconciliations, spreadsheet transformations, and cross-system updates with critique-verified checkpoints — Tools/Workflow: Document Processing agent (PDF/DOCX), Reasoning/Coding agent (Excel/CSV scripting), Web agent (portal updates), consensus/approval steps for SOX/compliance — Dependencies/Assumptions: OCR/data-extraction quality; structured access to ERPs/CRMs; human-in-the-loop for exceptions
  • Customer support co-resolution unit — Sector: customer service — What: Web agent mines KB/forums, Document agent pulls internal SOPs, Reasoning/Coding agent drafts fix scripts/macros, Critique enforces policy/compliance, Answer-Finding finalizes response — Tools/Workflow: CRM integration, A2A threads per ticket, escalation templates, consensus before customer-facing messages — Dependencies/Assumptions: Accurate KB indexing; rate limits on external sites; data-privacy constraints; latency SLAs
  • Research/literature assistant with self-critique — Sector: academia, R&D — What: Multi-agent literature review, PDF table extraction, code replication (Reasoning/Coding agent), structured critique and consensus before drafting summaries — Tools/Workflow: A2A thread per research question, citation and evidence trackers, replication notebooks, final synthesis by Answer-Finding agent — Dependencies/Assumptions: Publisher access rights; accurate PDF parsing; compute sandbox for code execution
  • Audit-ready workflow logs and provenance — Sector: governance, compliance — What: Use thread compartmentalization to capture who said what, when, and why for audit, reducing prompt-sprawl and information loss — Tools/Workflow: MCP thread archives, mention-based gating, consensus snapshots, immutable log storage — Dependencies/Assumptions: Retention policies; PII redaction; secure storage; regulator-acceptable traceability format
  • Data labeling and review with agent consensus — Sector: ML/AI ops — What: Worker agents label examples (text, audio, doc), Critique agent flags uncertainty, consensus voting yields high-confidence labels — Tools/Workflow: Active learning loop, pass@k sampling, uncertain-case escalation, A2A adjudication — Dependencies/Assumptions: Toolchains for multi-modal ingestion; annotation policy encoding; budget for multi-pass labeling
  • Education content and assessment generation — Sector: education — What: Web and Document agents source materials, Reasoning/Coding agent builds graded quizzes and auto-checkers, Critique validates correctness and levels, Answer-Finding produces educator-ready packs — Tools/Workflow: LMS connectors, difficulty calibration policies, per-topic A2A threads — Dependencies/Assumptions: Content licensing; bias/age-appropriateness checks; accessibility standards
  • Low-cost planner swap for existing agent stacks — Sector: software platforms — What: Retrofit centralized MAS (e.g., OWL-like) to Anemoi’s A2A pattern to keep strong workers but downsize the planner for cost and scalability gains — Tools/Workflow: Adapter layer from CAMEL/LangChain/Assistants to MCP primitives (list_agents, create_thread, send_message, wait_for_mentions) — Dependencies/Assumptions: Compatibility of tool APIs; correct agent discovery; observability to detect planner capability gaps
  • Incident response triage co-pilot — Sector: cybersecurity, IT ops — What: Web agent ingests threat intel, Document agent parses logs/reports, Reasoning agent correlates indicators, Critique validates hypotheses, consensus gates containment recommendations — Tools/Workflow: SIEM/SOAR integration, thread-based playbooks, consensus thresholds for auto-actions — Dependencies/Assumptions: Secure data paths; strict role separation; false-positive handling; time-bound latencies

Long-Term Applications

  • Cross-organization “Internet of Agents” for supply chains — Sector: manufacturing, logistics — What: Semi-centralized A2A threads across firms to reconcile orders, shipment events, and quality docs; consensus to resolve mismatches — Tools/Workflow: Inter-org MCP federation, identity/trust layers, provenance and payments — Dependencies/Assumptions: Standards for agent identity and auth; legal agreements; robust cross-domain security
  • Clinical workflow co-pilot with adaptive planning — Sector: healthcare — What: Agents coordinate prior auth, coding, chart abstraction, and guideline checks with critique-based safety gating and detailed thread logs — Tools/Workflow: EHR connectors, medical coding toolkits, safety policies and clinician-in-the-loop consensus — Dependencies/Assumptions: Regulatory approval (HIPAA/GDPR), rigorous validation, domain-tuned LLMs, fault-tolerant latency
  • Legal e-discovery and case assembly — Sector: legal — What: Multi-agent ingestion of large corpora, timeline building, precedent linking, with Critique ensuring citation fidelity and Answer-Finding producing briefs — Tools/Workflow: MCP-driven doc pipelines, chain-of-custody logs, opposing-argument simulation threads — Dependencies/Assumptions: Access to licensed legal databases; high-precision extraction; liability and confidentiality controls
  • Financial due diligence and KYC/AML orchestration — Sector: finance — What: Web and Document agents screen news/filings/sanctions, Reasoning agent consolidates risk narratives, Critique tests contradictions; consensus gates onboarding decisions — Tools/Workflow: Core banking/CRM APIs, risk policy codification, explainable thread logs for regulators — Dependencies/Assumptions: Data source reliability; regulator-acceptable audit trails; bias mitigation; strict latency/uptime
  • Autonomous software teams for feature delivery — Sector: software — What: Planner decomposes epics, workers design/implement/tests, Critique enforces standards, consensus ships via CI/CD — Tools/Workflow: Agent IDEs, design-review threads, test-generation pipelines, release gating via votes — Dependencies/Assumptions: Stronger reasoning/coding LLMs; robust sandboxing; IP/security policies; human oversight norms
  • Multi-robot and cyber-physical coordination — Sector: robotics, manufacturing, logistics — What: Extend A2A to embodied agents with real-time state sharing, bottleneck detection, and plan adaptation — Tools/Workflow: RT messaging bridges to ROS/OPC-UA, safety monitors, failover consensus — Dependencies/Assumptions: Hard real-time constraints; safety certification; reliable localization/sensing; edge compute
  • Government digital casework and FOIA processing — Sector: public sector — What: Agents triage requests, extract records, redact PII, and compile responses with critique-based compliance checks — Tools/Workflow: Records systems connectors, redaction toolchains, policy-encoded critique — Dependencies/Assumptions: Statutory compliance; public-records access; auditability; accessibility requirements
  • Scientific discovery pipelines — Sector: science/biotech — What: Literature grounding, protocol design, data analysis scripts, and replication/critique loops; integrate with lab automation long-term — Tools/Workflow: A2A experimental planning, notebook generation, provenance tracking, lab-robot APIs — Dependencies/Assumptions: High-accuracy extraction; domain-tuned models; wet-lab integration; IRB/ethics oversight
  • Edge/private deployments with small planners — Sector: privacy-first enterprises, defense — What: Run planners on-prem/edge while bursting worker tasks to cloud for heavy lifting, reducing data egress — Tools/Workflow: Hybrid MCP topology, policy-based data routing, latency-aware scheduling — Dependencies/Assumptions: Reliable split-compute orchestration; strict data-classification rules; offline fallbacks
  • Standardization and policy for agent interoperability — Sector: policy, standards — What: Advance MCP-like A2A as an interop standard (identity, observability, safety hooks), enabling regulated, multi-vendor agent ecosystems — Tools/Workflow: Compliance profiles, conformance suites, security best practices for agent messaging — Dependencies/Assumptions: Broad vendor buy-in; governance bodies; reference implementations; red-team evaluations
  • SOC and fraud operations with layered consensus — Sector: cybersecurity, fintech — What: Multi-agent alert triage, case enrichment, hypothesis testing, and consensus-driven containment or case escalation — Tools/Workflow: Playbook threads, risk-scoring critique modules, action gating with human approval — Dependencies/Assumptions: High-fidelity signals; adversarial robustness; strict false-positive costs; round-trip SLAs
  • Safety-aligned, injection-resilient agent networks — Sector: AI safety, platform security — What: Leverage contextual compartmentalization and explicit mentions to reduce prompt injection spread; critique agents for continuous red-teaming — Tools/Workflow: Safety policies embedded in critique, sandboxed tool execution, cross-thread anomaly detection — Dependencies/Assumptions: Mature detection methods; telemetry across agents; standardized incident handling

Notes on feasibility across applications:

  • The paper’s demonstrated gains on GAIA with a small planner suggest immediate cost/performance benefits when swapping planners, but reliability still depends on worker tool quality and handling of web-agent latency.
  • Thread compartmentalization and explicit agent mentions improve auditability and reduce token overhead now; high-stakes domains will require extensive validation, domain-tuned models, and human oversight.
  • Open-source availability of Anemoi and the Coral Protocol MCP server accelerates prototyping; production deployments must address security, privacy, and regulatory compliance from the outset.
