AWorld: Orchestrating the Training Recipe for Agentic AI

Published 28 Aug 2025 in cs.AI | (2508.20404v1)

Abstract: The learning from practice paradigm is crucial for developing capable Agentic AI systems, yet it is severely hampered by inefficient experience generation, a bottleneck especially pronounced in complex benchmarks like GAIA. To address this, we introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. This critical speedup makes extensive reinforcement learning practical and scalable. Leveraging this capability, we trained a Qwen3-32B-based agent that significantly outperforms its base model, increasing its overall GAIA accuracy from 21.59% to 32.23%. On the benchmark's most challenging levels, our agent achieves a score of 16.33%, surpassing the performance of leading proprietary models. Our open-source system and resulting agent provide a practical blueprint for a complete agentic AI training pipeline, from efficient interaction to demonstrable model improvement.

Abstract PDF Upgrade to Chat

Summary

The paper introduces AWorld, which achieves a 14.6x faster rollout generation via distributed processing to enhance agentic AI performance on GAIA.
It integrates modular components including agent construction, communication protocols, and state management to streamline scalable reinforcement learning.
Empirical results demonstrate substantial improvements in model pass rates, underscoring efficient experience generation as critical for robust agentic learning.

AWorld: Orchestrating the Training Recipe for Agentic AI

Introduction

The paper presents AWorld, an open-source framework designed to address the computational bottlenecks in agentic AI training, particularly those arising from inefficient experience generation in complex environments. The work is motivated by the limitations of current LLMs in solving real-world, multi-step reasoning tasks, as exemplified by the GAIA benchmark. AWorld integrates model selection, runtime construction, communication protocols, and training orchestration to enable scalable, efficient agent-environment interaction and reinforcement learning.

Figure 1: AWorld enables substantial performance gains on GAIA by accelerating experience generation 14.6x via distributed rollouts, making RL practical for large models.

Framework Architecture

AWorld is architected to support the full lifecycle of agentic AI development, from agent instantiation to distributed training. The framework is modular, comprising four principal components:

Agent Construction: Agents are instantiated with configurable logic, toolsets, and planning capabilities. The system supports prompt assembly, custom tool integration (e.g., browser automation, code execution), and flexible agent topologies for multi-agent collaboration.
Communication Protocols: A unified message-passing architecture enables robust communication between agents, tools, and environments. The Message API supports event-driven workflows, error handling, and extensibility for complex task coordination.
Runtime State Management: Distributed execution is managed via Kubernetes, allowing high-concurrency rollouts across sandboxed environments. State consistency is maintained through centralized trace servers and remote storage, supporting recovery and large-scale evaluation.
Training Orchestration: AWorld decouples rollout generation from model training, integrating with external RL frameworks (e.g., SWIFT, OpenRLHF) to enable scalable policy optimization.
Figure 2: The AWorld architecture supports both forward (experience generation) and backward (policy optimization) passes, enabling closed-loop agentic learning.

Figure 3: Message workflow in AWorld runtime, illustrating agent query handling and inter-component communication.

Distributed Rollout and Experience Generation

AWorld's core innovation lies in its distributed rollout engine, which orchestrates massive parallelization of agent-environment interactions. By leveraging Kubernetes, the system can concurrently execute thousands of long-horizon tasks, each encapsulated in isolated pods. This design overcomes the resource contention and instability inherent in single-node setups, making large-scale experience generation tractable.

Figure 4: Massively parallel rollouts in AWorld, managed by Kubernetes, enable scalable data generation for RL.

Figure 5: Action-state rollout demonstration, with multiple pods running in parallel to maximize experience throughput.

Empirical Analysis: Rollout Scale and Agent Performance

The paper provides a systematic analysis of the relationship between rollout scale and agent performance on the GAIA benchmark. Experiments demonstrate a universal trend: increasing the number of rollouts per task leads to substantial improvements in pass rates for all evaluated models. For example, Claude-3.7-Sonnet's pass@1 increases from 47.9% to 76.4% as rollouts scale from 1 to 32, while GPT-4o's pass@1 more than doubles.

Figure 6: Pass rate as a function of rollout scale on GAIA; all models benefit from increased interaction attempts.

This result empirically validates the necessity of efficient, large-scale experience generation for agentic learning. The bottleneck in agent training is not model optimization but the ability to generate sufficient successful trajectories for RL.

Efficiency Gains and Training Outcomes

AWorld achieves a 14.6x speedup in rollout generation compared to sequential single-node execution, reducing the total cycle time for rollout and training from 7839s to 669s. This efficiency shift enables practical RL on complex benchmarks.

Using AWorld, the authors fine-tune Qwen3-32B via SFT (on 886 successful trajectories) followed by RL (using GRPO and rule-based rewards). The resulting agent, Qwen3-32B-AWorld, achieves a pass@1 of 32.23% on GAIA, a 10.6 percentage point improvement over the base model. On the most challenging Level 3 questions, it attains 16.33%, outperforming all compared models, including GPT-4o and DeepSeek-V3. Notably, the agent also generalizes to xbench-DeepSearch, improving from 12% to 32% without direct training, indicating robust skill acquisition rather than overfitting.

Practical and Theoretical Implications

AWorld provides a scalable infrastructure for agentic AI, enabling efficient RL in environments with high resource demands and long-horizon tasks. The framework's modularity and integration capabilities position it as a practical solution for both single-agent and multi-agent systems. The empirical results challenge the notion that model size or architecture alone determines agentic performance; instead, rollout efficiency and experience diversity are critical.

Theoretically, the work underscores the importance of the "learning from practice" paradigm and the need for frameworks that optimize the entire agent-environment-training loop. The demonstrated generalization across benchmarks suggests that distributed RL with sufficient experience can yield agents with transferable reasoning capabilities.

Future Directions

The authors outline a roadmap for extending AWorld to support collective intelligence, expert societies of specialized agents, and autonomous self-improvement. Key research directions include multi-agent collaboration in heterogeneous environments, domain-specific agent specialization, and continuous, self-sustaining learning loops.

Conclusion

AWorld addresses the central bottleneck in agentic AI training by enabling efficient, distributed experience generation and seamless integration with RL frameworks. The framework's design and empirical validation establish rollout efficiency as the primary determinant of agentic performance on complex tasks. AWorld provides a practical blueprint for scalable agentic AI development and opens avenues for future research in collective and self-improving intelligence.

Markdown

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Paper Prompts

Top Community Prompts

Explain it Like I'm 14

off on

Knowledge Gaps

off on

Glossary

off on

Practical Applications

off on

Conceptual Simplification

off on

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, framed to be concrete and actionable for future research.

Generalizability beyond GAIA: Evaluate AWorld across diverse, realistic agentic environments (e.g., WebArena, OSWorld, web shopping, coding agents) and report cross-domain performance and failure characteristics.
Multi-agent capabilities: The framework supports inter-agent communication, but there are no multi-agent experiments; quantify collaboration benefits, scalability, role assignment, and coordination protocols under varying team sizes and topologies.
Reward design limitations: RL uses a binary exact-match reward; investigate process rewards, partial credit, reward shaping, learned evaluators (RLAIF), and robustness to noisy or incomplete ground-truth.
Absence of learning curves and sample efficiency: Provide learning curves versus number of trajectories/updates; measure sample efficiency (success per rollout) and data-efficiency gains attributable to AWorld.
SFT vs RL contribution: Ablate the relative impact of the 886-trajectory SFT warm-start versus subsequent RL; quantify the minimal SFT needed and its interaction with rollout scale.
Adaptive rollout allocation: Study dynamic rollout budgets per task (e.g., bandit or uncertainty-based allocation) instead of fixed 32 rollouts; quantify cost-benefit and diminishing returns.
Baseline fairness and alternatives: The single-node baseline is sequential; compare against optimized single-node parallel/asynchronous executors, Ray-based runners, or other distributed rollout frameworks to isolate AWorld’s architectural advantages.
Scaling limits and curves: Report throughput/concurrency scaling with cluster size (pods, GPUs, CPUs), network bandwidth, and inference server capacity; identify bottlenecks (trace server, storage, scheduler) and Amdahl’s/Gustafson’s limits.
Failure modes and reproducibility: Characterize tool/environment failures, timeouts, non-determinism, and state drift; implement deterministic replay, checkpointing, and fault injection tests with quantitative reliability metrics.
Cost, energy, and carbon footprint: Provide compute hours, dollar costs, and energy usage per point of improvement to assess practical viability and efficiency; compare to alternative training recipes.
Safety, security, and compliance: Evaluate sandbox isolation, tool permissioning, data exfiltration risks, web interaction ethics/ToS compliance, and red-teaming; define and measure safety guardrails.
Data curation transparency: The SFT dataset (886 trajectories) curated via Claude is not described in detail; release or document selection criteria, domain coverage, potential GAIA leakage/overlap, bias, and licensing constraints.
Tool use analytics: Analyze per-tool usage, latency, error rates, and contribution to success; perform ablations on tool availability and tool-selection policies to identify high-impact integrations.
Algorithmic details and stability: Report GRPO hyperparameters, advantage estimation, variance reduction, on-/off-policy considerations, policy lag in distributed updates, and convergence diagnostics under long horizons.
Exact-match evaluation risks: GAIA correctness via exact match may penalize semantically correct but differently formatted answers; add tolerant/semantic evaluation and human assessment for borderline cases.
Test contamination and splits: Clarify training/evaluation splits on GAIA (val vs test) to avoid leakage; quantify statistical significance and uncertainty (confidence intervals) given the small test size (165 questions).
Generalization beyond search tasks: xbench-DeepSearch results suggest generalization, but broader out-of-domain tests (code reasoning, math, multimodal tasks) are missing; design systematic cross-domain evaluation.
Long-horizon credit assignment: Explore hierarchical RL, options/skills, memory mechanisms, and reflection to improve credit assignment over extended trajectories; compare to flat policy gradients.
Latency–throughput trade-offs: Detail vLLM batching, scheduling policies, and Kubernetes autoscaling; quantify pipeline stalls and end-to-end latency vs throughput trade-offs under varied load.
Multi-modality training: Tools include image/audio servers, but multimodal RL training and evaluation are not reported; assess VLM integration and performance on multimodal GAIA items.
Agent memory and retrieval: The framework mentions memory/context assembly, but there is no evaluation of persistent memory, retrieval augmentation, or their effect on long-horizon performance.
Central trace server as a bottleneck: Examine the trace server’s scalability, redundancy, and failure recovery; evaluate distributed logging/metadata services to remove single points of failure.
Deployment on modest hardware: Study performance on CPU-only clusters or smaller GPU setups; quantify what configurations make AWorld accessible to non-enterprise users.
Security of code execution: Assess sandbox escape risks, resource abuse, and privilege separation in e2b-code-server/terminal-controller; propose and test hardened policies.
Answer extraction and finalization: Document how final answers are extracted/validated from trajectories; compare different answer finalization strategies and their impact on correctness.
Pass@k plateau analysis: Investigate why pass@k gains plateau around 10–15 rollouts; develop policies or planning strategies that reduce required rollouts rather than merely accelerating them.
Comparative framework benchmarks: Benchmark AWorld against existing agents frameworks (LangChain, OpenAgents, OpenAI Agents SDK, AutoGPT) for rollout efficiency, reliability, and training integration.
Full reproducibility kit: Release seeds, exact versions, Kubernetes manifests, container images, and configs required to deterministically reproduce the reported 14.6× speedup and GAIA improvements.

View Paper Prompt View All Prompts

Glossary

A2A protocol: Google's Agent-to-Agent communication protocol for coordinating multiple agents. "inter-agent communication (e.g., Google's A2A protocol~\citep{googlea2a})."
ActionModel: A structured representation of an agent’s chosen action within the system’s messaging payloads. "payload (Any): Main content (e.g., ActionModel, Observation, TaskItem)."
Advantage estimation: An RL technique to compute how much better an action is compared to a baseline for gradient updates. "we employ the GRPO algorithm for advantage estimation and gradient update computation."
Agent-as-tool: A design where one agent is exposed as a callable tool for another agent. "including browser-based interfaces, terminal emulators, and agent-as-tool functionalities."
Agent Construction: The process of defining an agent’s logic, tools, prompts, and planning capabilities. "Agent Construction: We begin at the foundational level by simplifying the instantiation of an individual agent, defining its core logic, toolset, and planning capabilities."
Agentic AI: AI systems that act as autonomous agents interacting over long trajectories in complex environments. "AWorld, a framework for Agentic AI designed around the ``learning from practice'' paradigm."
AReaL: A reinforcement learning framework integrated for training agents. "including OpenRLHF~\citep{hu2024openrlhf}, VeRL~\citep{sheng2025hybridflow}, AReaL~\citep{fu2025areal}, and SWIFT~\citep{zhao2025swift}"
AWorld Executor: The distributed rollout engine that executes agent tasks concurrently across environments. "As shown in \cref{tab:rollout_time}, the efficiency gains are substantial. The AWorld Executor completes the rollout phase in just $525$ seconds,"
Backward Pass: The training flow where collected trajectories are used to update the agent via RL. "a Backward Pass (bottom; \ding{175}-\ding{177}), where these trajectories are used as experience to train agents and optimize the entire system via reinforcement learning."
Combinatorial action space: A large set of possible actions and parameters that agents must search over. "characterized by a combinatorial action space of diverse tools and their parameters,"
DeepSpeed ZeRO3: A memory-optimization strategy enabling large-model training across GPUs. "memory-intensive optimization strategies such as DeepSpeed ZeRO3~\citep{rajbhandari2020zero}."
Distributed architecture: A system design that spreads computation and state across multiple nodes for scalability. "AWorld adopts a distributed architecture that prioritizes robustness and scalability."
End-to-End learning-from-practice pipelines: Complete workflows from interaction to training that continuously improve agents. "position AWorld as an ideal platform for implementing End-to-End learning-from-practice pipelines,"
Epoch timestamp: A time value representing seconds since the Unix epoch used for message creation. "timestamp (float): Epoch timestamp for message creation."
Event-driven communication: A messaging pattern where agents react to events to coordinate and delegate tasks. "This event-driven communication enables agents to notify, coordinate, and delegate tasks dynamically,"
Exploration (rollout): The RL phase where agents interact with environments to collect experience. "A standard RL algorithm, such as GRPO, generally involves three key stages: exploration (rollout), feedback (reward), and learning (gradient update)."
Forward Pass: The execution flow where agents interact with environments to generate trajectories. "a Forward Pass (top; \ding{172}-\ding{174}), where agents are constructed and interact with complex environments to generate task-solving trajectories;"
GAIA benchmark: A challenging evaluation suite for agentic AI with complex, multi-step tasks. "on the challenging GAIA benchmark~\citep{mialon2023gaia}, even closed-source models like GPT-4~\citep{achiam2023gpt} achieve only 3.99\% accuracy,"
GRPO: A policy-gradient RL algorithm used for exploration, reward modeling, and updates. "A standard RL algorithm, such as GRPO, generally involves three key stages: exploration (rollout), feedback (reward), and learning (gradient update)."
Headless Excel engine: An Excel-processing service without a GUI for reading/writing .xlsx sheets. "excel & Lightweight headless Excel engine that reads and writes on .xlsx sheets."
Heterogeneous workloads: Diverse task types and resource demands managed across a cluster. "By efficiently managing heterogeneous workloads and maximizing resource utilization,"
High Concurrency: The capability to run many agent tasks simultaneously across a cluster. "High Concurrency."
Inference engine: The serving component that generates actions from the current model during rollout. "interacts with the inference engine of the RL framework to query actions at each step."
Inter-agent communication: Messaging between separate agents for coordination and collaboration. "inter-agent communication (e.g., Google's A2A protocol~\citep{googlea2a})."
Intra-agent communication: Interaction between an agent’s internal components, such as models and tools. "intra-agent communication between models and tools (e.g., Anthropicâs MCP~\citep{mcp2024});"
Kubernetes: A container orchestration system used for scheduling parallel rollout environments. "managed by Kubernetes"
LLMs: Powerful neural models trained on large corpora for language tasks. "LLMs~\citep{anthropic2024claude37sonnet,touvron2023llama,team2023gemini}"
Long-horizon reasoning: Multi-step reasoning across extended sequences of actions and observations. "and the necessity for long-horizon reasoning across extended trajectories."
Message object: The core data structure for sending messages between agents, users, and tools. "AWorld utilizes the Message object as the core abstraction to unify three primary communication channels:"
Model Context Protocol (MCP): A protocol for standardized tool/model interaction contexts. "including available tools (such as MCP tools),"
Multi-agent workflows: Task flows where multiple agents coordinate and collaborate. "supporting both single-agent and multi-agent workflows."
OpenRLHF: An RLHF framework integrated with AWorld for training pipelines. "including OpenRLHF~\citep{hu2024openrlhf}, VeRL~\citep{sheng2025hybridflow}, AReaL~\citep{fu2025areal}, and SWIFT~\citep{zhao2025swift}"
Pass@1: The metric measuring success rate when the agent makes a single attempt. "It achieves a pass@1 score that is highly competitive with frontier proprietary models like GPT-4o."
Pass@k: The metric measuring success rate when the agent is allowed k attempts. "We plot the pass@k success rate for three leading models on the full $165$-question GAIA validation set,"
Pod: A Kubernetes execution unit encapsulating an environment for agent rollout. "execution unit known as a pod."
Pub-sub patterns: A messaging paradigm where senders publish to topics and receivers subscribe. "topic (Optional[str]): Topic-based routing channel (for pub-sub patterns)."
Qwen3-32B: An open-source foundation LLM serving as the base model for the agent. "By using the AWorld framework to fine-tune and conduct reinforcement learning on the Qwen3-32B base model,"
Reinforcement Learning (RL): A learning paradigm where agents improve via feedback from environment interactions. "There are two commonly used methods to improve the capabilities of the underlying LLM in post-training phase, namely SFT and RL."
Resource contention: Performance degradation due to competing processes for limited CPU/memory. "leads to severe resource contention and process instability,"
Rollout: The process of executing agent actions in an environment to generate experience trajectories. "experience generation (rollout) phase"
Rollout module: The component in RL frameworks that manages environment interaction during exploration. "Specifically, the rollout module in conventional frameworks is replaced with the AWorld Executor."
Sandboxed environment: An isolated execution environment for secure, reproducible tool actions. "interact with a sandbox, thereby ensuring secure and reproducible execution."
State consistency: Ensuring coherent and synchronized task state across distributed nodes. "State consistency across distributed nodes is maintained via synchronized remote data storage and a centralized trace server,"
Supervised Fine-Tuning (SFT): Post-training using labeled data to prime an agent’s initial policy. "SFT typically requires high-quality human-labeled data or a complex data synthesis pipeline,"
SWIFT framework: An RL training framework integrated for fine-tuning and updates. "processed by the SWIFT framework~\citep{zhao2025swift}, which orchestrates the model's fine-tuning and reinforcement learning updates."
Topic-based routing channel: A messaging field used to route messages via named topics. "topic (Optional[str]): Topic-based routing channel (for pub-sub patterns)."
Trace server: A centralized service capturing traces and metadata for distributed execution. "and a centralized trace server,"
Train-inference decoupled architecture: A design separating training resources from inference/rollout resources. "Our setup employs a train-inference decoupled architecture to optimize resource utilization for agent training."
Training Orchestration: The coordination of rollout, reward, and gradient update across systems. "Training Orchestration: Finally, to complete the ``learning from practice'' loop, the framework channels the vast experiential data generated by the runtime into external training modules,"
UUID: A universally unique identifier used to tag messages. "id (str): Unique message identifier (UUID)."
vLLM: A high-throughput LLM serving system used during rollouts. "AWorld leverages the vLLM~\citep{kwon2023efficient} to manage high-throughput agent inference and interaction with the environment."
VeRL: A reinforcement learning framework integrated for agent training. "including OpenRLHF~\citep{hu2024openrlhf}, VeRL~\citep{sheng2025hybridflow}, AReaL~\citep{fu2025areal}, and SWIFT~\citep{zhao2025swift}"
Wall-clock time: Actual elapsed time measured for a full rollout-training cycle. "We measure the wall-clock time for a full cycle of experience generation and model training,"
Worker nodes: The cluster machines that execute agent rollouts and tasks. "across a distributed cluster of worker nodes."

View Paper Prompt View All Prompts

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Generate Now

Continue Learning

Authors (17)

First 10 authors:

Collections

Tweets

alphaXiv

AWorld: Orchestrating the Training Recipe for Agentic AI (47 likes, 0 questions)

AWorld: Orchestrating the Training Recipe for Agentic AI

Summary

AWorld: Orchestrating the Training Recipe for Agentic AI

Introduction

Framework Architecture

Distributed Rollout and Experience Generation

Empirical Analysis: Rollout Scale and Agent Performance

Efficiency Gains and Training Outcomes

Practical and Theoretical Implications

Future Directions

Conclusion

Paper to Video (Beta)

Whiteboard

Paper Prompts

Top Community Prompts

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Glossary

Open Problems

Continue Learning

Related Papers

Authors (17)

Collections

Tweets

alphaXiv