Open-World Multi-Task Agents
- Open-world multi-task agents are autonomous systems that tackle an unbounded variety of tasks in dynamic, partially known environments.
- They employ hierarchical skill libraries and graph-based planning to decompose long-horizon objectives into manageable sub-tasks.
- Recent frameworks show enhanced performance via adaptive memory, multi-agent collaboration, and robust dependency learning in simulated benchmarks.
Open-world multi-task agents are autonomous systems designed to perceive, plan, and act across a vast—and often growing—space of tasks and environments, including challenges where the full set of goals, dependencies, and environmental contingencies is unknown at design time. Unlike traditional single-task or closed-set agents, open-world multi-task agents must operate under partial knowledge, tackle compositional and long-horizon objectives, leverage diverse forms of memory and adaptation, and often coordinate with other agents or humans. Recent research has produced a variety of architectures and evaluation frameworks benchmarking these capabilities in high-fidelity worlds such as Minecraft, custom simulators, and realistic virtual cityscapes.
1. Foundations: Definition, Formalism, and Environment Classes
Open-world multi-task agents are defined by their ability to pursue a diverse (potentially unbounded) set of tasks within interactive, dynamic environments, without exhaustive pre-encoding of all possible goals or preconditions. Formally, the task space is often parameterized as a family of POMDPs or MDPs indexed by a task identifier τ, with each task defined via a goal specification g (e.g., "craft iron_pickaxe") and an initial environment state s₀ (Yuan et al., 2023).
Key environmental domains include:
- 3D agents in Minecraft (Yuan et al., 2023, Ziliotto et al., 2024, Liu et al., 2024, Wang et al., 2023, Wang et al., 2023), offering compositional object manipulation, crafting, and navigation.
- Physics-based simulators (e.g., SimWorld (Ren et al., 30 Nov 2025), decentralized open-ended meta-RL (Bornemann et al., 2023)) supporting procedural generation, complex social scenarios, and physical reasoning.
- Benchmarks designed for diagnostic evaluation, such as MineNPC-Task (Doss et al., 8 Jan 2026) and Polycraft World AI Lab (Goss et al., 2023).
The core problem is to design a policy (or policy family) that, given any task specification and observed world state, generates actions that maximize cumulative return under extreme diversity and uncertainty.
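The task-conditioned formulation above can be sketched as a minimal interface. This is an illustrative sketch only: the names `Task` and `OpenWorldAgent`, and the placeholder policy, are ours, not from any cited framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    goal: str            # goal specification, e.g. "craft iron_pickaxe"
    initial_state: dict  # partial description of the starting world

class OpenWorldAgent:
    """A task-conditioned policy: action = pi(observation | task)."""

    def act(self, task: Task, observation: dict) -> str:
        # Placeholder policy: approach the goal item if it is visible,
        # otherwise explore. Real agents plan over skill hierarchies.
        goal_item = task.goal.split()[-1]
        if goal_item in observation.get("visible", []):
            return "approach"
        return "explore"

agent = OpenWorldAgent()
task = Task(goal="craft iron_pickaxe", initial_state={})
print(agent.act(task, {"visible": ["iron_pickaxe"]}))  # approach
print(agent.act(task, {"visible": []}))                # explore
```

The same policy object handles any task specification, which is the defining contrast with closed-set agents that hard-code a single goal.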
2. Compositional Skill Libraries and Planning Graphs
Modern open-world agents rely on hierarchical decomposition, where long-horizon tasks are expressed as compositions of parameterizable skills or sub-routines. This architecture is reflected across both RL- and LLM-based agents:
Skill Library Construction and Use:
- Primitive skills include atomic actions such as `mine`, `craft`, `navigate`, and `detect`, each parameterized over objects and contexts (Liu et al., 2024, Ziliotto et al., 2024).
- Compositional skills are recursively built by sequencing and conditioning on preconditions: for example, `mineDiamond` invokes `craftIronPickaxe` if needed, then `mine("diamond_ore")` (Liu et al., 2024).
- Skills can be discovered and trained via RL with tailored intrinsic rewards, including exploration bonuses (e.g., state-count, CLIP similarity) that focus learning on task-relevant behaviors (Yuan et al., 2023).
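A minimal sketch of such precondition-guarded composition, using hypothetical skill functions and a toy inventory (the crafting logic here is deliberately simplified):

```python
def craft_iron_pickaxe(inventory):
    # Primitive skill stub: in a real agent this would itself check
    # and satisfy its own material preconditions.
    inventory["iron_pickaxe"] = inventory.get("iron_pickaxe", 0) + 1
    return inventory

def mine(inventory, block):
    # Primitive skill stub, parameterized over the target block.
    yields = {"diamond_ore": "diamond"}
    item = yields[block]
    inventory[item] = inventory.get(item, 0) + 1
    return inventory

def mine_diamond(inventory):
    # Compositional skill: satisfy the tool precondition first, then act.
    if inventory.get("iron_pickaxe", 0) == 0:
        inventory = craft_iron_pickaxe(inventory)
    return mine(inventory, "diamond_ore")

inv = mine_diamond({})
print(inv)  # {'iron_pickaxe': 1, 'diamond': 1}
```

The precondition check is what makes the skill reusable from arbitrary starting states: the same `mine_diamond` call works whether or not the pickaxe already exists.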
Graph-Based Planning and Dependency Modeling:
- Skill dependencies are encoded as a directed acyclic graph (DAG), where edges represent preconditions or resource flows between skills or subtasks (Yuan et al., 2023, Dong et al., 2024).
- Planning proceeds via graph traversal algorithms (typically backward DFS or similar), constructing an executable sequence that satisfies causal, spatial, and (if present) temporal dependencies (Yuan et al., 2023, Dong et al., 2024).
- Multi-agent extensions explicitly coordinate team assignment across the DAG to minimize bottlenecks, enforce resource separation, and synchronize on shared subgoals (Dong et al., 2024).
This compositionality is central to scaling agents to solve tasks requiring tens to hundreds of sequential and parallel subgoals.
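The backward traversal over a skill DAG can be sketched as a post-order DFS; the dependency table below is a toy example, not drawn from any cited system:

```python
# Edges point from a skill to its preconditions (a DAG).
DEPS = {
    "iron_pickaxe": ["iron_ingot", "stick"],
    "iron_ingot": ["iron_ore"],
    "stick": ["planks"],
    "iron_ore": [],
    "planks": [],
}

def plan(goal, deps, done=None, order=None):
    """Backward DFS from the goal; post-order emission guarantees that
    every skill appears after all of its preconditions."""
    done = set() if done is None else done
    order = [] if order is None else order
    if goal in done:
        return order
    for pre in deps[goal]:       # resolve preconditions first
        plan(pre, deps, done, order)
    done.add(goal)
    order.append(goal)           # post-order: dependencies precede goal
    return order

print(plan("iron_pickaxe", DEPS))
# ['iron_ore', 'iron_ingot', 'planks', 'stick', 'iron_pickaxe']
```

The `done` set deduplicates shared preconditions, so a resource needed by several subgoals is planned only once.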
3. Memory, Adaptation, and Interactive Feedback
As task diversity increases, open-world agents incorporate mechanisms for stateful adaptation, error diagnosis, and interactive re-planning. Memory and feedback mechanisms play a pivotal role:
Episodic and Semantic Memory:
- Agents maintain structured records of past experiences (plans, states, outcomes) and retrieve contextually relevant trajectories during planning (Wang et al., 2023).
- Retrieval-augmented generation (RAG) uses similarity search over past visual and task states to inject helpful priors into the LLM or planner (Wang et al., 2023).
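A hedged sketch of such retrieval: real systems score similarity with learned multimodal embeddings, whereas the token-overlap (Jaccard) score below is a simple stand-in, and the memory entries are invented for illustration.

```python
# Episodic memory: past (task description, plan) pairs.
MEMORY = [
    ("mine iron ore underground",
     ["equip stone_pickaxe", "dig down", "mine iron_ore"]),
    ("craft wooden house",
     ["collect logs", "craft planks", "build walls"]),
]

def jaccard(a, b):
    """Token-overlap similarity between two task descriptions."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def retrieve(task, memory):
    # Return the most similar past episode to inject as a planning prior.
    return max(memory, key=lambda entry: jaccard(task, entry[0]))

best_task, best_plan = retrieve("mine iron ore near the cave", MEMORY)
print(best_plan)  # prior plan from the most similar past task
```

Swapping `jaccard` for a cosine similarity over vision-language embeddings recovers the RAG setup described above without changing the retrieval loop.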
Interactive Planning and Self-Improvement:
- Systems such as DEPS (Wang et al., 2023) and JARVIS-1 (Wang et al., 2023) employ a continual feedback loop: upon subgoal failure, the agent describes current state, explains the failure ("Describe, Explain"), prompts replanning ("Plan"), and adaptively selects among candidate goals ("Select").
- Fine-grained, failure-aware operation memory records successes and failures at the (task, operation) granularity, informing both retry strategies and revision of skill graphs (Lee et al., 30 May 2025).
- Self-instruct curricula and active knowledge base expansion (as in LASP (Chen et al., 2024)) incrementally grow agent competence by integrating error-driven learning of missing preconditions and effects.
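The describe-explain-plan-select loop above can be sketched with the LLM calls replaced by deterministic stubs; all function names and the toy dependency table are hypothetical, not DEPS's actual implementation.

```python
def execute(steps, inventory):
    """Run steps; return the first unmet precondition, or None on success."""
    needs = {"craft_pickaxe": "stick"}  # toy dependency table
    for step in steps:
        pre = needs.get(step)
        if pre and pre not in inventory:
            return pre  # "Describe": report what failed and why
        inventory.add(step.replace("craft_", "").replace("gather_", ""))
    return None

def feedback_loop(goal, inventory, max_repairs=3):
    plan = [goal]
    for _ in range(max_repairs):     # bounded repair attempts
        failure = execute(plan, inventory)
        if failure is None:
            return plan              # success
        # "Explain" + "Plan": prepend a step satisfying the precondition;
        # with one candidate, "Select" is trivial here.
        plan = [f"gather_{failure}"] + plan
    return None                      # gave up after repeated failures

print(feedback_loop("craft_pickaxe", set()))
# ['gather_stick', 'craft_pickaxe']
```

In the real systems the explain and select steps are LLM calls ranking multiple candidate plans; the control flow, however, follows this same bounded retry structure.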
Mixed-Initiative and Human-in-the-Loop Protocols:
- Interactive benchmarks (MineNPC-Task (Doss et al., 8 Jan 2026)) formalize task templates, clarify ambiguous parameters through targeted queries, and support bounded repair attempts, emulating robust human-agent mixed-initiative collaboration.
4. Robustness, Autonomy, and Dependency Learning
Robust open-world operation is challenged by incomplete or hallucinated knowledge; recent frameworks have introduced explicit mechanisms for dependency learning and revision:
| Approach | Mechanism | Advances |
|---|---|---|
| REPOA (Lee et al., 30 May 2025) | Adaptive Dependency Learning, FFOM, DEX | Learning skill/item dependency graphs from scratch; robust to LLM hallucinations and sample-efficient |
| LASP (Chen et al., 2024) | LLM-Augmented Replanning | Diagnoses execution errors and integrates newly discovered preconditions from LLM suggestions |
| Plan4MC (Yuan et al., 2023) | LLM-extracted skill graph + RL | Static extraction of skill graph from LLM, eliminating runtime hallucinations |
| Odyssey (Liu et al., 2024) | Fine-tuned LLM + skill retrieval | Actor–planner–critic LLM loop with semantic embedding-based skill selection |
These systems decouple unreliable external knowledge sources from empirical learning and enable online graph correction as missing dependencies or incorrect plans are detected. REPOA, in particular, introduces RevisionByAnalogy updates when repeated failures occur, triggering graph reconfiguration by analogy to similar, well-explored items (Lee et al., 30 May 2025).
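An illustrative sketch (not REPOA's actual algorithm) of failure-triggered revision by analogy over a learned dependency graph; the class, threshold, and item names are all ours:

```python
from collections import defaultdict

class DependencyLearner:
    """Maintains believed prerequisites per item; after repeated failures,
    adopts the dependencies of an analogous, well-explored item."""

    def __init__(self):
        self.deps = defaultdict(set)      # item -> believed prerequisites
        self.failures = defaultdict(int)  # item -> consecutive failures

    def record_failure(self, item, analogs, threshold=3):
        self.failures[item] += 1
        if self.failures[item] >= threshold and analogs:
            # Revise by analogy: copy dependencies of the closest analog.
            self.deps[item] = set(self.deps[analogs[0]])
            self.failures[item] = 0

learner = DependencyLearner()
learner.deps["iron_pickaxe"] = {"iron_ingot", "stick"}
for _ in range(3):
    learner.record_failure("gold_pickaxe", analogs=["iron_pickaxe"])
print(sorted(learner.deps["gold_pickaxe"]))  # ['iron_ingot', 'stick']
```

The key property this illustrates is that revision only fires after repeated empirical failure, so a single noisy episode cannot overwrite a believed dependency.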
5. Multi-Agent Collaboration and Organizational Structure
Complex open-world task domains increasingly require multi-agent solutions, where coordination and division of labor become critical:
Graph-Based Coordination:
VillagerAgent models collaboration as assignment over a dynamically expanded DAG, with an LLM-based Controller distributing subtasks to agents according to dependency resolution, spatial deconfliction, and load-balancing heuristics. This reduces hallucinations, improves parallelism, and scales up to complex tasks with severe inter-agent dependencies (e.g., synchronized construction) (Dong et al., 2024).
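A rough sketch of ready-set assignment over such a DAG; the least-loaded heuristic below is illustrative and not VillagerAgent's exact controller logic:

```python
def assign_round(dag, completed, loads):
    """Assign each ready subtask (all dependencies completed) to the
    currently least-loaded agent."""
    ready = [t for t, pre in dag.items()
             if t not in completed and all(p in completed for p in pre)]
    plan = {}
    for task in sorted(ready):             # deterministic assignment order
        agent = min(loads, key=loads.get)  # load-balancing heuristic
        plan[task] = agent
        loads[agent] += 1
    return plan

# Toy construction DAG: task -> prerequisite tasks.
dag = {"walls": ["foundation"], "roof": ["walls"],
       "foundation": [], "garden": []}
loads = {"alice": 0, "bob": 0}
plan = assign_round(dag, completed=set(), loads=loads)
print(plan)  # {'foundation': 'alice', 'garden': 'bob'}
```

Repeating the round as tasks complete unlocks `walls` and then `roof`, so independent subtasks run in parallel while dependent ones wait, which is the parallelism gain attributed to DAG-based controllers above.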
Hierarchical and Self-Organizing Structures:
S-Agents employs a directed tree ("Tree of Agents") structure, an hourglass information bottleneck for progress monitoring and hierarchical planning, and fully asynchronous, non-obstructive collaboration to avoid round-based synchronization delays (Chen et al., 2024). This enables resilience, scalability, and recovery from partial failures.
Open-Ended Learning with Multiple Agents:
Decentralized meta-RL agents, trained over procedurally generated task trees with staged reward, exhibit emergent collective exploration, role division, and strong transfer to novel objects and deep task trees, even without explicit centralized coordination (Bornemann et al., 2023).
6. Evaluation Frameworks, Benchmarks, and Empirical Results
Progress in open-world multi-task agents is measured via a combination of tailored datasets, scenario generators, and diagnostic metrics:
Benchmark Suites:
- Polycraft World AI Lab (PAL) (Goss et al., 2023) and MineNPC-Task (Doss et al., 8 Jan 2026) offer extensive scenario scripting, bounded-knowledge evaluation, and detailed logging for planning, skill transfer, and lifelong learning.
- Odyssey (Liu et al., 2024) defines long-term planning, dynamic-immediate planning, and autonomous exploration tasks, each with dedicated success, diversity, and efficiency metrics.
Key Empirical Findings:
- Odyssey's integrated LLM skill library and retrieval improve success rates on tasks like "ObtainDiamond" to 92.5% within 10 minutes, outperforming prior LLM agents (DEPS: 0.6–67.5% depending on task) (Liu et al., 2024).
- JARVIS-1's memory-augmented multimodal LLM achieves over 6% success on "ObtainDiamondPickaxe" within 20 minutes, versus ≈2.5% for prior LLM planners (Wang et al., 2023).
- REPOA, even with learned rather than oracle dependencies, achieves an average success rate of 0.54 across 67 open-world tasks, surpassing all prior methods that use oracle graphs (Lee et al., 30 May 2025).
- Multi-agent frameworks like VillagerAgent achieve up to 3× improvements in completion rates and efficiency over baselines for highly interdependent construction and process tasks in Minecraft (Dong et al., 2024).
- S-Agents demonstrates that fully asynchronous, tree-organized teams outperform linear or cyclic organizations in both time-to-completion and mean prompt usage for collaborative building/resource-collection tasks (Chen et al., 2024).
7. Future Directions and Open Challenges
Current systems highlight numerous open challenges and trajectories for research:
- Lifelong learning and memory scalability: Efficient memory management, cross-task generalization, and memory persistence across sessions remain open issues (Wang et al., 2023, Doss et al., 8 Jan 2026).
- Robust autonomy: Improving LLM-based plan validation, symbolic knowledge verification, and safe handling of incorrect knowledge injection (Chen et al., 2024, Lee et al., 30 May 2025).
- Multi-agent coordination at scale: Formalizing explicit role negotiation, robust handling of heterogeneous agents, and dynamic team restructuring for very large groups (Dong et al., 2024, Chen et al., 2024).
- Sim-to-Real and Social Reasoning: Adapting open-world agent models to high-fidelity, physically and socially realistic simulators such as SimWorld, involving naturalistic language-driven scene editing, multimodal interaction, and strategic multi-agent negotiation tasks (Ren et al., 30 Nov 2025).
- Unified evaluation protocols: Standardizing task definitions, bounded-knowledge constraints, and subtask-level metrics for fair, reproducible benchmarking (Doss et al., 8 Jan 2026, Goss et al., 2023).
The ongoing integration of hierarchical skill libraries, robust memory architectures, graph-based coordination, and adaptive, interactive planning loops continues to drive progress towards truly generalist open-world multi-task agents.