
Open-World Multi-Task Agents

Updated 21 February 2026
  • Open-world multi-task agents are autonomous systems that tackle an unbounded variety of tasks in dynamic, partially known environments.
  • They employ hierarchical skill libraries and graph-based planning to decompose long-horizon objectives into manageable sub-tasks.
  • Recent frameworks show enhanced performance via adaptive memory, multi-agent collaboration, and robust dependency learning in simulated benchmarks.

Open-world multi-task agents are autonomous systems designed to perceive, plan, and act across a vast—and often growing—space of tasks and environments, including challenges where the full set of goals, dependencies, and environmental contingencies is unknown at design time. Unlike traditional single-task or closed-set agents, open-world multi-task agents must operate under partial knowledge, tackle compositional and long-horizon objectives, leverage diverse forms of memory and adaptation, and often coordinate with other agents or humans. Recent research has produced a variety of architectures and evaluation frameworks benchmarking these capabilities in high-fidelity worlds such as Minecraft, custom simulators, and realistic virtual cityscapes.

1. Foundations: Definition, Formalism, and Environment Classes

Open-world multi-task agents are defined by their ability to pursue a diverse (potentially unbounded) set of tasks T within interactive, dynamic environments E, without exhaustive pre-encoding of all possible goals or preconditions. Formally, the task space is often parameterized as a family of POMDPs or MDPs indexed by task τ ∈ T, with each τ = (g, I) defined via a goal specification g (e.g., "craft iron_pickaxe") and an initial environment state I (Yuan et al., 2023).

Key environmental domains include:

  • Minecraft, the dominant testbed for long-horizon crafting, mining, and construction tasks.
  • Custom open-world simulators and scenario generators, such as Polycraft World AI Lab (Goss et al., 2023).
  • Realistic virtual cityscapes for embodied navigation and interaction.

The core problem is to design a policy (or policy family) π that, given any task specification and observed world state, generates actions a_t that maximize cumulative return under extreme diversity and uncertainty.
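The task formalism above can be sketched in a few lines. This is an illustrative scaffold only; the `Task` and `rollout` names are assumptions for exposition, not an API from any of the cited systems.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Sketch of the formalism: each task τ = (g, I) pairs a goal specification
# with an initial environment state (names here are hypothetical).
@dataclass(frozen=True)
class Task:
    goal: str          # goal specification g, e.g. "craft iron_pickaxe"
    init_state: Any    # initial environment state I

def rollout(policy: Callable[[str, Any], str],
            task: Task,
            step: Callable[[Any, str], Any],
            horizon: int) -> list:
    """Run a task-conditioned policy π(goal, observation) -> action for `horizon` steps."""
    state = task.init_state
    trajectory = []
    for _ in range(horizon):
        action = policy(task.goal, state)   # π conditions on the task specification
        trajectory.append(action)
        state = step(state, action)         # environment transition
    return trajectory
```

The key structural point is that a single policy is conditioned on the task specification itself, rather than one policy being trained per task.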

2. Compositional Skill Libraries and Planning Graphs

Modern open-world agents rely on hierarchical decomposition, where long-horizon tasks are expressed as compositions of parameterizable skills or sub-routines. This architecture is reflected across both RL- and LLM-based agents:

Skill Library Construction and Use:

  • Primitive skills include atomic actions such as mine, craft, navigate, detect, each parameterized over objects and contexts (Liu et al., 2024, Ziliotto et al., 2024).
  • Compositional skills are recursively built by sequencing and conditioning on preconditions—for example, mineDiamond invokes craftIronPickaxe if needed, then mine("diamond_ore") (Liu et al., 2024).
  • Skills can be discovered and trained via RL with tailored intrinsic rewards, including exploration bonuses (e.g., state-count, CLIP similarity) that focus learning on task-relevant behaviors (Yuan et al., 2023).
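The recursive skill composition described above, where a composite skill invokes prerequisite skills only when their effects are missing, can be sketched as follows. Class and skill names are illustrative, not the cited systems' actual interfaces.

```python
# Hypothetical sketch of a compositional skill library: a skill declares
# prerequisite skills and only executes them when their effect is absent,
# mirroring e.g. mineDiamond invoking craftIronPickaxe if needed.
class Skill:
    def __init__(self, name, preconditions=(), body=None):
        self.name = name
        self.preconditions = list(preconditions)         # skills to satisfy first
        self.body = body or (lambda inv: inv.add(name))  # default: add item to inventory

    def execute(self, inventory: set):
        for pre in self.preconditions:
            if pre.name not in inventory:  # run prerequisite only if its effect is missing
                pre.execute(inventory)
        self.body(inventory)

craft_iron_pickaxe = Skill("iron_pickaxe")
mine_diamond = Skill("diamond", preconditions=[craft_iron_pickaxe])

inventory = set()
mine_diamond.execute(inventory)  # recursively crafts the pickaxe, then mines
```

The conditional check before each prerequisite is what makes composition efficient: already-satisfied preconditions are skipped rather than re-executed.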

Graph-Based Planning and Dependency Modeling:

  • Skill dependencies are encoded as a directed acyclic graph G = (V, E), where edges represent preconditions or resource flows between skills or subtasks (Yuan et al., 2023, Dong et al., 2024).
  • Planning proceeds via graph traversal algorithms (typically backward DFS or similar), constructing an executable sequence that satisfies causal, spatial, and (if present) temporal dependencies (Yuan et al., 2023, Dong et al., 2024).
  • Multi-agent extensions explicitly coordinate team assignment across the DAG to minimize bottlenecks, enforce resource separation, and synchronize on shared subgoals (Dong et al., 2024).
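The backward traversal mentioned above can be sketched as a postorder DFS over the dependency DAG, emitting skills so that every precondition precedes the skill that needs it. The dependency entries below are illustrative, not taken from any cited skill graph.

```python
# Minimal sketch of graph-based planning over a skill dependency DAG.
# `deps` maps each skill to its preconditions; `have` holds already-satisfied items.
def plan(goal, deps, have):
    """Backward DFS from `goal`, emitting skills in executable (postorder) order."""
    order, visited = [], set()

    def visit(skill):
        if skill in visited or skill in have:
            return
        visited.add(skill)
        for pre in deps.get(skill, []):  # satisfy preconditions first
            visit(pre)
        order.append(skill)              # postorder: skill comes after its prerequisites

    visit(goal)
    return order

deps = {
    "iron_pickaxe": ["iron_ingot", "stick"],
    "iron_ingot": ["iron_ore"],
    "diamond": ["iron_pickaxe"],
}
print(plan("diamond", deps, have={"stick"}))
# → ['iron_ore', 'iron_ingot', 'iron_pickaxe', 'diamond']
```

Because the graph is acyclic, the postorder sequence is guaranteed to satisfy all causal dependencies; spatial and temporal constraints, where present, require additional ordering passes.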

This compositionality is central to scaling agents to solve tasks requiring tens to hundreds of sequential and parallel subgoals.

3. Memory, Adaptation, and Interactive Feedback

As task diversity increases, open-world agents incorporate mechanisms for stateful adaptation, error diagnosis, and interactive re-planning. Memory and feedback mechanisms play a pivotal role:

Episodic and Semantic Memory:

  • Memory-augmented agents such as JARVIS-1 (Wang et al., 2023) store past plans, observations, and outcomes, retrieving relevant episodes to condition new plans and improve reliability on long-horizon tasks.

Interactive Planning and Self-Improvement:

  • Systems such as DEPS (Wang et al., 2023) and JARVIS-1 (Wang et al., 2023) employ a continual feedback loop: upon subgoal failure, the agent describes current state, explains the failure ("Describe, Explain"), prompts replanning ("Plan"), and adaptively selects among candidate goals ("Select").
  • Fine-grained, failure-aware operation memory records successes and failures at the (task, operation) granularity, informing both retry strategies and revision of skill graphs (Lee et al., 30 May 2025).
  • Self-instruct curricula and active knowledge base expansion (as in LASP (Chen et al., 2024)) incrementally grow agent competence by integrating error-driven learning of missing preconditions and effects.
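The Describe, Explain, Plan, Select loop sketched above can be expressed as a simple control skeleton. This is a hedged reconstruction of the loop's structure, not the actual DEPS or JARVIS-1 code; all function names are placeholders.

```python
# Illustrative skeleton of a DEPS-style feedback loop (placeholder names):
# on subgoal failure, the agent describes the state, explains the failure,
# replans, and selects the next candidate goal.
def interactive_loop(goals, execute, describe, explain, replan, select, max_rounds=5):
    plan = list(goals)
    for _ in range(max_rounds):
        goal = select(plan)              # "Select": choose among candidate goals
        ok, state = execute(goal)
        if ok:
            plan.remove(goal)
            if not plan:
                return True              # all subgoals achieved
        else:
            report = describe(state)     # "Describe": summarize current world state
            reason = explain(goal, report)  # "Explain": diagnose why the subgoal failed
            plan = replan(plan, reason)  # "Plan": revise remaining subgoals
    return False                         # budget exhausted
```

The bounded `max_rounds` reflects the bounded-repair protocols used by interactive benchmarks: the agent gets a finite number of diagnose-and-retry attempts per task.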

Mixed-Initiative and Human-in-the-Loop Protocols:

  • Interactive benchmarks such as MineNPC-Task (Doss et al., 8 Jan 2026) formalize task templates, clarify ambiguous parameters through targeted queries, and support bounded repair attempts, emulating robust human-agent mixed-initiative collaboration.

4. Robustness, Autonomy, and Dependency Learning

Robust open-world operation is challenged by incomplete or hallucinated knowledge; recent frameworks have introduced explicit mechanisms for dependency learning and revision:

  • REPOA (Lee et al., 30 May 2025): adaptive dependency learning with FFOM and DEX. Learns skill/item dependency graphs from scratch, remaining robust to LLM hallucinations and sample-efficient.
  • LASP (Chen et al., 2024): LLM-augmented replanning. Diagnoses execution errors and integrates newly discovered preconditions from LLM suggestions.
  • Plan4MC (Yuan et al., 2023): LLM-extracted skill graph plus RL. Static extraction of the skill graph from an LLM eliminates runtime hallucinations.
  • Odyssey (Liu et al., 2024): fine-tuned LLM with skill retrieval. An actor–planner–critic LLM loop with semantic embedding-based skill selection.

These systems decouple (unreliable) external knowledge sources from empirical learning and enable on-line graph correction as missing dependencies or incorrect plans are detected. REPOA, in particular, introduces RevisionByAnalogy updates when repeated failures occur, triggering graph reconfiguration by analogy to similar well-explored items (Lee et al., 30 May 2025).
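The decoupling of external knowledge from empirical learning, and the revision-by-analogy trigger, can be sketched as follows. This is an illustrative reconstruction of the idea, not REPOA's actual algorithm; the class, field names, and threshold are assumptions.

```python
from collections import defaultdict

# Hypothetical sketch of failure-aware dependency learning: failures are
# recorded at (task, operation) granularity, observed prerequisites are added
# empirically, and repeated failures trigger revision by analogy to a
# similar, well-explored item.
class DependencyLearner:
    def __init__(self, failure_threshold=3):
        self.deps = defaultdict(set)       # item -> empirically learned prerequisites
        self.failures = defaultdict(int)   # (task, operation) -> consecutive failures
        self.threshold = failure_threshold

    def record(self, task, op, success, missing=None):
        if success:
            self.failures[(task, op)] = 0  # reset on success
        else:
            self.failures[(task, op)] += 1
            if missing:                    # prerequisite observed during the failure
                self.deps[task].add(missing)

    def revise_by_analogy(self, task, op, analog):
        """After repeated failures, borrow the dependency set of a similar item."""
        if self.failures[(task, op)] >= self.threshold:
            self.deps[task] |= self.deps[analog]
            return True
        return False
```

The point of the threshold is to keep unreliable (e.g., LLM-suggested) edges out of the graph until repeated empirical failures justify a revision.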

5. Multi-Agent Collaboration and Organizational Structure

Complex open-world task domains increasingly require multi-agent solutions, where coordination and division of labor become critical:

Graph-Based Coordination:

VillagerAgent models collaboration as assignment over a dynamically expanded DAG, with an LLM-based Controller distributing subtasks to agents according to dependency resolution, spatial deconfliction, and load-balancing heuristics. This reduces hallucinations, improves parallelism, and scales up to complex tasks with severe inter-agent dependencies (e.g., synchronized construction) (Dong et al., 2024).
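The controller's assignment step can be sketched as: among subtasks whose DAG dependencies are resolved, dispatch each to the least-loaded agent. This is a minimal sketch in the spirit of the description above, assuming a simple load-balancing heuristic; it omits the spatial-deconfliction and synchronization logic.

```python
# Hedged sketch of DAG-based multi-agent subtask assignment (heuristics and
# names are illustrative, not VillagerAgent's actual controller).
def assign_ready_subtasks(dag, done, agent_loads):
    """dag: subtask -> set of prerequisite subtasks; mutates agent_loads.

    Returns {subtask: agent} for every subtask whose dependencies are resolved.
    """
    ready = [t for t, pre in dag.items() if t not in done and pre <= done]
    assignment = {}
    for task in ready:
        agent = min(agent_loads, key=agent_loads.get)  # least-loaded agent first
        assignment[task] = agent
        agent_loads[agent] += 1                        # account for the new work
    return assignment
```

Re-running this step as subtasks complete (growing `done`) yields wavefront-style parallelism over the DAG: independent subtasks run concurrently while dependent ones wait for their prerequisites.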

Hierarchical and Self-Organizing Structures:

S-Agents employs a directed tree ("Tree of Agents") structure, an hourglass information bottleneck for progress monitoring and hierarchical planning, and fully asynchronous, non-obstructive collaboration to avoid round-based synchronization delays (Chen et al., 2024). This enables resilience, scalability, and recovery from partial failures.

Open-Ended Learning with Multiple Agents:

Decentralized meta-RL agents, trained over procedurally generated task trees with staged reward, exhibit emergent collective exploration, role division, and strong transfer to novel objects and deep task trees, even without explicit centralized coordination (Bornemann et al., 2023).

6. Evaluation Frameworks, Benchmarks, and Empirical Results

Progress in open-world multi-task agents is measured via a combination of tailored datasets, scenario generators, and diagnostic metrics:

Benchmark Suites:

  • Polycraft World AI Lab (PAL) (Goss et al., 2023) and MineNPC-Task (Doss et al., 8 Jan 2026) offer extensive scenario scripting, bounded-knowledge evaluation, and detailed logging for planning, skill transfer, and lifelong learning.
  • Odyssey (Liu et al., 2024) defines long-term planning, dynamic-immediate planning, and autonomous exploration tasks, each with dedicated success, diversity, and efficiency metrics.

Key Empirical Findings:

  • Odyssey's integrated LLM skill library and retrieval improves success rates on tasks like "ObtainDiamond" to 92.5% @10 minutes, outperforming prior LLM agents (DEPS: 0.6–67.5% depending on task) (Liu et al., 2024).
  • JARVIS-1's memory-augmented multimodal LLM demonstrates over 6% reliability on "ObtainDiamondPickaxe" in 20 min versus ≈2.5% for prior LLM planners (Wang et al., 2023).
  • REPOA, even with learned rather than oracle dependencies, achieves 0.54 average SR across 67 open-world tasks, surpassing all prior methods using oracle graphs (Lee et al., 30 May 2025).
  • Multi-agent frameworks like VillagerAgent achieve up to 3× improvements in completion rates and efficiency over baselines for highly interdependent construction and process tasks in Minecraft (Dong et al., 2024).
  • S-Agents demonstrates that fully asynchronous, tree-organized teams outperform linear or cyclic organizations in both time-to-completion and mean prompt usage for collaborative building/resource-collection tasks (Chen et al., 2024).

7. Future Directions and Open Challenges

Current systems highlight numerous open challenges and trajectories for research.

The ongoing integration of hierarchical skill libraries, robust memory architectures, graph-based coordination, and adaptive, interactive planning loops continues to drive progress towards truly generalist open-world multi-task agents.
