JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

Published 10 Nov 2023 in cs.AI | (2311.05997v3)

Abstract: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal LLMs, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the existing most general agent in Minecraft, capable of completing over 200 different tasks using control and observation space similar to humans. These tasks range from short-horizon tasks, e.g., "chopping trees" to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well in short-horizon tasks, achieving nearly perfect performance. In the classic long-term task of $\texttt{ObtainDiamondPickaxe}$, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks. The project page is available at https://craftjarvis.org/JARVIS-1

Summary

The paper introduces JARVIS-1, an open-world Minecraft agent that builds on a pre-trained multimodal language model and augments it with a multimodal memory for planning and executing tasks.
It plans interactively, using self-check and environment feedback to handle long-horizon tasks.
Experimental results show that JARVIS-1 achieves higher task success rates than GPT-based and VPT baselines, including a five-fold improvement in reliability on ObtainDiamondPickaxe over prior state-of-the-art agents.

Introduction

The paper presents JARVIS-1, an open-world agent for Minecraft. JARVIS-1 builds on pre-trained multimodal language models (MLMs), which map visual observations and textual instructions to plans; the plans are then dispatched to goal-conditioned controllers that perform the low-level control. Its main distinguishing feature is a multimodal memory that stores past experiences and makes them available to the planner, which helps most on long-horizon tasks.

Figure 1: An illustration of how JARVIS-1 navigates the technology tree in Minecraft, outperforming previous models.

Key Challenges in Open-World Environments

JARVIS-1 addresses three challenges inherent to open-world environments:

Situation-Aware Planning: In an open world such as Minecraft, the right plan depends on the agent's current situation, for example its surroundings and the resources it already holds. JARVIS-1 conditions its plans on that situation and adapts them as conditions change.
Task Complexity: Tasks range from short-horizon ones such as chopping trees to long-horizon ones such as obtaining a diamond pickaxe, and the latter require long, precise plans. JARVIS-1 handles them through interactive, iterative planning.
Life-Long Learning: The number of possible tasks is potentially infinite. JARVIS-1 stores its experiences in a multimodal memory and reuses them in later planning, so it can improve over time without continuous retraining.

Figure 2: JARVIS-1's strategies to overcome various open-world challenges, including situation-aware and interactive planning.

Architecture of JARVIS-1

JARVIS-1 combines an MLM planner with a memory module and a self-improvement mechanism:

Multimodal LLM: The model takes visual observations and textual instructions as input and generates a plan, i.e., a sequence of sub-goals for the agent to carry out.
Memory-Augmented System: The memory module stores past successful plans and experiences and retrieves the relevant ones at planning time, so the MLM can draw on both its pre-trained knowledge and what actually worked in the game.
Self-Improving Mechanisms: JARVIS-1 uses a self-instruction mechanism to generate tasks, which lets it propose new tasks and learn from attempting them without human input.

Figure 3: The architecture of JARVIS-1, showcasing its memory-augmented MLM framework and self-improvement strategies.
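
The store-and-retrieve behaviour of the memory module can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class names `MemoryEntry` and `PlanMemory`, the use of string tags as a stand-in for visual observations, and the Jaccard-overlap scoring are all assumptions made for the example. A real system would more plausibly compare learned text and image embeddings.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class MemoryEntry:
    task: str            # instruction the stored plan was produced for
    situation: Set[str]  # hypothetical stand-in for the visual state, e.g. {"forest", "day"}
    plan: List[str]      # sequence of sub-goals that succeeded


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets, between 0.0 (disjoint) and 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class PlanMemory:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, task: str, situation: Set[str], plan: List[str]) -> None:
        """Store a plan that completed successfully."""
        self.entries.append(MemoryEntry(task, set(situation), list(plan)))

    def retrieve(self, task: str, situation: Set[str], k: int = 1) -> List[MemoryEntry]:
        """Return the k entries most similar to the query (task text + situation)."""
        query_words = set(task.lower().split())
        query_state = set(situation)

        def score(entry: MemoryEntry) -> float:
            text_sim = jaccard(query_words, set(entry.task.lower().split()))
            state_sim = jaccard(query_state, entry.situation)
            return text_sim + state_sim

        return sorted(self.entries, key=score, reverse=True)[:k]


memory = PlanMemory()
memory.add("obtain wooden pickaxe", {"forest", "day"},
           ["log", "planks", "stick", "crafting_table", "wooden_pickaxe"])
memory.add("obtain cooked beef", {"plains", "day"},
           ["beef", "furnace", "cooked_beef"])

# A new task retrieves the most similar past experience as a reference plan.
best = memory.retrieve("obtain stone pickaxe", {"forest", "night"})[0]
print(best.task, best.plan)
```

The retrieved plan would then be placed in the planner's prompt as a reference, which is how stored experience can guide planning without retraining the model.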

Interactive Planning and Execution

JARVIS-1 plans interactively rather than committing to a single fixed plan:

Self-Check and Explain: Before execution, JARVIS-1 checks its plan by simulating the steps and looking for ones that would fail, then revises the plan accordingly. During execution, it uses feedback from the environment to explain failures and adjust the remaining plan.
Query Generation for Memory Retrieval: JARVIS-1 builds a retrieval query from both the task instruction and the current situation, and uses it to fetch the memory entries most relevant to the plan it is about to make.

Figure 4: Illustration of JARVIS-1's interactive planning process, integrating self-check and environment feedback mechanisms.
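
The self-check and feedback loop described above can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the `RECIPES` table, the function names `self_check`, `refine`, and `run`, and the simplification that crafting never consumes items. In JARVIS-1 the checking and revising are done by the language model, not by a lookup table.

```python
# Hypothetical prerequisite table: item -> items that must already be held.
# Items without an entry (e.g. "log") are treated as directly obtainable.
RECIPES = {
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
}


def self_check(plan, inventory):
    """Simulate the plan; return (index, missing_item) for the first failing step, else None."""
    have = set(inventory)
    for i, step in enumerate(plan):
        for need in RECIPES.get(step, []):
            if need not in have:
                return i, need
        have.add(step)  # simplification: items are never consumed
    return None


def refine(plan, inventory, max_rounds=10):
    """Repair the plan by inserting each missing prerequisite before the step that needs it."""
    plan = list(plan)
    for _ in range(max_rounds):
        problem = self_check(plan, inventory)
        if problem is None:
            return plan
        index, missing = problem
        plan.insert(index, missing)
    raise RuntimeError("could not repair plan")


def run(plan, inventory, execute, max_failures=3):
    """Execute step by step; on a failed step, re-plan from the current inventory."""
    have = set(inventory)
    pending = refine(plan, have)
    done, failures = [], 0
    while pending:
        step = pending[0]
        if execute(step):  # environment feedback: did the step succeed?
            have.add(step)
            done.append(step)
            pending.pop(0)
        else:
            failures += 1
            if failures > max_failures:
                raise RuntimeError(f"giving up on {step!r}")
            pending = refine(pending, have)
    return done


attempts = {}


def flaky(step):
    """Toy environment: the first attempt at 'planks' fails, everything else succeeds."""
    attempts[step] = attempts.get(step, 0) + 1
    return not (step == "planks" and attempts[step] == 1)


print(refine(["log", "wooden_pickaxe"], []))
print(run(["wooden_pickaxe"], ["log"], flaky))
```

The design point is that checking happens twice: once on the whole plan before any action is taken, and again whenever the environment reports a failure, so that the remaining plan always starts from the agent's actual state.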

Performance Evaluation

The experiments compare JARVIS-1 with existing agents, including GPT-based planners and VPT models:

Task Success Rates: JARVIS-1 achieves higher success rates across the evaluated tasks and near-perfect performance on short-horizon ones. The gap is largest on long-horizon tasks: on ObtainDiamondPickaxe, it is reported to be five times as reliable as prior state-of-the-art agents.
Efficiency Gains: Planning with multimodal memory requires fewer re-planning steps, which reduces the computation spent per task.

Figure 5: Comparative success rates in the ObtainDiamondPickaxe task among different models over time.

Conclusion

JARVIS-1 is a step forward for open-world agents. Combining an MLM planner with an experience-based multimodal memory lets it plan and act effectively in a complex domain like Minecraft, and its results on long-horizon tasks suggest that memory-augmented, interactive planning is a promising basis for generalist agents in open-ended, dynamic environments. The approach is also relevant to other settings that require complex decision-making and life-long adaptation.