JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

Published 10 Nov 2023 in cs.AI (arXiv:2311.05997v3)

Abstract: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal LLMs, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is the existing most general agent in Minecraft, capable of completing over 200 different tasks using control and observation space similar to humans. These tasks range from short-horizon tasks, e.g., "chopping trees" to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well in short-horizon tasks, achieving nearly perfect performance. In the classic long-term task of $\texttt{ObtainDiamondPickaxe}$, JARVIS-1 surpasses the reliability of current state-of-the-art agents by 5 times and can successfully complete longer-horizon and more challenging tasks. The project page is available at https://craftjarvis.org/JARVIS-1


Summary

  • The paper introduces JARVIS-1, an open-world agent built on a memory-augmented multimodal LLM that plans and executes open-world tasks in Minecraft.
  • It employs interactive planning with self-check mechanisms and environment feedback to robustly handle long-horizon, complex challenges.
  • Experimental results show that JARVIS-1 outperforms GPT-based and VPT models, achieving higher task success rates and improved efficiency.

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal LLMs

Introduction

The paper presents JARVIS-1, an advanced open-world agent designed for the complex and expansive universe of Minecraft. JARVIS-1 builds on recent advances in multimodal LLMs (MLMs), leveraging them to perform elaborate planning and motor control from multimodal inputs. Its main distinguishing feature is a memory-augmented architecture that stores past experiences to improve planning effectiveness, particularly on long-horizon tasks.

Figure 1: An illustration of how JARVIS-1 navigates the technology tree in Minecraft, outperforming previous models.

Key Challenges in Open-World Environments

JARVIS-1 addresses several key challenges inherent to open-world environments:

  1. Situation-Aware Planning: Effective navigation and task accomplishment in open-world environments like Minecraft require agents to plan from their current situation. JARVIS-1 improves its success on complex tasks by continuously adapting its plans to dynamically changing conditions and resources.
  2. Task Complexity: The vast range of tasks, from simple to highly complex, demands precise long-term planning. JARVIS-1 demonstrates superior performance on tasks like mining diamonds, showcasing the benefits of interactive, iterative planning.
  3. Lifelong Learning: To cope with the high diversity and potentially infinite number of tasks, JARVIS-1 uses its multimodal memory to store and leverage experiences, enabling effective lifelong learning and adaptability without continuous retraining.

    Figure 2: JARVIS-1's strategies to overcome various open-world challenges, including situation-aware and interactive planning.

Architecture of JARVIS-1

JARVIS-1's capability rests on an architecture that integrates an MLM with a memory module:

  • Multimodal LLM: This model processes a combination of visual observations and textual instructions to generate detailed plans for the agent's tasks.
  • Memory-Augmented System: The memory module enhances the agent's planning by storing and retrieving past successful plans and experiences, guiding the MLM's decision-making.
  • Self-Improving Mechanism: JARVIS-1 employs a self-instruct mechanism for task generation, allowing it to autonomously propose and learn from new tasks.

    Figure 3: The architecture of JARVIS-1, showcasing its memory-augmented MLM framework and self-improvement strategies.
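The memory-augmented planning idea above can be sketched in a few lines. This is an illustrative toy, not the JARVIS-1 implementation: the class and function names are hypothetical, the "situation" is reduced to an inventory set, and a simple overlap score stands in for the paper's multimodal retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    task: str
    situation: frozenset   # e.g. items in the inventory when the plan worked
    plan: list             # ordered sub-goals that previously succeeded

@dataclass
class MultimodalMemory:
    entries: list = field(default_factory=list)

    def store(self, task, situation, plan):
        # Only successful plans are written back, so the memory's
        # quality grows as the agent accumulates game experience.
        self.entries.append(Entry(task, frozenset(situation), plan))

    def retrieve(self, task, situation, k=1):
        # Rank stored experiences by exact task match, then by how much
        # of the stored situation overlaps with the current one.
        def score(e):
            return (e.task == task, len(e.situation & frozenset(situation)))
        return sorted(self.entries, key=score, reverse=True)[:k]

memory = MultimodalMemory()
memory.store("craft_wooden_pickaxe", {"log"},
             ["craft planks", "craft sticks", "craft pickaxe"])
best = memory.retrieve("craft_wooden_pickaxe", {"log", "stone"})[0]
```

Retrieved entries would then be prepended to the MLM's prompt as in-context examples, which is how stored experience can guide planning without any retraining.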

Interactive Planning and Execution

A significant feature of JARVIS-1 is its interactive planning capability:

  • Self-Check and Explain: JARVIS-1 refines its plans using a self-check mechanism that simulates execution steps to identify potential failures before acting. During execution, it uses environment feedback to adjust its plans, reducing error-recovery time.
  • Query Generation for Memory Retrieval: By generating queries from the task instruction and the current situation, JARVIS-1 retrieves relevant memory entries that bolster its planning precision and adaptability.

    Figure 4: Illustration of JARVIS-1's interactive planning process, integrating self-check and environment feedback mechanisms.
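The plan / self-check / re-plan loop described above can be sketched as follows. This is a hedged toy under strong assumptions: `plan` and `self_check` are hard-coded stand-ins for the MLM planner and its simulated execution check, and a single string stands in for structured environment feedback.

```python
def plan(task, feedback=None):
    # Stand-in planner: prepend a gathering step once feedback reports
    # missing materials (the real planner is an MLM conditioned on memory).
    steps = ["craft planks", "craft sticks", "craft wooden_pickaxe"]
    if feedback == "missing log":
        steps = ["collect log"] + steps
    return steps

def self_check(steps, inventory):
    # Simulate the plan against the current inventory before acting,
    # surfacing failures early instead of during execution.
    if "log" not in inventory and "collect log" not in steps:
        return "missing log"
    return None

def run(task, inventory):
    feedback = None
    for _ in range(3):                 # bounded re-planning attempts
        steps = plan(task, feedback)
        feedback = self_check(steps, inventory)
        if feedback is None:
            return steps               # plan passes the self-check
    return None

result = run("craft wooden_pickaxe", inventory=set())
```

The same loop shape applies during execution: feedback then comes from the environment rather than a simulation, and the planner repairs only the failed portion of the plan.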

Performance Evaluation

The experimental results demonstrate JARVIS-1's superiority over existing models such as GPT-based agents and VPT models:

  • Task Success Rates: JARVIS-1 consistently achieves higher success rates across various tasks, notably outperforming baseline models on difficult, long-horizon challenges such as obtaining a diamond pickaxe.
  • Efficiency Gains: Leveraging multimodal memory leads to more efficient planning, requiring fewer re-planning steps and conserving computational resources.

    Figure 5: Comparative success rates in the ObtainDiamondPickaxe task among different models over time.

Conclusion

JARVIS-1 represents a significant step forward in the development of open-world agents. Its integration of an MLM with a rich, experience-based memory system enables highly effective planning and execution in complex domains like Minecraft. By demonstrating enhanced adaptability and precision in task execution, JARVIS-1 sets the stage for future research on generalist agents operating in open-ended, dynamic environments. The implications for AI research are broad, promising improvements in fields that require complex decision-making and lifelong adaptability.
