- The paper introduces a novel multi-agent LLM framework that enhances collaborative embodied reasoning using the Mindcraft platform and the MineCollab benchmark.
- It details a sophisticated system incorporating 47 parameterized tools and a Retrieval Augmented Generation system for effective agent communication and resource management.
- Experimental results highlight that achieving task success relies on strategic planning and effective inter-agent communication, underscoring the need for advanced coordination methods.
Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning
The paper "Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning" introduces a novel approach to collaborative AI agent design, focusing on embodied reasoning tasks in the open-world game Minecraft. The framework rests on two main contributions, the Mindcraft platform and the MineCollab benchmark, both aimed at enhancing the collaborative abilities of LLM agents.
Introduction and Problem Statement
The research highlights the intrinsic complexity of multi-agent collaboration, particularly in contexts requiring both embodied reasoning and linguistic communication. Key challenges include developing agents that can effectively share knowledge, manage resources, and adapt plans based on real-time environmental feedback. Existing LLM agents falter in environments demanding sequential decision-making and natural language interaction, indicating the necessity for new methodologies beyond imitation learning.
Mindcraft is a versatile platform that lets LLM agents simulate multi-agent collaboration within Minecraft. It provides an efficient infrastructure for task-based interaction via an extensive library of high-level tools, enhancing LLM capabilities in partially observable environments.
Figure 1: Overview of the Mindcraft workflow. A user or task configuration (left) provides instructions (e.g., “Build a house out of nearby materials”). The Agent (center) takes these instructions, consults an LLM (via a model request) and invokes high-level commands/tools. These commands are then executed in the Minecraft environment (right), with the agent receiving feedback through execution logs.
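The loop described in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: `FakeWorld`, `fake_llm`, `run_agent`, the `DONE` stop signal, and the history format are all hypothetical stand-ins, and only the command names `!craftItem` and `!givePlayer` come from the text.

```python
from typing import Callable, List


class FakeWorld:
    """Hypothetical stand-in for the Minecraft environment."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def execute(self, command: str) -> str:
        # A real environment would act in the game; here we only record the command.
        self.log.append(command)
        return f"executed {command}"


def fake_llm(history: List[str]) -> str:
    """Hypothetical model: issue two scripted commands, then signal completion."""
    issued = sum(1 for line in history if line.startswith("agent:"))
    script = ["!craftItem", "!givePlayer"]
    return script[issued] if issued < len(script) else "DONE"


def run_agent(instruction: str, llm: Callable[[List[str]], str],
              world: FakeWorld, max_steps: int = 10) -> List[str]:
    """Alternate model requests and command execution, feeding logs back in."""
    history = [f"user: {instruction}"]
    for _ in range(max_steps):
        reply = llm(history)
        if reply == "DONE":  # assumed stop signal, not part of the source
            break
        history.append(f"agent: {reply}")
        feedback = world.execute(reply)
        history.append(f"env: {feedback}")  # execution log returned to the agent
    return history
```

Each iteration appends both the issued command and the environment's execution log to the history, so the next model request sees the outcome of the previous action.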
State and Action Spaces
The platform's state space is designed for effective interaction with the environment, leveraging a tool-calling approach for observation retrieval, which simplifies context management. The action space involves a set of 47 parameterized tools, allowing LLMs to perform complex tasks through abstract commands like !givePlayer and !craftItem.
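To make the idea of parameterized tools concrete, here is a small sketch of parsing and dispatching a command string. The argument syntax (`!name(arg, ...)`), the handler signatures, and the return strings are assumptions made for illustration; the source only names the commands `!givePlayer` and `!craftItem`.

```python
import ast
import re
from typing import Any, Callable, Dict, Tuple

# Assumed syntax: "!" + tool name + optional parenthesized literal arguments.
COMMAND_RE = re.compile(r"^!(\w+)(?:\((.*)\))?$")


def parse_command(text: str) -> Tuple[str, Tuple[Any, ...]]:
    """Split a command string into a tool name and a tuple of arguments."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a tool call: {text!r}")
    name, raw_args = match.group(1), match.group(2)
    if not raw_args:
        return name, ()
    # Wrap in a tuple literal so literal_eval handles one or many arguments.
    return name, ast.literal_eval(f"({raw_args},)")


def craft_item(item: str, count: int) -> str:
    # Hypothetical handler; a real tool would act in the game world.
    return f"crafted {count} x {item}"


def give_player(player: str, item: str, count: int) -> str:
    # Hypothetical handler; a real tool would transfer items between agents.
    return f"gave {count} x {item} to {player}"


TOOLS: Dict[str, Callable[..., str]] = {
    "craftItem": craft_item,
    "givePlayer": give_player,
}


def dispatch(text: str) -> str:
    """Look up the named tool and call it with the parsed arguments."""
    name, args = parse_command(text)
    if name not in TOOLS:
        return f"unknown command: {name}"
    return TOOLS[name](*args)
```

Returning a message for unknown commands, rather than raising, mirrors the idea that the agent receives textual feedback it can react to on the next step.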
Agent Architecture
Mindcraft integrates flexible modules for managing actions, environment observations, and inter-agent dialogue. This setup supports rigorous experimentation with collaboration, enhancing communication and resource management through a Retrieval Augmented Generation (RAG) system to effectively utilize past experiences.
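The retrieval step of a RAG system can be illustrated with a toy example. The summary does not say how Mindcraft embeds or ranks past experiences, so this sketch swaps in a simple bag-of-words cosine similarity; `embed`, `cosine`, `retrieve`, and the memory strings are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy embedding: lowercase word counts (a real system would likely use a learned model).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, memory: List[str], k: int = 2) -> List[str]:
    """Return the k stored experiences most similar to the query."""
    q = embed(query)
    ranked = sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]
```

The retrieved entries would then be placed in the model's prompt so that relevant past experiences inform the next action.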
MineCollab: Collaborative Task Benchmark
MineCollab is a benchmark suite within the Mindcraft platform that consists of three collaborative tasks: cooking, crafting, and construction. These tasks are structured to require long-term strategic planning, shared goals, and resource allocation.
Figure 2: Task suites and challenges. In this figure, we see the collaborative and embodied reasoning challenges displayed. In the cooking and crafting tasks, the agents need to delegate tasks, share resources and use embodied planning to manipulate the world of Minecraft. In the construction tasks, the agents need to navigate and coordinate in the space to ensure they consistently build towards their objective without undoing any progress the other agents have made. All together these tasks comprehensively test collaborative and embodied reasoning.
Task Design
Tasks are procedurally generated, involving complex interactions such as ingredient collection for cooking, material sharing for crafting, and blueprint adherence for construction. Success in these tasks is measured by either completing the prescribed goal or achieving minimal edit distance from the desired state.
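As a rough illustration of an edit-distance measure for construction tasks, the sketch below counts the block positions where a built structure differs from its blueprint. The representation (a mapping from coordinates to block names) and the exact definition are assumptions; the summary does not specify how the benchmark computes this distance.

```python
from typing import Dict, Tuple

Position = Tuple[int, int, int]
Blocks = Dict[Position, str]


def edit_distance(built: Blocks, blueprint: Blocks) -> int:
    """Count positions that need a block placed, removed, or replaced."""
    positions = built.keys() | blueprint.keys()
    # A missing key yields None, so absent and extra blocks both count as edits.
    return sum(1 for p in positions if built.get(p) != blueprint.get(p))


blueprint = {(0, 0, 0): "stone", (1, 0, 0): "stone", (0, 1, 0): "oak_planks"}
built = {(0, 0, 0): "stone", (1, 0, 0): "dirt", (2, 0, 0): "stone"}

# One wrong block, one missing block, one extra block.
distance = edit_distance(built, blueprint)
```

A distance of zero means the structure matches the blueprint exactly; larger values indicate how many single-block changes separate the two.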
Dataset Creation
A dataset derived from 2,000 trials with the LLaMA-70B model includes successful runs and examples, which supports generating supervised fine-tuning (SFT) data for underperforming models. The platform also makes it easy to create additional high-quality data.
Figure 3: Cooking number of agents ablation.
Experimental Evaluation
Experiments conducted using the MineCollab tasks indicate that current state-of-the-art LLMs struggle with both embodied reasoning and communication efficacy. Performance metrics reveal that task success is heavily reliant on communication quality and strategic resource management.
Complexity Analysis
Analysis of task complexity shows a decline in success rates with more agents (due to coordination overhead) and with increased task complexity (due to horizon length and resource management challenges). The results demonstrate the inadequacy of standard LLM techniques, urging advancements in multi-agent coordination frameworks.
Conclusions
The framework presented in this paper marks a substantial advance in multi-agent collaboration in embodied AI environments. The findings emphasize the need for more sophisticated methods of agent communication and coordination. The work also lays a foundation for exploring the intersection of natural language processing and embodied AI, and for developing more communicative and adaptable AI agents.