Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

Published 24 Apr 2025 in cs.MA and cs.CL | (2504.17950v1)

Abstract: Collaboration is ubiquitous and essential in day-to-day life -- from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: https://mindcraft-minecollab.github.io/

Abstract PDF Upgrade to Chat

Summary

The paper introduces a novel multi-agent LLM framework that enhances collaborative embodied reasoning using the mind platform and MineCollab benchmark.
It details a sophisticated system incorporating 47 parameterized tools and a Retrieval Augmented Generation system for effective agent communication and resource management.
Experimental results highlight that achieving task success relies on strategic planning and effective inter-agent communication, underscoring the need for advanced coordination methods.

Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

The paper "Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning" introduces a novel approach to collaborative AI agent design, focusing on embodied reasoning tasks in the open-world game Minecraft. This framework is realized through two main contributions: the "mind" platform and the "MineCollab" benchmark, both aimed at enhancing the collaborative abilities of LLM agents.

Introduction and Problem Statement

The research highlights the intrinsic complexity of multi-agent collaboration, particularly in contexts requiring both embodied reasoning and linguistic communication. Key challenges include developing agents that can effectively share knowledge, manage resources, and adapt plans based on real-time environmental feedback. Existing LLM agents falter in environments demanding sequential decision-making and natural language interaction, indicating the necessity for new methodologies beyond imitation learning.

Mindcraft Platform

The "mind" is a versatile platform tailored for LLM agents to simulate multi-agent collaboration within Minecraft. It provides an efficient infrastructure for task-based interaction via an extensive library of high-level tools, enhancing LLM capabilities in partially observable environments.

Figure 1: Overview of the mind workflow. A user or task configuration (left) provides instructions (e.g., “Build a house out of nearby materials”). The Agent (center) takes these instructions, consults an LLM (via a model request) and invokes high-level commands/tools. These commands are then executed in the Minecraft environment (right), with the agent receiving feedback through execution logs.

State and Action Spaces

The platform's state space is designed for effective interaction with the environment, leveraging a tool-calling approach for observation retrieval, which simplifies context management. The action space involves a set of 47 parameterized tools, allowing LLMs to perform complex tasks through abstract commands like !givePlayer and !craftItem.

Agent Architecture

Mindcraft integrates flexible modules for managing actions, environment observations, and inter-agent dialogue. This setup supports rigorous experimentation with collaboration, enhancing communication and resource management through a Retrieval Augmented Generation (RAG) system to effectively utilize past experiences.

MineCollab: Collaborative Task Benchmark

MineCollab is a benchmark suite within the mind platform that consists of three collaborative tasks: cooking, crafting, and construction. These tasks are structured to require long-term strategic planning, shared goals, and resource allocation.

Figure 2: Task suites and challenges. In this figure, we see the collaborative and embodied reasoning challenges displayed. In the cooking and crafting tasks, the agents need to delegate tasks, share resources and use embodied planning to manipulate the world of Minecraft. In the construction tasks, the agents need to navigate and coordinate in the space to ensure they consistently build towards their objective without undoing any progress the other agents have made. All together these tasks comprehensively test collaborative and embodied reasoning.

Task Design

Tasks are procedurally generated, involving complex interactions such as ingredient collection for cooking, material sharing for crafting, and blueprint adherence for construction. Success in these tasks is measured by either completing the prescribed goal or achieving minimal edit distance from the desired state.

Dataset Creation

A dataset derived from 2,000 trials with LLaMA-70B model includes successful runs and examples, facilitating SFT data generation for underperforming models. The platform allows for easy creation of additional high-quality data.

Figure 3: Cooking number of agents ablation.

Experimental Evaluation

Experiments conducted using the MineCollab tasks indicate that current state-of-the-art LLMs struggle with both embodied reasoning and communication efficacy. Performance metrics reveal that task success is heavily reliant on communication quality and strategic resource management.

Complexity Analysis

Analysis of task complexity shows a decline in success rates with more agents (due to coordination overhead) and with increased task complexity (due to horizon length and resource management challenges). The results demonstrate the inadequacy of standard LLM techniques, urging advancements in multi-agent coordination frameworks.

Conclusions

The framework presented in this paper signifies substantial advancement in multi-agent collaboration in embodied AI environments. The findings emphasize the need to develop more sophisticated methods for agent communication and coordination, paving the way for future breakthroughs in collaborative AI systems. This work sets a foundation for exploring the intersection of natural language processing and embodied AI, promoting the development of more communicative and adaptable AI agents.