
MARS Challenge: Multi-Agent Robotics

Updated 2 February 2026
  • Multi-Agent Robotic System Challenge is a benchmarking initiative offering formal task definitions and mathematical models for planning and control in collaborative robotics.
  • It integrates large-scale language and vision models to enable dynamic behavior trees, closed-loop pipelines, and modular coordination across heterogeneous teams.
  • Experimental evaluations use realistic simulation and real-world scenarios with metrics assessing plan accuracy, control performance, and scalability.

The Multi-Agent Robotic System (MARS) Challenge is a competitive, research-focused benchmarking initiative designed to advance the theory and practice of collaborative robotic intelligence. It provides formal testbeds and quantitative analyses of multi-agent planning, control, perception, and dialogue, as exemplified by recent deployments of large language models, multimodal LLMs, and vision-language-action frameworks. The challenge encompasses both high-level planning informed by natural language and environmental context and low-level continuous control across heterogeneous robotic teams, with extensive evaluation of coordination strategies, modular architectures, and resilience to real-world operational constraints (Kang et al., 26 Jan 2026).

1. Formal Task Definition and Mathematical Models

The MARS Challenge comprises two principal tracks, Planning and Control, each given a precise mathematical formulation:

  • Planning Track: Defined by input pairs $(I, V)$, where $I$ is a natural language task instruction and $V$ is a set of visual observations (images). Participants select a subset of robots $S \subseteq \mathcal{R}$ and produce a joint multi-agent plan

$$P = \left\{ a_t^{(r)} : t = 1, \ldots, T,\; r \in S \right\}$$

with each atomic action $a_t^{(r)} = (\mathrm{ActionType},\, \mathrm{TargetObject})$ executed by robot $r$ at time $t$. The state space $\mathcal{S}$ includes agent and object configurations, while actions are drawn from the Cartesian product of primitive action types and object sets. The planning policy $\pi(S, P \mid I, V)$ maps the instruction and visual observations to a robot selection and plan.

  • Control Track: Formulated as a multi-agent decision process

$$\left( \mathcal{S}, \left\{ \mathcal{A}^{(r)} \right\}_{r=1}^{N}, T, R, \gamma \right)$$

where $\mathcal{S}$ is the global scene state, $\mathcal{A}^{(r)} = \mathbb{R}^{d_r}$ is each agent's joint-space action space, $T$ is the simulator-defined transition model, $R$ is a sparse success reward, and $\gamma$ is a discount factor. The joint policy $\pi(\mathbf{a}_t \mid s_t)$ is optimized for expected task success over trajectories.

The challenge leverages environments such as ManiSkill3 (for home-like manipulation and navigation tasks) and incorporates both simulated and real-world scenarios (Kang et al., 26 Jan 2026, Lykov et al., 2023).
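For concreteness, the Planning Track's joint plan $P$ can be represented as timestamped atomic actions grouped by time step. The types and robot names below are illustrative placeholders, not the challenge's official interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomicAction:
    """One atomic action a_t^(r) = (ActionType, TargetObject)."""
    robot: str        # r, drawn from the selected subset S
    t: int            # time step, 1..T
    action_type: str  # hypothetical primitive, e.g. "pick", "navigate"
    target: str       # hypothetical target object, e.g. "cube", "cup"

def make_plan(actions):
    """Group a joint multi-agent plan P by time step for execution."""
    plan = {}
    for a in actions:
        plan.setdefault(a.t, []).append(a)
    return plan

selected = {"arm_1", "arm_2"}  # S ⊆ R, hypothetical robot names
plan = make_plan([
    AtomicAction("arm_1", 1, "pick", "cube"),
    AtomicAction("arm_2", 1, "navigate", "table"),
    AtomicAction("arm_1", 2, "place", "cup"),
])
```

A policy $\pi(S, P \mid I, V)$ would emit both `selected` and `plan` from the instruction and images; here they are constructed by hand for illustration.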

2. Systems Architecture and Coordination Frameworks

Contestants implement multi-agent architectures combining state-of-the-art perception, reasoning, planning, and control, frequently anchored in large-scale pretrained models:

  • LLM-MARS: Employs a transformer-based Falcon 7B backbone with parameter-efficient fine-tuning (LoRA adapters) for separate Behavior Tree (BT) generation and QA tasks. Behavior Trees are defined as rooted, directed trees $(N, E)$ in which each node $n \in N$ has a typed functionality (Sequence, Selector, Action, Condition) and returns its result recursively according to the established BT semantics:

$$\mathrm{result}(\mathrm{Sequence}(n_1, \ldots, n_k)) = \begin{cases} \mathrm{Success}, & \forall i\; \mathrm{result}(n_i) = \mathrm{Success} \\ \mathrm{Failure}, & \exists\, i\; \mathrm{result}(n_i) = \mathrm{Failure} \\ \mathrm{Running}, & \text{otherwise} \end{cases}$$

Adapter switching enables dynamic multimodal flows, alternating between BT synthesis and execution-context QA (Lykov et al., 2023).
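The recursive BT semantics above admit a compact tick-based sketch. This is a minimal illustration under standard behavior-tree conventions, not the LLM-MARS implementation.

```python
SUCCESS, FAILURE, RUNNING = "Success", "Failure", "Running"

def tick_sequence(children):
    """Tick children left to right: stop at the first Failure or
    Running child; report Success only if every child succeeds."""
    for child in children:
        status = child()  # each child node is a callable returning its status
        if status == FAILURE:
            return FAILURE
        if status == RUNNING:
            return RUNNING
    return SUCCESS

def tick_selector(children):
    """Selector (fallback): succeed on the first Success; fail only
    if every child fails."""
    for child in children:
        status = child()
        if status == SUCCESS:
            return SUCCESS
        if status == RUNNING:
            return RUNNING
    return FAILURE
```

Action and Condition nodes would be leaf callables returning one of the three statuses; composing them under these two operators yields the recursive result propagation in the equation above.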

  • MARS-MLLM (Assistive Intelligence): Configured as a closed-loop pipeline across four specialist agents:
    1. Visual Perception Agent: Extracts semantic (CLIP encoding) and spatial (DeepLabV3/SAM instance segmentation) features.
    2. Risk Assessment & Reasoning Agent: Prioritizes hazards via weighted severity and urgency metrics.
    3. Planning Agent: Translates prioritized risks into executable action sequences with feasibility and cost checks.
    4. Evaluation Agent: Scores candidate plans and drives iterative refinement against UX, efficiency, transparency, and ethics criteria. Data flows strictly from perception to evaluation, with feedback loops for plan optimization (Gao et al., 3 Nov 2025).
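The strict perception-to-evaluation data flow with a feedback loop for plan optimization can be sketched as follows; the agent callables are hypothetical stubs standing in for the MLLM-backed specialists.

```python
def run_pipeline(scene, perceive, assess, plan, evaluate,
                 max_iters=3, accept=0.8):
    """Closed-loop flow: perception -> risk assessment -> planning
    -> evaluation, iterating on evaluator feedback until a candidate
    plan scores above an acceptance threshold."""
    features = perceive(scene)                 # Visual Perception Agent
    risks = assess(features)                   # Risk Assessment & Reasoning Agent
    feedback, candidate, score = None, None, 0.0
    for _ in range(max_iters):
        candidate = plan(risks, feedback)      # Planning Agent
        score, feedback = evaluate(candidate)  # Evaluation Agent
        if score >= accept:
            break
    return candidate, score

# Stub agents for demonstration only
result, score = run_pipeline(
    scene="kitchen",
    perceive=lambda s: {"objects": ["stove", "pan"]},
    assess=lambda f: [("stove_left_on", 0.9)],       # (hazard, severity)
    plan=lambda risks, fb: [("turn_off", "stove")],
    evaluate=lambda p: (0.9, "ok"),                  # (score, feedback)
)
```

The key structural point is that risks only flow forward while evaluator feedback flows backward into planning, matching the described feedback loops.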

Other frameworks (e.g., CrewAI) employ hierarchical delegation with manager-worker patterns, strict tool-access boundaries, and structured reporting schemes. Emphasis is placed on transparency, proactive failure recovery, and contextual grounding for resilience (Bai et al., 4 Jun 2025).

3. Experimental Environments and Evaluation Metrics

The MARS Challenge employs complex, diverse environments and rigorous, multi-dimensional scoring rules:

  • Environments:
    • Planning: VIKI-Bench tasks (RoboCasa + ManiSkill3), spanning 2–10-step tasks and heterogeneous robot teams.
    • Control: RoboFactory (ManiSkill3-based), supporting manipulation tasks (“Place Cube in Cup”, “Strike Cube”, “Three Robots Place Shoes”, “Four Robots Stack Cube”) with randomized initial states (Kang et al., 26 Jan 2026).
  • Metrics:
    • Planning Track:

    $$\mathrm{Score} = 0.1\,\mathbf{1}\{\text{correct robot selection}\} + 0.9\,\mathrm{PlanMetric}$$

    where

    $$\mathrm{PlanMetric} = w_1 \cdot \mathrm{PrefixMatch} + w_2 \cdot \mathrm{ExactMatch} + w_3 \cdot \mathrm{TypeConsistency} + w_4 \cdot \mathrm{LengthRatio}$$

    with weights $(w_1, w_2, w_3, w_4) = (0.3, 0.3, 0.2, 0.2)$. The sub-metrics assess plan step accuracy (prefix and exact match), action-type consistency, and length matching, while the indicator term scores robot selection.
    • Control Track: per-task success rate over 100 randomized trials,

    $$\mathrm{SR}_i = \frac{\#\{\text{successful trials}\}}{100}$$

    with the total score averaged across the four tasks.
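The Planning Track scoring rule translates directly into code using the published weights; the sub-metric values in the example below are placeholders, not results from any team.

```python
def plan_metric(prefix_match, exact_match, type_consistency, length_ratio,
                weights=(0.3, 0.3, 0.2, 0.2)):
    """PlanMetric: weighted sum of the four plan-quality sub-metrics,
    each assumed to lie in [0, 1]."""
    w1, w2, w3, w4 = weights
    return (w1 * prefix_match + w2 * exact_match
            + w3 * type_consistency + w4 * length_ratio)

def planning_score(correct_selection, pm):
    """Score = 0.1 * 1{correct robot selection} + 0.9 * PlanMetric."""
    return 0.1 * (1.0 if correct_selection else 0.0) + 0.9 * pm

pm = plan_metric(1.0, 0.5, 1.0, 0.8)   # placeholder sub-metric values
score = planning_score(True, pm)        # -> 0.1 + 0.9 * 0.81 = 0.829
```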

Leaderboard tables enumerate top-performing teams and their scores (Table 1 and Table 2 in (Kang et al., 26 Jan 2026)), demonstrating notable performance differences across manipulation tasks.

  • Specialized metrics:
    • LLM-MARS: average compound-task execution accuracy of 79.28% (>90% for commands with ≤2 subtasks), QA accuracy of 72.8%, and expert-rated QA relevance and informativeness of 4.71/5 and 4.89/5.
    • MARS-MLLM: best average scenario ranking of 1.93 from AI judges and 1.33 from human experts; ablation studies confirm that each pipeline branch is critical (Gao et al., 3 Nov 2025).

4. Methodological Innovations and Participant Solutions

Key algorithmic strategies and frameworks from challenge entries include:

  • Self-Correction Framework (EfficientAI):
    • Seeds model with annotated plans, generates multiple candidates via stochastic VLM sampling, pseudo-labels high-scoring plans, fine-tunes on augmented data, and applies multi-pass voting for consensus at test time.
  • Modular Multi-Agent Planning (TrustPath AI):
    • Segregated modules for agent activation, parallel action graph generation, and syntax/capability monitoring to ensure compositional coordination.
  • Combo-MoE (MMLab@HKUxD-Robotics, Control Track):
    • Shared VLM backbone; mixture-of-experts head with $2^N - 1$ experts covering all nonempty arm subsets; routing and adapter mechanisms for expert fusion. Three-stage training: expert pretraining, router-adapter tuning, and joint fine-tuning.
  • CoVLA (INSAIT):
    • Decentralized, independent VLA policies for each arm; shared visual workspace for implicit coordination; reward shaping to enforce temporal alignment and collision avoidance (Kang et al., 26 Jan 2026).
  • Hierarchical CrewAI Framework:
    • Manager-worker delegation with explicit reporting; role-, tool-, and process-compliance enforced; design guidelines for process transparency, proactive failure recovery, and contextual grounding formulated based on observed failure modes (Bai et al., 4 Jun 2025).
  • Dialogue-Driven Planning (LLM-MARS):
    • Operator dialogue initiates BT generation; execution context streamed back in XML for QA; multimodal switch between BT and QA adapters for fluent, informative human–robot interaction (Lykov et al., 2023).
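As a small illustration of the Combo-MoE idea, one can enumerate the 2^N − 1 nonempty arm subsets and route to the expert matching the currently active arms. This enumeration is a sketch of the combinatorial structure only, not the team's released system.

```python
from itertools import combinations

def arm_subsets(arms):
    """All 2^N - 1 nonempty subsets of N arms, one expert per subset."""
    subsets = []
    for k in range(1, len(arms) + 1):
        subsets.extend(frozenset(c) for c in combinations(arms, k))
    return subsets

def route(active_arms, experts):
    """Hard routing: pick the expert whose subset exactly matches the
    active arms (the real system fuses experts via router and adapters)."""
    return experts[frozenset(active_arms)]

arms = ["arm_0", "arm_1", "arm_2"]  # N = 3 -> 2^3 - 1 = 7 experts
experts = {s: f"expert_{i}" for i, s in enumerate(arm_subsets(arms))}
chosen = route(["arm_0", "arm_2"], experts)
```

For N = 4 (the largest challenge configuration) this yields 15 experts, which motivates the shared backbone and staged training described above.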

5. Challenges, Limitations, and Future Research Directions

Persistent research challenges and open issues emerging from the MARS Challenge and associated systems include:

  • Coordination Scalability: Exponential growth of the joint action space with team size necessitates efficient abstractions, expert decomposition, and hierarchical planning. Current methods scale effectively to teams of up to 4 agents but degrade beyond that (Kang et al., 26 Jan 2026).
  • Sim-to-Real Transfer: Progress is predominantly shown in physics-based simulators; translating policies and planning to real-world robots remains challenging due to sensor noise, dynamics variability, and safety constraints (Kang et al., 26 Jan 2026, Gao et al., 3 Nov 2025).
  • Natural Language Ambiguity: Robust parsing and formalization of instruction uncertainty, as well as disambiguation in multi-agent contexts, require further theoretical foundation and experimental validation.
  • Real-Time Replanning: Existing systems, such as LLM-MARS, encounter performance bottlenecks with >2-task commands and slow adapter switching (~40s), hindering online adaptation. Future directions include intermediate “decomposer” adapters, hierarchical planners, and fast multi-task adapter architectures (Lykov et al., 2023).
  • Grounding and Personalization: Systems like MARS-MLLM highlight risk-aware planning and linguistic grounding to executable skills. Limitations include fixed primitive libraries and lack of end-to-end motor policy learning; ongoing work investigates imitation-based skill grounding and continual preference learning (Gao et al., 3 Nov 2025).
  • Robustness and Workflow Adherence: Failures in role assignment, tool access, in-time failure recovery, and reflection accuracy persist in hierarchical frameworks. Solutions focus on structured reasoning logs, debate protocols, and adaptive monitoring (Bai et al., 4 Jun 2025).

6. Domain-Specific Applications and Generalization

MARS Challenge systems demonstrate extensibility across diverse domains:

  • Assistive Intelligence: Multi-agent MLLMs for smart home robots assist people with disabilities via adaptive risk assessment, scene understanding, and personalized planning (Gao et al., 3 Nov 2025).
  • Industrial Logistics: Modular BT frameworks (LLM-MARS) coordinate fleets of forklifts, drones, or collaborative robot lines in warehouse and Industry 5.0 contexts (Lykov et al., 2023).
  • Exploration and Swarm Robotics: Autonomous resource collection in unknown terrains, as exemplified by the Swarmathon, highlights decentralized algorithms for search and discovery with no global map (Ackerman et al., 2018).

Adaptability is achieved by retraining perception and risk-classification modules, extending planners to new environments, and integrating multimodal inputs (RGB, depth, segmentation, point-cloud) (Gao et al., 3 Nov 2025). Real-world deployment requires improvements in latency, motor-skill generalization, and integration of online feedback mechanisms.

7. Summary and Outlook

The MARS Challenge serves as a reference testbed for benchmarking multi-agent robotic systems integrating perception, reasoning, planning, control, and dialogue. Solutions employing LLMs and VLMs demonstrate the viability of modular, iterative, and expertise-decomposed design pipelines. Quantitative evaluations corroborate the significance of self-correction, robust spatial reasoning, and modular coordination. Continued research focuses on scaling to larger teams, bridging sim-to-real gaps, advancing hierarchical planning, and maintaining transparency and resilience in real-world applications (Kang et al., 26 Jan 2026, Gao et al., 3 Nov 2025, Lykov et al., 2023, Bai et al., 4 Jun 2025, Ackerman et al., 2018).
