
SMART-LLM Replication Overview

Updated 9 December 2025
  • The paper details an end-to-end replication process that reproduces SMART-LLM's multi-stage architecture using LLM prompt engineering and rigorous benchmarking.
  • SMART-LLM replication is defined by discrete stages—from task decomposition to coalition formation and execution—ensuring transparency and auditability.
  • Empirical findings and ablation studies validate the framework's robustness, demonstrating significant performance dependencies on prompt design and coalition strategies.

SMART-LLM Replication refers to the rigorous, end-to-end process of reproducing the architecture, data flows, algorithmic stages, and benchmark results of “SMART-LLM: Smart Multi-Agent Robot Task Planning using LLMs” as originally described by Kannan et al. (2023). The SMART-LLM pipeline defines a structured, multi-stage system for transforming high-level human instructions into executable multi-robot plans using LLM-mediated reasoning and code synthesis. Reproduction demands adherence to the programmatic prompt designs, data protocols, experimental methodology, and evaluation protocols specified in the original work.

1. System Architecture and Sequential Workflow

The SMART-LLM framework decomposes the multi-robot planning process into four strict stages—Task Decomposition, Coalition Formation, Task Allocation, and Execution—each instantiated as a discrete LLM-prompted transform (Kannan et al., 2023). The architecture is modular, with each stage producing artifacts to be passed to the next, facilitating precise replication and auditability.

Stages and Data Flow:

  1. Task Decomposition
    • Input: human instruction I (string), robot primitive skills Δ (Python dict), environment entities E (Python dict).
    • Process: few-shot Python prompt to the LLM; output is an ordered list of sub-tasks {T^1, ..., T^K}, each with required skills and arguments.
    • Output: Python list of dicts describing the decomposition.
  2. Coalition Formation
    • Input: sub-tasks {T^k}, robot roster R = {R^1, ..., R^N}, each robot R^n with skill set S^n ⊆ Δ.
    • Process: few-shot prompt to the LLM; output is a coalition assignment dict C = {task_id: [robot_ids]}.
  3. Task Allocation and Code Generation
    • Input: sub-tasks {T^k}, coalition mapping C.
    • Process: few-shot prompt to the LLM; output is complete, executable Python code implementing the multi-robot plan. Parallelization (via threading) and ordering reflect the prior coalition mapping and temporal dependencies.
    • Output: plan.py (Python module).
  4. Execution
    • Input: plan.py script.
    • Process: executed in simulation (AI2-THOR API) or on real robots (ROS API).
    • Output: final robot-world state for evaluation.

All programmable stages are represented by dedicated files: decomposition.py → coalition.py → allocation.py → execute.py. Code, datasets, and experiment scripts are available at https://sites.google.com/view/smart-llm/ (Kannan et al., 2023).
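The stage-to-stage hand-off can be sketched as a thin driver; the function names and the `llm` callable below are illustrative stand-ins for the entry points of the listed files, not the original APIs:

```python
# Minimal sketch of the four-stage chaining described above.
# Each stage consumes the artifact produced by the previous one.

def decompose(instruction, skills, entities, llm):
    # decomposition.py: few-shot prompt -> ordered list of sub-task dicts
    return llm({"stage": "decompose", "instruction": instruction,
                "skills": skills, "entities": entities})

def form_coalitions(subtasks, robots, llm):
    # coalition.py: sub-tasks + roster -> {task_id: [robot_ids]}
    return llm({"stage": "coalition", "subtasks": subtasks, "robots": robots})

def allocate(subtasks, coalitions, llm):
    # allocation.py: coalition mapping -> executable plan.py source text
    return llm({"stage": "allocate", "subtasks": subtasks,
                "coalitions": coalitions})

def run_pipeline(instruction, skills, entities, robots, llm):
    subtasks = decompose(instruction, skills, entities, llm)
    coalitions = form_coalitions(subtasks, robots, llm)
    return allocate(subtasks, coalitions, llm)  # executed by execute.py
```

Because every stage is a pure prompt-to-artifact transform, the intermediate outputs can be dumped to disk between stages, which is what makes the pipeline auditable.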

2. LLM Prompt Engineering and Configuration

SMART-LLM replication requires strict adherence to original prompt design and minimal model tuning. All stages use Pythonic, commented, few-shot prompts with realistic in-line examples. No fine-tuning or hyperparameter optimization is employed.

Model Choices and Decoding Parameters:

  • Supported LLMs: GPT-4 (“gpt-4”), GPT-3.5-turbo, LLaMA-2-70B (HuggingFace), Claude-3-Opus (Anthropic API) (Kannan et al., 2023).
  • Decoding: temperature = 0.0 (for determinism), max_tokens = 1500, top_p = 1.0, frequency_penalty = 0.0, presence_penalty = 0.0, n = 1.
  • Prompt templates incorporate explicit instruction boundaries, header comments, and in-context exemplars for all transforms.
  • All object, skill, and robot listings must be valid Python dicts, and chain-of-thought reasoning is implicit via verbose code comments in exemplars.
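With the legacy pre-1.0 `openai` Python interface (matching the `openai>=0.27.0` pin listed later), these decoding settings map directly onto chat-completion request arguments. A minimal sketch bundling them in one place (the `build_request` helper is hypothetical):

```python
# Decoding parameters from the replication spec, kept in one dict
# so every stage issues requests with identical settings.
DECODING = dict(
    temperature=0.0,        # deterministic decoding
    max_tokens=1500,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    n=1,
)

def build_request(model, prompt):
    """Assemble chat-completion kwargs; pass to the API client of choice."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            **DECODING}
```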

Example Stage 1 Prompt Template:

skills = {
  "picker": ["PickUpObject(max_mass)", "PlaceObject"],
  "mover": ["GoToLocation", "NavigateDoor"],
  ...
}
objects = {
  "laptop": {"type":"electronic","location":"desk"},
  "TV": {"type":"electronic","location":"cabinet"},
  ...
}
### Now decompose this new instruction:
instruction = "I want to turn off the desk light and start watching TV"
###
The resulting LLM output is a Python list of ordered sub-tasks, as in the original.
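For the instruction above, the decomposition would take roughly the following shape; the field names and skill labels here are illustrative, since the original exemplars fix the exact schema:

```python
# Illustrative Stage 1 output: an ordered list of sub-task dicts,
# each naming the required skills and their arguments.
subtasks = [
    {"task_id": "T1",
     "description": "turn off the desk light",
     "skills": ["GoToLocation", "SwitchOff"],
     "args": {"object": "desk_light"}},
    {"task_id": "T2",
     "description": "turn on the TV",
     "skills": ["GoToLocation", "SwitchOn"],
     "args": {"object": "TV"}},
]
```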

3. Dataset and Benchmark Specifications

SMART-LLM defines its own multi-robot planning benchmark (Kannan et al., 2023):

  • 36 task instances spanning four categories:
    • Elemental (single-robot, single-skill)
    • Simple (multi-object, sequential/parallel)
    • Compound (heterogeneous skills)
    • Complex (team formation, coordinated action)
  • Each instance specifies:
    • instruction (string),
    • floorplan (AI2-THOR scene),
    • robot roster (per-robot skill sets),
    • ground-truth goal states and required robot transitions.
  • Data format is JSON. Code and generator scripts are provided for full recreation.
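A minimal loader that checks each benchmark instance for the required fields might look like this (assuming `benchmark.json` is a JSON list of task objects with the field names listed above):

```python
import json

# Fields every benchmark instance must carry, per the dataset spec.
REQUIRED = ("instruction", "floorplan", "robots",
            "ground_truth_states", "gt_transitions")

def load_benchmark(path):
    """Load benchmark.json and verify each instance has the required fields."""
    with open(path) as f:
        tasks = json.load(f)
    for i, task in enumerate(tasks):
        missing = [k for k in REQUIRED if k not in task]
        if missing:
            raise ValueError(f"task {i} missing fields: {missing}")
    return tasks
```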

Benchmark Dataset Structure Table

| Field               | Description                    | Example                      |
|---------------------|--------------------------------|------------------------------|
| instruction         | High-level task string         | "Take all mugs to sink"      |
| floorplan           | AI2-THOR scene identifier      | "FloorPlan1_Scene3"          |
| robots              | List of {'id', 'skills'} dicts | [{"id":"r1","skills":[...]}] |
| ground_truth_states | Final object/world state       | [{"object":"mug", ...}]      |
| gt_transitions      | Number of robot-group switches | 2                            |

4. Experimental Protocols and Metrics

Experiments are executed both in simulation (AI2-THOR v3.2, Python 3.8) and on real robots (TurtleBot3 for ground, DJI Mavic for aerial). Prompts are identical across the two modalities; only the execute.py backend shifts from the AI2-THOR API to ROS APIs (Kannan et al., 2023).

Evaluation Metrics (also reproduced below in table format):

  • Executability (Exe): fraction of planned actions actually executed,

\mathrm{Exe} = \frac{\#\text{actions actually executable}}{\#\text{planned actions}}

  • Goal Condition Recall (GCR): 1 minus the fraction of unmet goal conditions,

\mathrm{GCR} = 1 - \frac{|S_{GT} \setminus S_{E}|}{|S_{GT}|}

  • Task Completion Rate (TCR): binary, 1 if GCR = 1, else 0.
  • Robot Utilization (RU): 1 minus the normalized deviation in robot transitions,

\mathrm{RU} = 1 - \frac{|\tau_E - \tau_{GT}|}{\tau_{GT}}

  • Success Rate (SR): 1 iff GCR = 1 and RU = 1, else 0.

| Metric        | Definition                     | Range  |
|---------------|--------------------------------|--------|
| Executability | Executed / Planned actions     | [0, 1] |
| GCR           | 1 − \|S_GT ∖ S_E\| / \|S_GT\|  | [0, 1] |
| TCR           | 1 if GCR = 1, else 0           | {0, 1} |
| RU            | 1 − \|τ_E − τ_GT\| / τ_GT      | [0, 1] |
| SR            | 1 if GCR = 1 ∧ RU = 1, else 0  | {0, 1} |
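The five metrics reduce to a few set and counter operations; a sketch directly from the definitions above:

```python
def compute_metrics(executed, planned, goal_gt, goal_achieved, tau_e, tau_gt):
    """Compute Exe, GCR, TCR, RU, SR from the metric definitions.

    goal_gt / goal_achieved are sets of goal conditions (S_GT, S_E);
    tau_e / tau_gt are executed and ground-truth robot-group transition counts.
    """
    exe = executed / planned if planned else 0.0
    gcr = 1 - len(goal_gt - goal_achieved) / len(goal_gt)
    tcr = 1 if gcr == 1 else 0
    ru = 1 - abs(tau_e - tau_gt) / tau_gt
    sr = 1 if gcr == 1 and ru == 1 else 0
    return {"Exe": exe, "GCR": gcr, "TCR": tcr, "RU": ru, "SR": sr}
```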

Main results (AI2-THOR, GPT-4):

  • Elemental: SR=1.00, TCR=1.00, RU=1.00
  • Simple: SR=0.62, TCR=1.00, RU=0.62
  • Compound: SR=0.69, TCR=0.76, RU=0.92
  • Complex: SR=0.71, TCR=0.85, RU=1.00

Ablation studies indicate 20–40% reduction in SR by removing code comments or summary blocks; skipping coalition formation drops SR to 0.60.

5. Algorithmic Details and Utility Functions

All substantive reasoning relies on unmodified LLM outputs prompted by few-shot, Pythonic exemplars. For reference, a utility function (not invoked by the original system, but analytically useful) is given:

U(R_j, T^k) = \frac{|S_j \cap T_S^k|}{|T_S^k|}

Robots with U = 1 are preferred for direct assignment; U < 1 triggers the formation of a multi-robot coalition.
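Although the original system leaves coalition reasoning entirely to the LLM, the utility can be computed directly. The greedy coalition fallback below is an illustrative heuristic, not part of the original pipeline:

```python
def utility(robot_skills, task_skills):
    """U(R_j, T^k): fraction of the task's required skills the robot covers."""
    task_skills = set(task_skills)
    return len(set(robot_skills) & task_skills) / len(task_skills)

def assign(robots, task_skills):
    """Prefer a single robot with U == 1; otherwise greedily form a coalition.

    robots: {robot_id: [skills]}.  The greedy step is a hypothetical
    stand-in for the LLM's in-context coalition reasoning.
    """
    scored = sorted(robots.items(),
                    key=lambda kv: utility(kv[1], task_skills), reverse=True)
    if utility(scored[0][1], task_skills) == 1.0:
        return [scored[0][0]]
    coalition, covered = [], set()
    for rid, skills in scored:
        gain = set(skills) & set(task_skills) - covered
        if gain:
            coalition.append(rid)
            covered |= gain
        if covered == set(task_skills):
            break
    return coalition
```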

Task decomposition and coalition assignment are supported by direct function calls, e.g.:

def DecomposeTask(I, Δ, E):
    """Stage 1: decompose instruction I given skills Δ and entities E."""
    prompt = build_decomp_prompt(Δ, E, examples, I)
    response = LLM.generate(prompt)
    T = parse_subtasks(response)  # ordered list of sub-task dicts
    return T
All logic—including handling of environment, skills, mappings, and ordering—is LLM-mediated. No use of optimization solvers or symbolic planners is present.
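The `parse_subtasks` helper in the listing above is left unspecified; one defensive sketch, assuming the LLM emits a Python-literal list of dicts somewhere in its response:

```python
import ast

def parse_subtasks(response):
    """Extract the first Python list literal from an LLM response.

    Uses ast.literal_eval rather than eval so arbitrary generated code
    is never executed during parsing.
    """
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no list literal found in LLM response")
    subtasks = ast.literal_eval(response[start:end + 1])
    if not isinstance(subtasks, list):
        raise ValueError("parsed object is not a list")
    return subtasks
```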

6. Implementation Details and Replication Checklist

Code base is structured as follows:

smart_llm/
  data/
    benchmark.json
  prompts/
    decomp_examples.py
    coalition_examples.py
    alloc_examples.py
  src/
    decomposition.py
    coalition.py
    allocation.py
    execute.py
  requirements.txt
  Dockerfile
  README.md
Dependencies (minimal version specification):

  • openai>=0.27.0, anthropic>=1.0.0, ai2thor>=3.2, torch>=2.0, transformers>=4.30, flask

Execution requirements:

  • Prepare Python environment, set API keys, install requirements.
  • For full pipeline: run decomposition, coalition, allocation, execute in sequential order, passing output files between stages.

Reproducibility Manifest:

  • All code, prompts, benchmarks, and experimental scripts supplied (Kannan et al., 2023).
  • No fine-tuning or hyperparameter search required.
  • Dockerfile provided for environmental consistency.
  • For real robots, the only substitution needed is the execution backend.

7. Key Empirical Findings and Limitations

SMART-LLM achieves high success rates on elemental and basic tasks; substantial gains are observed relative to baselines for compound/complex classes. Notably, prompt engineering—specifically verbose commenting and summary blocks—is essential, with their removal resulting in sharply degraded metrics.

Ablation summary:

| Variation         | SR   | TCR  | GCR  | RU   | Exe  |
|-------------------|------|------|------|------|------|
| Full SMART-LLM    | 0.75 | 0.90 | 0.94 | 0.88 | 0.99 |
| – No Comments     | 0.48 | 0.65 | 0.73 | 0.75 | 0.78 |
| – No Summary Blks | 0.61 | 0.74 | 0.80 | 0.78 | 0.81 |
| – No Coalition    | 0.60 | 0.68 | 0.75 | 0.85 | 0.82 |

Variability is minimal at the deterministic (zero-temperature) setting. On complex tasks, performance is more volatile (SR = 0.48 ± 0.40).

Real-robot transfer is demonstrated for vision and patrol tasks on TurtleBot3/DJI Mavic with the same prompt and planning pipeline.

Principal limitations:

  • Absence of explicit symbolic planning or failover; all logic is “LLM inside the prompt.”
  • Reliance on LLM generalization and in-context learning for coalition reasoning.
  • Task success heavily depends on prompt richness and annotation.

References

Kannan et al. (2023). SMART-LLM: Smart Multi-Agent Robot Task Planning using LLMs. Project page: https://sites.google.com/view/smart-llm/