SMART-LLM Replication Overview
- The paper details an end-to-end replication process that reproduces SMART-LLM's multi-stage architecture using LLM prompt engineering and rigorous benchmarking.
- SMART-LLM replication is defined by discrete stages—from task decomposition to coalition formation and execution—ensuring transparency and auditability.
- Empirical findings and ablation studies validate the framework's robustness, demonstrating significant performance dependencies on prompt design and coalition strategies.
SMART-LLM Replication refers to the rigorous, end-to-end process of reproducing the architecture, data flows, algorithmic stages, and benchmark results for “SMART-LLM: Smart Multi-Agent Robot Task Planning using LLMs” as originally described in (Kannan et al., 2023). The SMART-LLM pipeline defines a structured, multi-stage system for transforming high-level human instructions into executable multi-robot plans using LLM-mediated reasoning and code synthesis. Reproduction demands adherence to programmatic prompt designs, data protocols, experimental methodology, and evaluation protocols as specified in the original work.
1. System Architecture and Sequential Workflow
The SMART-LLM framework decomposes the multi-robot planning process into four strict stages—Task Decomposition, Coalition Formation, Task Allocation, and Execution—each instantiated as a discrete LLM-prompted transform (Kannan et al., 2023). The architecture is modular, with each stage producing artifacts to be passed to the next, facilitating precise replication and auditability.
Stages and Data Flow:
1. Task Decomposition
   - Input: human instruction (string), robot primitive skills (Python dict), environment entities (Python dict).
   - Process: few-shot Python prompt to the LLM; the output is an ordered list of sub-tasks, each with required skills and arguments.
   - Output: Python list of dicts describing the decomposition.
2. Coalition Formation
   - Input: sub-tasks and robot roster, each robot with its skill set.
   - Process: few-shot prompt to the LLM; the output is a coalition assignment dict mapping sub-tasks to robot groups.
   - Output: Python dict of coalition assignments.
3. Task Allocation and Code Generation
   - Input: sub-tasks and coalition mapping.
   - Process: few-shot prompt to the LLM; the output is complete, executable Python code implementing the multi-robot plan. Parallelization (via threading) and ordering reflect the prior coalition mapping and temporal dependencies.
   - Output: plan.py (Python module).
4. Execution
   - Input: plan.py.
   - Process: the generated plan is run against the simulation or real-robot backend.
All programmable stages are represented by dedicated files: decomposition.py → coalition.py → allocation.py → execute.py. Code, datasets, and experiment scripts are available at https://sites.google.com/view/smart-llm/ (Kannan et al., 2023).
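The four-stage data flow can be sketched as a single driver function. This is a minimal illustration, not the authors' code: the stage function names, artifact shapes, and the `llm` transform dict are assumptions used to make the hand-off between stages concrete.

```python
# Hypothetical sketch of the four-stage SMART-LLM data flow.
# Stage functions and artifact shapes are illustrative assumptions.

def run_pipeline(instruction, skills, objects, robots, llm):
    # Stage 1: Task Decomposition -> ordered list of sub-task dicts
    subtasks = llm["decompose"](instruction, skills, objects)
    # Stage 2: Coalition Formation -> {subtask_id: [robot_id, ...]}
    coalitions = llm["coalesce"](subtasks, robots)
    # Stage 3: Task Allocation / Code Generation -> plan.py source text
    plan_code = llm["allocate"](subtasks, coalitions)
    # Stage 4: Execution is handled separately by execute.py
    # (AI2-THOR API in simulation, ROS APIs on real robots)
    return subtasks, coalitions, plan_code

# Stub LLM transforms so the sketch runs without API access.
stub = {
    "decompose": lambda i, s, o: [{"id": 0, "skill": "GoToLocation", "args": ["desk"]}],
    "coalesce": lambda t, r: {0: ["r1"]},
    "allocate": lambda t, c: "# plan.py\nrobots['r1'].GoToLocation('desk')\n",
}
subtasks, coalitions, plan_code = run_pipeline(
    "Turn off the desk light",
    {"r1": ["GoToLocation"]},
    {"light": {"location": "desk"}},
    [{"id": "r1", "skills": ["GoToLocation"]}],
    stub,
)
```

Each stage's output is the next stage's input, which is what makes the pipeline auditable: every intermediate artifact can be inspected or replaced in isolation.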
2. LLM Prompt Engineering and Configuration
SMART-LLM replication requires strict adherence to original prompt design and minimal model tuning. All stages use Pythonic, commented, few-shot prompts with realistic in-line examples. No fine-tuning or hyperparameter optimization is employed.
Model Choices and Decoding Parameters:
- Supported LLMs: GPT-4 (“gpt-4”), GPT-3.5-turbo, LLaMA-2-70B (HuggingFace), Claude-3-Opus (Anthropic API) (Kannan et al., 2023).
- Decoding: temperature = 0.0 (for determinism), max_tokens = 1500, top_p = 1.0, frequency_penalty = 0.0, presence_penalty = 0.0, n = 1.
- Prompt templates incorporate explicit instruction boundaries, header comments, and in-context exemplars for all transforms.
- All object, skill, and robot listings must be valid Python dicts, and chain-of-thought reasoning is implicit via verbose code comments in exemplars.
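The decoding configuration above can be collected into a single request payload. A minimal sketch: the `build_request` wrapper and chat-message framing are illustrative assumptions, since the original scripts may package the call differently.

```python
# Decoding parameters as specified for replication (deterministic decoding).
DECODING_PARAMS = {
    "temperature": 0.0,
    "max_tokens": 1500,
    "top_p": 1.0,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "n": 1,
}

def build_request(model, prompt):
    """Assemble a chat-completion request payload for one prompted transform.

    The wrapper shape is a hypothetical convenience, not the original API code.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **DECODING_PARAMS,
    }

req = build_request("gpt-4", "### Decompose this new instruction ...")
```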
Example Stage 1 Prompt Template:
```python
skills = {
    "picker": ["PickUpObject(max_mass)", "PlaceObject"],
    "mover": ["GoToLocation", "NavigateDoor"],
    ...
}
objects = {
    "laptop": {"type": "electronic", "location": "desk"},
    "TV": {"type": "electronic", "location": "cabinet"},
    ...
}
### Now decompose this new instruction:
instruction = "I want to turn off the desk light and start watching TV"
###
```
3. Dataset and Benchmark Specifications
SMART-LLM defines its own multi-robot planning benchmark (Kannan et al., 2023):
- 36 task instances spanning four categories:
- Elemental (single-robot, single-skill)
- Simple (multi-object, sequential/parallel)
- Compound (heterogeneous skills)
- Complex (team formation, coordinated action)
- Each instance specifies:
- instruction (string),
- floorplan (AI2-THOR scene),
- robot roster (per-robot skill sets),
- ground-truth goal states and required robot transitions.
- Data format is JSON. Code and generator scripts are provided for full recreation.
Benchmark Dataset Structure Table
| Field | Description | Example |
|---|---|---|
| instruction | High-level task string | "Take all mugs to sink" |
| floorplan | AI2-THOR scene identifier | "FloorPlan1_Scene3" |
| robots | List of {'id', 'skills'} dict | [{"id":"r1","skills":[...]}] |
| ground_truth_states | Final object/world state | [{"object":"mug", ...}] |
| gt_transitions | Number of robot-group switches | 2 |
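A benchmark instance following the table above can be represented and validated as below. The field names match the table, but the exact JSON schema and the validator are assumptions for illustration.

```python
import json

# One benchmark instance, with field names taken from the dataset table.
instance = {
    "instruction": "Take all mugs to sink",
    "floorplan": "FloorPlan1_Scene3",
    "robots": [{"id": "r1", "skills": ["PickUpObject", "GoToLocation"]}],
    "ground_truth_states": [{"object": "mug", "location": "sink"}],
    "gt_transitions": 2,
}

REQUIRED_FIELDS = {
    "instruction", "floorplan", "robots", "ground_truth_states", "gt_transitions",
}

def validate_instance(inst):
    """Check that all required benchmark fields are present (illustrative)."""
    missing = REQUIRED_FIELDS - inst.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return True

validate_instance(instance)
serialized = json.dumps(instance)  # the data format is JSON
```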
4. Experimental Protocols and Metrics
Experiments are executed both in simulation (AI2-THOR v3.2, Python 3.8) and on real robots (TurtleBot3 for ground, DJI Mavic for aerial). Prompts are identical across modalities; only the execute.py backend shifts from the AI2-THOR API to ROS APIs (Kannan et al., 2023).
Evaluation Metrics (also reproduced below in table format):
- Executability (Exe): fraction of planned actions actually executed.
- Goal Condition Recall (GCR): 1 minus the fraction of unmet goal conditions.
- Task Completion Rate (TCR): binary; 1 if GCR = 1, else 0.
- Robot Utilization (RU): 1 minus the normalized deviation in robot-group transitions.
- Success Rate (SR): 1 if GCR = 1 and RU = 1, else 0.
| Metric | Definition | Range |
|---|---|---|
| Executability (Exe) | Fraction of planned actions executed | [0,1] |
| GCR | 1 − fraction of unmet goal conditions | [0,1] |
| TCR | 1 if GCR = 1, else 0 | {0,1} |
| RU | 1 − normalized deviation in robot transitions | [0,1] |
| SR | 1 if GCR = 1 and RU = 1, else 0 | {0,1} |
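The five metrics can be computed directly from plan statistics. This sketch follows the prose definitions; the input quantities (counts of executed and planned actions, unmet and total goal conditions, observed vs. ground-truth transitions) are my reading of the definitions, not the authors' evaluation code.

```python
# Metric computations per the definitions above (illustrative).

def executability(executed, planned):
    """Fraction of planned actions actually executed."""
    return executed / planned if planned else 1.0

def gcr(unmet, total_conditions):
    """Goal Condition Recall: 1 minus the fraction of unmet goal conditions."""
    return 1.0 - unmet / total_conditions if total_conditions else 1.0

def tcr(g):
    """Task Completion Rate: binary on full goal satisfaction."""
    return 1 if g == 1.0 else 0

def ru(observed_transitions, gt_transitions):
    """Robot Utilization: 1 minus normalized deviation from ground-truth transitions."""
    if gt_transitions == 0:
        return 1.0 if observed_transitions == 0 else 0.0
    return max(0.0, 1.0 - abs(observed_transitions - gt_transitions) / gt_transitions)

def sr(g, r):
    """Success Rate: 1 only when both GCR and RU are perfect."""
    return 1 if g == 1.0 and r == 1.0 else 0
```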
Main results (AI2-THOR, GPT-4):
- Elemental: SR=1.00, TCR=1.00, RU=1.00
- Simple: SR=0.62, TCR=1.00, RU=0.62
- Compound: SR=0.69, TCR=0.76, RU=0.92
- Complex: SR=0.71, TCR=0.85, RU=1.00
Ablation studies indicate 20–40% reduction in SR by removing code comments or summary blocks; skipping coalition formation drops SR to 0.60.
5. Algorithmic Details and Utility Functions
All substantive reasoning relies on unmodified LLM outputs prompted by few-shot, Pythonic exemplars. For reference, a utility function (not invoked by the original system, but analytically useful) is given:
Robots whose skill sets fully cover a sub-task's required skills are preferred for direct assignment; incomplete coverage triggers coalition combination.
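A minimal sketch of such a utility, assuming a skill-coverage criterion inferred from the coalition-formation description; the function names and the greedy combination strategy are my own illustration, not part of the original system.

```python
# Hypothetical skill-coverage utility for coalition reasoning (illustrative).

def skill_coverage(robot_skills, required_skills):
    """Fraction of a sub-task's required skills that the robot possesses."""
    required = set(required_skills)
    return len(required & set(robot_skills)) / len(required) if required else 1.0

def assign(robots, required_skills):
    """Prefer a single fully covering robot; otherwise greedily form a coalition."""
    for rid, skills in robots.items():
        if skill_coverage(skills, required_skills) == 1.0:
            return [rid]
    coalition, covered = [], set()
    for rid, skills in robots.items():
        gain = set(skills) & set(required_skills) - covered
        if gain:
            coalition.append(rid)
            covered |= gain
        if covered >= set(required_skills):
            break
    return coalition
```

In the actual system this reasoning happens inside the LLM prompt rather than in explicit code, which is precisely why the function is analytically useful but never invoked.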
Task decomposition and coalition assignment are supported by direct function calls, e.g.:
```python
def DecomposeTask(I, Δ, E):
    prompt = build_decomp_prompt(Δ, E, examples, I)
    response = LLM.generate(prompt)
    T = parse_subtasks(response)
    return T
```
6. Implementation Details and Replication Checklist
Code base is structured as follows:
```
smart_llm/
  data/
    benchmark.json
  prompts/
    decomp_examples.py
    coalition_examples.py
    alloc_examples.py
  src/
    decomposition.py
    coalition.py
    allocation.py
    execute.py
  requirements.txt
  Dockerfile
  README.md
```
Core dependencies (requirements.txt): openai>=0.27.0, anthropic>=1.0.0, ai2thor>=3.2, torch>=2.0, transformers>=4.30, flask.
Execution requirements:
- Prepare Python environment, set API keys, install requirements.
- For full pipeline: run decomposition, coalition, allocation, execute in sequential order, passing output files between stages.
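Running the four stages in order, passing output files between them, can be sketched as follows. The script names match the repository layout above, but the `--in`/`--out` CLI arguments and artifact paths are illustrative assumptions.

```python
import subprocess

# Stage scripts (from the repository layout) paired with hypothetical
# output artifacts; each stage consumes the previous stage's output.
STAGES = [
    ("src/decomposition.py", "out/subtasks.json"),
    ("src/coalition.py", "out/coalitions.json"),
    ("src/allocation.py", "out/plan.py"),
    ("src/execute.py", "out/log.json"),
]

def run_stages(stages, runner=subprocess.run):
    """Invoke each stage script sequentially, threading artifacts through."""
    prev_artifact = None
    for script, artifact in stages:
        cmd = ["python", script, "--out", artifact]
        if prev_artifact:
            cmd += ["--in", prev_artifact]
        runner(cmd, check=True)
        prev_artifact = artifact
    return prev_artifact
```

The injectable `runner` keeps the sketch testable without actually executing the scripts.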
Reproducibility Manifest:
- All code, prompts, benchmarks, and experimental scripts supplied (Kannan et al., 2023).
- No fine-tuning or hyperparameter search required.
- Dockerfile provided for environmental consistency.
- For real robots, the only substitution needed is the execution backend.
7. Key Empirical Findings and Limitations
SMART-LLM achieves high success rates on elemental and basic tasks; substantial gains are observed relative to baselines for compound/complex classes. Notably, prompt engineering—specifically verbose commenting and summary blocks—is essential, with their removal resulting in sharply degraded metrics.
Ablation summary:

| Variation | SR | TCR | GCR | RU | Exe |
|---|---|---|---|---|---|
| Full SMART-LLM | 0.75 | 0.90 | 0.94 | 0.88 | 0.99 |
| – No Comments | 0.48 | 0.65 | 0.73 | 0.75 | 0.78 |
| – No Summary Blks | 0.61 | 0.74 | 0.80 | 0.78 | 0.81 |
| – No Coalition | 0.60 | 0.68 | 0.75 | 0.85 | 0.82 |
Variability is minimal at deterministic temperature; on complex tasks, performance is more volatile.
Real-robot transfer is demonstrated for vision and patrol tasks on TurtleBot3/DJI Mavic with the same prompt and planning pipeline.
Principal limitations:
- Absence of explicit symbolic planning or failover; all logic is “LLM inside the prompt.”
- Reliance on LLM generalization and in-context learning for coalition reasoning.
- Task success heavily depends on prompt richness and annotation.
References
- Kannan et al. (2023). “SMART-LLM: Smart Multi-Agent Robot Task Planning using LLMs.”