RealAppliance-Bench: Appliance Simulation Benchmark
- The benchmark provides high-fidelity, photorealistic digital appliance assets fully aligned with user manuals to rigorously evaluate multimodal LLMs and embodied agents.
- RealAppliance-Bench comprises 100 appliances across 14 types, featuring detailed 3D models, high-res textures, and modular mechanism modeling for precise manipulation tasks.
- Baseline evaluations reveal strong manual understanding but significant challenges in fine-grained spatial grounding and long-horizon planning, highlighting key areas for improvement in embodied AI.
The RealAppliance dataset is a collection of 100 high-fidelity, photorealistic digital appliance assets, each faithfully aligned with a real-world user manual at both the component and program-logic level. Developed to address persistent simulation-reality gaps in appliance manipulation research, RealAppliance enables rigorous benchmarking for multimodal LLMs (MLLMs) and embodied planning agents, offering atomic-level manipulation, detailed annotation, and exact manual-model correspondence (Gao et al., 29 Nov 2025).
1. Dataset Structure and Scope
RealAppliance comprises 100 unique appliances across 14 types, spanning major kitchen, food preparation, and laundry devices. Each asset represents a specific brand/model variant as described in an official user manual. The collection encompasses diverse form-factors and interface modalities, including analog, digital, and touch controls. Appliance categories include:
- Kitchen Cooking: oven, toaster, air fryer, microwave, rice cooker, electric hot pot, deep fryer
- Food Preparation: mixer, blender, bread machine, coffee machine, kettle, ice maker
- Laundry: washing machine
Manuals, geometry, texture assets, and logic scripts are bundled per device. The dataset covers all significant control-panel designs and includes multiple real-product variants within each type.
2. Asset Fidelity and Mechanism Modeling
High-fidelity 3D assets are authored using Autodesk 3ds Max, employing TurboSmooth subdivision for polygon counts ranging from 200,000 to 2 million triangles per model. Textures are UV-unwrapped with resolutions ≥ 4K, replicating logos, scales, and interface elements. Control panels and touch areas are mapped to isolated UV islands, facilitating dynamic updates.
The assets are exported in Universal Scene Description (USD) format for NVIDIA Isaac Sim, supporting per-model Level-of-Detail (LOD) switching with both high-poly and mid-poly variants.
Mechanisms are modular, realized as Isaac Sim classes:
- Physical: spring returns, magnetic seals, mechanical triggers, knob countdowns, safety locks
- Electronic: dynamic screen textures, touch sensing, illumination, indicator LEDs, rotary actuators
Each appliance exposes a state vector (e.g., power, temperature, timer, mode) managed by logic scripts (Python/C++) operating as finite state machines. These scripts coordinate mechanism invocations, periodic callbacks, and visual updates for realistic task execution.
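The state-vector and finite-state-machine pattern above can be sketched in a few lines. This is a minimal, hypothetical illustration (the class, state names, and method names are not the dataset's actual API), showing how mechanism invocations, periodic callbacks, and a state vector might fit together:

```python
from enum import Enum, auto

class OvenState(Enum):
    OFF = auto()
    IDLE = auto()
    HEATING = auto()

class OvenLogic:
    """Toy finite state machine in the style of an appliance logic script."""

    def __init__(self):
        self.state = OvenState.OFF
        self.temperature = 0   # target temperature (°C)
        self.timer = 0         # remaining seconds

    def press_power(self):
        # Toggle between OFF and IDLE.
        self.state = OvenState.IDLE if self.state is OvenState.OFF else OvenState.OFF

    def set_temperature(self, celsius: int):
        if self.state is not OvenState.OFF:
            self.temperature = celsius

    def start(self, seconds: int):
        if self.state is OvenState.IDLE and self.temperature > 0:
            self.timer = seconds
            self.state = OvenState.HEATING

    def tick(self, dt: int = 1):
        """Periodic callback: count down, return to IDLE when finished."""
        if self.state is OvenState.HEATING:
            self.timer = max(0, self.timer - dt)
            if self.timer == 0:
                self.state = OvenState.IDLE

    def state_vector(self) -> dict:
        return {"power": self.state is not OvenState.OFF,
                "mode": self.state.name,
                "temperature": self.temperature,
                "timer": self.timer}
```

A real logic script would additionally trigger visual updates (screen textures, LEDs) from these transitions.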
3. Manual-Model Alignment Methodology
Alignment rigor is maintained by a systematic process:
- Manual Collection: Source user manuals (PDF) with component diagrams, step procedures, and dimensional drawings.
- Modeling Pipeline: Extract dimensions and photographic references to guide CAD modeling; all 3D nodes are named identically to manual part lists during assembly.
- Programmatic Linking: A JSON mapping schema per appliance (`mapping.json`) provides exact node-to-manual correspondence, e.g.:

```json
{
  "component_name": "Knob_Temperature",
  "manual_sections": ["Sec 2.1_Start-Up", "Fig 3.2_Control-Panel"],
  "node_path": "/root/Body/Panel/Knob_Temp"
}
```

This design enables direct lookup from manual sections to 3D asset nodes. Alignment is considered exact by construction, with 100% of component nodes linked to manual terminology. An alignment-accuracy metric is defined, though no empirical value is reported.
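The section-to-node lookup enabled by `mapping.json` can be sketched as follows. This is an illustrative snippet (the helper name and the assumption that the file holds a list of records like the example are ours, not the dataset's documented API):

```python
import json

def build_section_index(records):
    """Invert mapping records so manual sections resolve to 3D node paths."""
    index = {}
    for rec in records:
        for section in rec["manual_sections"]:
            index.setdefault(section, []).append(rec["node_path"])
    return index

# A single record in the shape of the example above.
mapping = json.loads("""[
  {"component_name": "Knob_Temperature",
   "manual_sections": ["Sec 2.1_Start-Up", "Fig 3.2_Control-Panel"],
   "node_path": "/root/Body/Panel/Knob_Temp"}
]""")

index = build_section_index(mapping)
print(index["Fig 3.2_Control-Panel"])  # ['/root/Body/Panel/Knob_Temp']
```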
4. Dataset Statistics, Annotation, and Organization
Key statistics:
- Appliances: 100 across 14 categories
- Operable Components: 589 total (≈ 5.9/appliance)
- Manipulation Tasks: 979 (≈ 9.8/appliance)
- Disturbance Steps: 941 (for closed-loop evaluation)
- Manual Average Length: 766.2 words
- Average Plan Length: 7.57 steps
Bounding-box part-grounding labels (COCO-style JSON) enable spatial referencing for all 589 operable components. The collection is primarily designed for zero-shot evaluation; no explicit train/val/test split is provided.
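Since the grounding labels use COCO-style `[x, y, w, h]` boxes, the IoU metric used later in the benchmark reduces to a few lines. A minimal sketch (the function name is ours):

```python
def iou_xywh(a, b):
    """IoU of two COCO-style [x, y, w, h] bounding boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Intersection width/height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted half a box-width off the ground truth:
print(iou_xywh([0, 0, 10, 10], [5, 0, 10, 10]))  # 0.333...
```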
Filesystem layout:
```
RealAppliance/
├── 001_Oven/
│   ├── model.usd
│   ├── textures/
│   │   ├── panel_4k.png
│   │   └── body_4k.png
│   ├── manual.pdf
│   ├── mapping.json
│   └── program.py
├── 002_Toaster/
│   └── …
└── indices.json   # catalog of all 100 appliances
```
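Given this per-appliance bundling, a small sanity check that a bundle is complete can be written directly against the layout. This is a hypothetical helper (not shipped with the dataset), assuming only the file names shown in the tree above:

```python
from pathlib import Path

# Required top-level files per appliance, per the layout above.
REQUIRED = {"model.usd", "manual.pdf", "mapping.json", "program.py"}

def check_bundle(appliance_dir: Path) -> list:
    """Return the sorted list of required files missing from a bundle."""
    present = {p.name for p in appliance_dir.iterdir()} if appliance_dir.is_dir() else set()
    return sorted(REQUIRED - present)
```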
5. RealAppliance-Bench: Tasks and Baseline Results
RealAppliance-Bench evaluates agent capabilities on four primary tasks:
| Task Number | Task Name | Input/Output Synopsis | Key Metrics |
|---|---|---|---|
| 1 | Manual Page Retrieval | Manual + query → page numbers | Precision, Recall |
| 2 | Open-Loop Manipulation Planning | Instruction+pages+image → atomic action sequence | Task CR, Success Rate |
| 3 | Appliance Part Grounding | Image + part name → 2D bounding box | IoU, [email protected] |
| 4 | Closed-Loop Planning Adjustment | Plan, execution, observation → next corrective action | Stepwise Success Rate |
Task 1 employs precision (the fraction of retrieved pages that are relevant) and recall (the fraction of relevant pages that are retrieved). Task 2 requires atomic stepwise plan correctness; Task 3 uses IoU and [email protected]; Task 4 is assessed by the fraction of successful adjustment steps.
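The Task 1 retrieval metrics are standard set-based precision and recall over page numbers; a minimal sketch (function name is ours):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for manual page retrieval (Task 1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Two of three retrieved pages are relevant; two of three relevant pages found:
print(precision_recall([3, 4, 5], [4, 5, 6]))  # (0.666..., 0.666...)
```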
Baseline findings:
- Proprietary MLLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve ~87% recall/F₁ on Task 1, but drop to single-digit % success on Task 2.
- Task 3: top average IoU ≈ 12%, [email protected] ≈ 8.6%.
- Task 4: highest closed-loop stepwise SR ≈ 31% (Gemini 2.5 Flash).
- Embodied-planning baselines (Robobrain 2.0, ManualPlan, ApBot) underperform on document-related tasks but can match large models on certain physical manipulation subtasks.
- Full-process inference (a supplementary fifth, end-to-end evaluation): all models approach 0% success, reflecting significant error accumulation through the pipeline.
Reported challenges:
- MLLMs excel at manual understanding but fall short in fine-grained grounding and long-horizon planning.
- Spatial reasoning (part grounding) is a major performance bottleneck, with most IoUs falling in [0, 0.05].
- Robust closed-loop feedback adaptation remains unsolved.
6. Intended Applications and Distribution
RealAppliance is positioned as a comprehensive testbed for:
- Zero- and few-shot assessment of multimodal LLMs requiring document, vision, and action-planning integration.
- Training and benchmarking embodied agents in a photo-realistic, physically accurate simulator with direct manual alignment.
- Generation of demonstration data for low-level policy learning via scripted expert executions.
All assets, scripts, and benchmarks are publicly released at https://realappliance.github.io/ under a non-restrictive, MIT-style academic license. This structure facilitates broad reuse across robotics, embodied AI, and multimodal LLM research (Gao et al., 29 Nov 2025).