RealAppliance-Bench: Appliance Simulation Benchmark
- The benchmark provides high-fidelity, photorealistic digital appliance assets fully aligned with user manuals to rigorously evaluate multimodal LLMs and embodied agents.
- RealAppliance-Bench comprises 100 appliances across 14 types, featuring detailed 3D models, high-res textures, and modular mechanism modeling for precise manipulation tasks.
- Baseline evaluations reveal strong manual understanding but significant challenges in fine-grained spatial grounding and long-horizon planning, highlighting key areas for improvement in embodied AI.
The RealAppliance dataset is a collection of 100 high-fidelity, photorealistic digital appliance assets, each faithfully aligned with a real-world user manual at both the component and program-logic level. Developed to address persistent simulation-reality gaps in appliance manipulation research, RealAppliance enables rigorous benchmarking for multimodal LLMs (MLLMs) and embodied planning agents, offering atomic-level manipulation, detailed annotation, and exact manual-model correspondence (Gao et al., 29 Nov 2025).
1. Dataset Structure and Scope
RealAppliance comprises 100 unique appliances across 14 types, spanning major kitchen, food preparation, and laundry devices. Each asset represents a specific brand/model variant as described in an official user manual. The collection encompasses diverse form-factors and interface modalities, including analog, digital, and touch controls. Appliance categories include:
- Kitchen Cooking: oven, toaster, air fryer, microwave, rice cooker, electric hot pot, deep fryer
- Food Preparation: mixer, blender, bread machine, coffee machine, kettle, ice maker
- Laundry: washing machine
Manuals, geometry, texture assets, and logic scripts are bundled per device. The dataset covers all significant control-panel designs and includes multiple real-product variants within each type.
2. Asset Fidelity and Mechanism Modeling
High-fidelity 3D assets are authored using Autodesk 3ds Max, employing TurboSmooth subdivision for polygon counts ranging from 200,000 to 2 million triangles per model. Textures are UV-unwrapped with resolutions ≥ 4K, replicating logos, scales, and interface elements. Control panels and touch areas are mapped to isolated UV islands, facilitating dynamic updates.
The assets are exported in Universal Scene Description (USD) format for NVIDIA Isaac Sim, supporting per-model Level-of-Detail (LOD) switching with both high-poly and mid-poly variants.
Mechanisms are modular, realized as Isaac Sim classes:
- Physical: spring returns, magnetic seals, mechanical triggers, knob countdowns, safety locks
- Electronic: dynamic screen textures, touch sensing, illumination, indicator LEDs, rotary actuators
Each appliance exposes a state vector (e.g., power, temperature, timer, mode) managed by logic scripts (Python/C++) operating as finite state machines. These scripts coordinate mechanism invocations, periodic callbacks, and visual updates for realistic task execution.
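The state-vector and finite-state-machine pattern above can be sketched in a few lines. This is a minimal, hypothetical illustration (the class, state names, and method names are not the dataset's actual API), showing how mechanism invocations, periodic callbacks, and a state vector might fit together:

```python
from enum import Enum, auto

class OvenState(Enum):
    OFF = auto()
    IDLE = auto()
    HEATING = auto()

class OvenLogic:
    """Toy finite state machine in the style of an appliance logic script."""

    def __init__(self):
        self.state = OvenState.OFF
        self.temperature = 0   # target temperature (°C)
        self.timer = 0         # remaining seconds

    def press_power(self):
        # Toggle between OFF and IDLE.
        self.state = OvenState.IDLE if self.state is OvenState.OFF else OvenState.OFF

    def set_temperature(self, celsius: int):
        if self.state is not OvenState.OFF:
            self.temperature = celsius

    def start(self, seconds: int):
        if self.state is OvenState.IDLE and self.temperature > 0:
            self.timer = seconds
            self.state = OvenState.HEATING

    def tick(self, dt: int = 1):
        """Periodic callback: count down, return to IDLE when finished."""
        if self.state is OvenState.HEATING:
            self.timer = max(0, self.timer - dt)
            if self.timer == 0:
                self.state = OvenState.IDLE

    def state_vector(self) -> dict:
        return {"power": self.state is not OvenState.OFF,
                "mode": self.state.name,
                "temperature": self.temperature,
                "timer": self.timer}
```

A real logic script would additionally trigger visual updates (screen textures, LEDs) from these transitions.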
3. Manual-Model Alignment Methodology
Alignment rigor is maintained by a systematic process:
- Manual Collection: Source user manuals (PDF) with component diagrams, step procedures, and dimensional drawings.
- Modeling Pipeline: Extract dimensions and photographic references to guide CAD modeling; all 3D nodes are named identically to manual part lists during assembly.
- Programmatic Linking: A JSON mapping schema per appliance (`mapping.json`) provides exact node-to-manual correspondence, e.g.:

```json
{
  "component_name": "Knob_Temperature",
  "manual_sections": ["Sec 2.1_Start-Up", "Fig 3.2_Control-Panel"],
  "node_path": "/root/Body/Panel/Knob_Temp"
}
```

This design enables direct lookup from manual sections to 3D asset nodes. Alignment is considered exact by construction, with 100% of component nodes linked to manual terminology. An alignment-accuracy metric is defined, though no empirical value is reported.
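The section-to-node lookup enabled by `mapping.json` can be sketched as follows. This is an illustrative snippet (the helper name and the assumption that the file holds a list of records like the example are ours, not the dataset's documented API):

```python
import json

def build_section_index(records):
    """Invert mapping records so manual sections resolve to 3D node paths."""
    index = {}
    for rec in records:
        for section in rec["manual_sections"]:
            index.setdefault(section, []).append(rec["node_path"])
    return index

# A single record in the shape of the example above.
mapping = json.loads("""[
  {"component_name": "Knob_Temperature",
   "manual_sections": ["Sec 2.1_Start-Up", "Fig 3.2_Control-Panel"],
   "node_path": "/root/Body/Panel/Knob_Temp"}
]""")

index = build_section_index(mapping)
print(index["Fig 3.2_Control-Panel"])  # ['/root/Body/Panel/Knob_Temp']
```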
4. Dataset Statistics, Annotation, and Organization
Key statistics:
- Appliances: 100 across 14 categories
- Operable Components: 589 total (≈ 5.9/appliance)
- Manipulation Tasks: 979 (≈ 9.8/appliance)
- Disturbance Steps: 941 (for closed-loop evaluation)
- Manual Average Length: 766.2 words
- Average Plan Length: 7.57 steps
Bounding-box part-grounding labels (COCO-style JSON) enable spatial referencing for all 589 operable components. The collection is primarily designed for zero-shot evaluation; no explicit train/val/test split is provided.
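Since the grounding labels use COCO-style `[x, y, w, h]` boxes, the IoU metric used later in the benchmark reduces to a few lines. A minimal sketch (the function name is ours):

```python
def iou_xywh(a, b):
    """IoU of two COCO-style [x, y, w, h] bounding boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    # Intersection width/height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# A prediction shifted half a box-width off the ground truth:
print(iou_xywh([0, 0, 10, 10], [5, 0, 10, 10]))  # 0.333...
```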
Filesystem layout:
```
RealAppliance/
├── 001_Oven/
│   ├── model.usd
│   ├── textures/
│   │   ├── panel_4k.png
│   │   └── body_4k.png
│   ├── manual.pdf
│   ├── mapping.json
│   └── program.py
├── 002_Toaster/
│   └── …
└── indices.json   # catalog of all 100 appliances
```
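Given this per-appliance bundling, a small sanity check that a bundle is complete can be written directly against the layout. This is a hypothetical helper (not shipped with the dataset), assuming only the file names shown in the tree above:

```python
from pathlib import Path

# Required top-level files per appliance, per the layout above.
REQUIRED = {"model.usd", "manual.pdf", "mapping.json", "program.py"}

def check_bundle(appliance_dir: Path) -> list:
    """Return the sorted list of required files missing from a bundle."""
    present = {p.name for p in appliance_dir.iterdir()} if appliance_dir.is_dir() else set()
    return sorted(REQUIRED - present)
```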
5. RealAppliance-Bench: Tasks and Baseline Results
RealAppliance-Bench evaluates agent capabilities on four primary tasks:
| Task Number | Task Name | Input/Output Synopsis | Key Metrics |
|---|---|---|---|
| 1 | Manual Page Retrieval | Manual + query → page numbers | Precision, Recall |
| 2 | Open-Loop Manipulation Planning | Instruction+pages+image → atomic action sequence | Task CR, Success Rate |
| 3 | Appliance Part Grounding | Image + part name → 2D bounding box | IoU, [email protected] |
| 4 | Closed-Loop Planning Adjustment | Plan, execution, observation → next corrective action | Stepwise Success Rate |
Task 1 employs precision (the fraction of retrieved pages that are relevant) and recall (the fraction of relevant pages that are retrieved). Task 2 requires atomic stepwise plan correctness; Task 3 uses IoU and [email protected]; Task 4 is assessed by the fraction of successful adjustment steps.
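The Task 1 retrieval metrics are standard set-based precision and recall over page numbers; a minimal sketch (function name is ours):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for manual page retrieval (Task 1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Two of three retrieved pages are relevant; two of three relevant pages found:
print(precision_recall([3, 4, 5], [4, 5, 6]))  # (0.666..., 0.666...)
```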
Baseline findings:
- Proprietary MLLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve ~87% recall/F₁ on Task 1, but drop to single-digit % success on Task 2.
- Task 3: top average IoU ≈ 12%, [email protected] ≈ 8.6%.
- Task 4: highest closed-loop stepwise SR ≈ 31% (Gemini 2.5 Flash).
- Embodied-planning baselines (Robobrain 2.0, ManualPlan, ApBot) underperform on document-related tasks but can match large models on certain physical manipulation subtasks.
- Full-process inference (a supplementary fifth, end-to-end evaluation): all models approach 0% success, reflecting significant error accumulation through the pipeline.
Reported challenges:
- MLLMs excel at manual understanding but fall short in fine-grained grounding and long-horizon planning.
- Spatial reasoning (part grounding) is a major performance bottleneck, with most IoUs falling in [0, 0.05].
- Robust closed-loop feedback adaptation remains unsolved.
6. Intended Applications and Distribution
RealAppliance is positioned as a comprehensive testbed for:
- Zero- and few-shot assessment of multimodal LLMs requiring document, vision, and action-planning integration.
- Training and benchmarking embodied agents in a photo-realistic, physically accurate simulator with direct manual alignment.
- Generation of demonstration data for low-level policy learning via scripted expert executions.
All assets, scripts, and benchmarks are publicly released at https://realappliance.github.io/ under a non-restrictive, MIT-style academic license. This structure facilitates broad reuse across robotics, embodied AI, and multimodal LLM research (Gao et al., 29 Nov 2025).