
MuBlE/SHOP-VRB2: Integrated Robot Simulation

Updated 20 February 2026
  • MuBlE/SHOP-VRB2 is an integrated simulation environment enabling long-horizon robot manipulation by uniting visual, language, and physical reasoning.
  • It employs a modular design leveraging MuJoCo for physics and Blender for photorealistic rendering, ensuring synchronized multimodal observations.
  • The SHOP-VRB2 benchmark offers 12,000 procedurally generated scenes with multi-step reasoning tasks to evaluate closed-loop embodied planning.

MuBlE/SHOP-VRB2 encompasses an integrated simulation environment and benchmark designed to advance long-horizon robot manipulation research that requires combined visual, language, and physical reasoning. MuBlE is an open-source, modular platform built atop robosuite, utilizing the MuJoCo physics engine for physically accurate simulation and Blender as an off-line keyframe renderer for photorealistic, physically consistent image generation. This design targets closed-loop embodied reasoning agents that need to physically interact with the environment to acquire necessary information for complex tasks, such as sorting objects by latent attributes. The accompanying SHOP-VRB2 benchmark contains 12,000 procedurally generated tabletop scenes paired with ten classes of multi-step reasoning tasks, each requiring the agent to integrate perception, symbolic planning, and physical measurement (Nazarczuk et al., 4 Mar 2025).

1. Environment Design and Data Modalities

MuBlE integrates MuJoCo and Blender via a shared scene graph that synchronizes both visual (pose, geometry) and non-visual (weight, stiffness) object attributes. MuJoCo models rigid-body and contact dynamics while Blender, operating in an off-line rendering mode, generates high-resolution RGB images (configurable, e.g., 1024×768), depth maps, segmentation masks, and realistic photometric effects such as shadows and procedural materials.

At each keyframe, a multimodal observation tuple is produced:

  • Photorealistic RGB image and corresponding depth map;
  • Per-object segmentation masks;
  • Scene-graph state including pose, orientation, 3D bounding box, and current gripper contact flags for each object;
  • Robot-centric proprioceptive signals (end-effector pose, joint angles, velocities, gripper state);
  • Physical measurements accessible through primitives such as “weigh” (object mass) and “squeeze” (stiffness and elasticity).
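The per-keyframe observation tuple can be pictured as a simple container. The sketch below is a minimal, hypothetical structure mirroring the modalities listed above; the field names, image resolution, and the 14-dimensional proprioception vector are illustrative assumptions, not the actual MuBlE API.

```python
from dataclasses import dataclass, field

import numpy as np


# Hypothetical container for one keyframe observation; field names
# are illustrative only and do not reflect MuBlE's real interface.
@dataclass
class KeyframeObservation:
    rgb: np.ndarray             # H x W x 3 photorealistic Blender render
    depth: np.ndarray           # H x W depth map
    masks: dict                 # object id -> H x W boolean segmentation mask
    scene_graph: dict           # per-object pose, bbox, gripper contact flags
    proprioception: np.ndarray  # EE pose, joint angles/velocities, gripper state
    measurements: dict = field(default_factory=dict)  # results of physical primitives


obs = KeyframeObservation(
    rgb=np.zeros((768, 1024, 3), dtype=np.uint8),
    depth=np.zeros((768, 1024), dtype=np.float32),
    masks={},
    scene_graph={},
    proprioception=np.zeros(14),
)
obs.measurements["weigh"] = 0.45  # e.g., mass in kg returned by the "weigh" primitive
```

Non-visual attributes such as mass only appear in `measurements` after the corresponding primitive has been executed, which is what forces agents to interact rather than rely on vision alone.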

2. Nested Interaction Loops

MuBlE provides two hierarchically nested interaction protocols:

  • Visual–Action Loop ("Action Loop"): At this semantic level, an embodied reasoner observes rendered RGB images (optionally the scene graph and language instruction), outputs a symbolic primitive action (e.g., approach, close_gripper, weigh), and specifies target objects. The action planner computes a trajectory, and, upon completion, MuBlE renders the next keyframe.
  • Control–Physics Loop ("Physics Loop"): Operating at a user-specified high frequency, this loop receives low-level end-effector motion commands—position Δx, orientation Δθ, and gripper command—and advances the MuJoCo physics simulation, integrating the dynamics:

M(q)\,\ddot{q} + C(q, \dot{q})\,\dot{q} + g(q) = \tau + J^T \lambda

where M(q) is the inertia matrix, C(q, \dot{q}) the Coriolis/centrifugal matrix, g(q) the gravity vector, τ the joint torque vector, J the contact Jacobian, and λ the contact impulses. Sensor readouts (joint torques, contact forces, and non-visual object attributes) are reported at each timestep.
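To make the physics loop concrete, the sketch below integrates the manipulator equation for a single revolute joint at 1 kHz, with semi-implicit Euler in place of MuJoCo's full solver. All dynamics parameters are illustrative, and Coriolis and contact terms vanish for this one-joint case.

```python
import numpy as np

# Toy 1-DoF stand-in for the physics loop: M*qdd + g(q) = tau.
# Inertia, mass, link length, and gains are illustrative placeholders.
M, m, l, g0 = 0.1, 1.0, 0.5, 9.81
dt = 1e-3  # 1 kHz physics-loop timestep


def step(q, qd, tau):
    grav = m * g0 * l * np.cos(q)  # gravity torque g(q)
    qdd = (tau - grav) / M         # solve M*qdd = tau - g(q)
    qd = qd + dt * qdd             # semi-implicit Euler: velocity first,
    q = q + dt * qd                # then position with the new velocity
    return q, qd


q, qd = 0.0, 0.0
for _ in range(1000):  # one second of simulated time
    tau = m * g0 * l * np.cos(q)  # exact gravity compensation holds the joint still
    q, qd = step(q, qd, tau)
```

With exact gravity compensation the commanded torque cancels g(q) at every step, so the joint stays at rest; any mismatch between the two would accumulate into drift over the loop.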

3. Physics Modeling and Controller Architecture

MuJoCo implements smooth contact via nonlinear spring-damper models and friction:

  • Contact normal force: f_n = k_n d^n + c_n \dot{d}
  • Tangential friction: \|f_t\| \leq \mu f_n, with f_t = -k_t \Delta x_t - c_t \dot{x}_t, where d is the penetration depth, μ the friction coefficient, k_n, k_t the stiffness coefficients, and c_n, c_t the damping coefficients.
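A direct evaluation of this contact model can be sketched as follows; the gains, exponent, and friction coefficient are arbitrary placeholders rather than MuJoCo's actual solver defaults.

```python
import numpy as np


# Spring-damper contact sketch: normal force from penetration,
# tangential force clamped to the Coulomb friction cone.
# All gains below are illustrative, not MuJoCo defaults.
def contact_forces(d, d_dot, dx_t, v_t,
                   k_n=1e4, c_n=50.0, n=1.0,
                   k_t=5e3, c_t=20.0, mu=0.8):
    f_n = k_n * d**n + c_n * d_dot   # normal: f_n = k_n d^n + c_n d_dot
    f_t = -k_t * dx_t - c_t * v_t    # raw tangential spring-damper force
    limit = mu * f_n                 # friction-cone bound ||f_t|| <= mu f_n
    if abs(f_t) > limit:             # project back onto the cone (sliding)
        f_t = np.sign(f_t) * limit
    return f_n, f_t


# 1 mm penetration, 1 cm tangential displacement: the raw tangential
# force exceeds the cone, so it is clamped to mu * f_n.
f_n, f_t = contact_forces(d=1e-3, d_dot=0.0, dx_t=0.01, v_t=0.0)
```

The clamping step is what distinguishes sticking contacts (raw force inside the cone) from sliding ones (force saturated at μ f_n).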

Operational-space control governs motion, with the default controller mapping task-space motion commands into joint torques:

\tau = J^T F_{des} + \left(I - J^T (J J^T)^{-1} J\right) \tau_{null}

where F_{des} is a task-space PD-controlled wrench. This structure is extensible to custom controllers.
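The null-space projection in this mapping can be checked numerically. The sketch below assumes a 7-DoF arm with a 6-D task space and random stand-in values for the Jacobian, wrench, and posture torque; it verifies that the projected secondary torque produces no task-space effect.

```python
import numpy as np

# Illustrative dimensions: 7-DoF arm, 6-D task space. All values
# are random stand-ins, not outputs of a real controller.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 7))   # task Jacobian
F_des = rng.standard_normal(6)    # PD-controlled task-space wrench
tau_null = rng.standard_normal(7) # secondary (posture) joint torque

# Null-space projector from the formula above.
N = np.eye(7) - J.T @ np.linalg.inv(J @ J.T) @ J
tau = J.T @ F_des + N @ tau_null

# J @ N = J - J = 0, so the posture torque is invisible to the task.
task_leakage = np.abs(J @ N).max()
```

Since J N = J - J J^T (J J^T)^{-1} J = 0, the redundant degree of freedom can pursue a posture objective without perturbing the commanded end-effector wrench.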

4. SHOP-VRB2 Benchmark: Task Suite and Dataset Organization

SHOP-VRB2 defines ten classes of multi-step reasoning tasks, each scene paired with one natural-language instruction and requiring both visual and physical measurements. Tasks encompass single/multi-object weight measurement, selecting or moving objects by weight or region, performing stacking operations (including by visual relation or weight), and long-horizon sorting (e.g., order all objects from heaviest to lightest). Each task mandates close integration of perception (visual, shape, material), manipulation, and latent attribute inference.

Scenes are generated procedurally with 4–5 objects sampled from ten everyday categories, with materials randomized among plastic, metal, glass, rubber, and wood. Physical and visual diversity is ensured via collision-free mesh–mesh checks and controlled occlusion. For additional complexity, 30 real-object scenes built from nine YCB-Video models are included.
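The collision-free placement step can be sketched as a rejection-sampling loop. The category names, table bounds, and the circle-overlap test below are simplifications invented for illustration; the real generator uses mesh–mesh collision checks and the benchmark's own object vocabulary.

```python
import random

random.seed(0)

# Placeholder vocabularies; the benchmark's actual ten categories
# are not enumerated here.
CATEGORIES = ["mug", "bowl", "bottle", "box", "can",
              "plate", "jar", "cup", "pan", "scissors"]
MATERIALS = ["plastic", "metal", "glass", "rubber", "wood"]


def sample_scene(n_min=4, n_max=5, radius=0.06, max_tries=1000):
    """Place n objects on a 1.0 x 0.6 m table with no circle overlap."""
    n = random.randint(n_min, n_max)
    placed = []
    for _ in range(max_tries):
        x, y = random.uniform(0.0, 1.0), random.uniform(0.0, 0.6)
        # Accept only positions at least one diameter from every object.
        if all((x - px) ** 2 + (y - py) ** 2 > (2 * radius) ** 2
               for px, py, _, _ in placed):
            placed.append((x, y, random.choice(CATEGORIES),
                           random.choice(MATERIALS)))
        if len(placed) == n:
            return placed
    raise RuntimeError("could not place all objects collision-free")


scene = sample_scene()
```

Rejection sampling keeps the generator simple while guaranteeing the collision-free property; the real pipeline additionally controls occlusion from the camera viewpoint.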

For each instruction–scene pair, a symbolic program in CLEVR-IEP format is generated via backward reasoning, with primitive-action sequence lengths ranging from 5 to 46.
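A symbolic program of this kind can be pictured as a sequence of operations, each consuming the outputs of earlier ones. The decomposition below is purely hypothetical: the operation names do not come from the benchmark's actual CLEVR-IEP vocabulary and are written only to convey the shape of such a program.

```python
# Hypothetical CLEVR-IEP-style decomposition for an instruction like
# "move the heaviest object to the left region". Operation names are
# invented for illustration; each entry is (op, inputs).
program = [
    ("scene", []),                                  # enumerate visible objects
    ("filter_graspable", ["scene"]),                # keep manipulable candidates
    ("measure_weight_each", ["filter_graspable"]),  # run the "weigh" primitive per object
    ("argmax_weight", ["measure_weight_each"]),     # pick the heaviest
    ("move_to_region", ["argmax_weight", "left"]),  # final manipulation step
]

ops = [op for op, _ in program]
```

Backward reasoning generates such programs from the goal state toward the initial scene, which is why primitive-action sequences can stretch to 46 steps for the long-horizon sorting tasks.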

5. Evaluation Protocol and Baselines

Evaluation is structured around the success rate:

\text{SuccessRate} = \frac{1}{N} \sum_{i=1}^{N} S_i

where S_i \in \{0, 1\} denotes success on test episode i. Success is measured per task type and overall. Supplementary metrics include the average trajectory length \mathbb{E}[H] and the cumulative reward R_{cum} = \sum_{t=1}^{H} r_t for reinforcement learning settings.
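Computing these metrics from per-episode logs is straightforward; the episode records below are synthetic placeholders used only to show the arithmetic.

```python
# Synthetic per-episode logs: success flag S_i, trajectory length H,
# and the per-step reward sequence r_t.
episodes = [
    {"success": 1, "length": 12, "rewards": [0.0, 0.5, 1.0]},
    {"success": 0, "length": 46, "rewards": [0.0, 0.0]},
    {"success": 1, "length": 8,  "rewards": [1.0]},
]

# SuccessRate = (1/N) * sum_i S_i
success_rate = sum(e["success"] for e in episodes) / len(episodes)

# E[H]: average trajectory length across episodes.
avg_length = sum(e["length"] for e in episodes) / len(episodes)

# R_cum = sum_t r_t, computed per episode for RL settings.
cum_rewards = [sum(e["rewards"]) for e in episodes]
```

Because S_i is binary, the success rate is simply the fraction of solved episodes, reported per task type as well as overall.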

The standard baseline is CLIER (Closed-Loop Interactive Embodied Reasoning), a transformer-based planner using scene graphs, remaining subgoals, ResNet-extracted visual features, and embedded instructions to predict the next primitive action and its targets. Ground-truth demonstrations supply expert paths with full observability (scene graph, segmentation, depth, symbolic program decomposition).

Performance outcomes are as follows:

  • SHOP-VRB2 simulation overall: 43.9% success rate.
  • YCB benchmark: 76.7% (simulated), 64.4% (real).
  • High (>65%) success on single-object tasks, low (<35%) on multi-object and weight-sorting tasks.

Frequent failure modes include execution errors (14.4%), scene inconsistencies such as object ID swaps (12.6%), and action loops arising from small pose errors (10.8%).

6. Extensibility, API, and Practical Usage

MuBlE inherits the robosuite Python API. Primary usage involves:

```python
env = MuBlEEnv(config)
obs = env.reset(scene_spec, instruction)
for t in range(T):
    action = planner(obs.image, obs.scene_graph, subgoal)
    obs, reward, done, info = env.step(action)
```

Observations include images, depth, segmentation, physical attributes, and the scene graph. Adding new objects requires a MuJoCo XML model and corresponding Blender mesh/material; the procedural scene generator registers these for randomized trials. New tasks and instructions can be defined by extending instruction templates with decomposition rules, coupled with logic for backward validation. Sensors for new physical observables are constructed as robosuite extensions and mounted on robots or objects.

7. Research Scope and Significance

MuBlE establishes a physically realistic and photorealistic environment for long-horizon, multimodal task planning, addressing the gap in simulators that require integrated high-fidelity observations and closed-loop physical interaction. SHOP-VRB2 provides a challenging diagnostic testbed for embodied reasoning integrating vision, language, and manipulation. The modular and open-source design supports straightforward adaptation for sim-to-real transfer, new algorithmic baselines, and community-driven expansion of sensors and task classes (Nazarczuk et al., 4 Mar 2025).
