
3MDBench: Robotics & Multi-Domain Benchmark

Updated 22 February 2026
  • 3MDBench is a suite of research benchmarks spanning robotics, telemedicine, and 3D instruction with domain-specific evaluation protocols.
  • It standardizes evaluation through physics-based simulations, detailed scene statistics, and a modular trajectory generation pipeline.
  • By providing 30,000 demonstrations and baseline models, 3MDBench fosters advancements in embodied AI and multi-modal task learning.

3MDBench designates multiple distinct research benchmarks, each serving a specific domain in AI, robotics, medicine, or scientific simulation. This article surveys major “3MDBench” systems as referenced in leading literature: (1) a large-scale robotics benchmark for joint whole-body motion generation in mobile manipulation (“M3Bench”, also referred to as “3D Mobile-Manipulation Benchmark”), (2) a medical multimodal multi-agent dialogue benchmark for LVLM-driven telemedicine (“Medical 3MDBench”), and (3) a comprehensive instruction-following benchmark around multi-modal 3D scene prompts (“M3DBench”). Each variant sets specialized requirements, methodologies, and evaluation protocols.

1. Robotics: 3D Mobile-Manipulation Benchmark (M3Bench/3MDBench)

3D Mobile-Manipulation Benchmark—or “M3Bench” (hereafter, Editor’s term: 3MDBench (Robotics))—is a large-scale platform for rigorous evaluation of whole-body motion generation by mobile manipulators in 3D household scenes (Zhang et al., 2024). “Whole-body motion” here denotes joint trajectories for both the mobile base and the manipulator arm (including the end-effector), under kinematic, environmental, and task constraints.

Core tasks require an agent, given a partial 3D scan (point cloud), a target-object mask, and an abstract task instruction (“Pick that salt shaker”), to plan and execute continuous, coordinated base–arm motions for navigation, reaching into occluded regions, stable grasping or placing, and persistent collision avoidance.
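The episode input described above can be sketched as a simple container. The field names and NumPy-based layout below are illustrative assumptions, not the benchmark's actual data schema:

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical sketch of a single 3MDBench (Robotics) episode input.
# Field names are assumptions; the real dataset schema may differ.
@dataclass
class EpisodeInput:
    point_cloud: np.ndarray   # (N, 3) partial 3D scan of the scene
    target_mask: np.ndarray   # (N,) boolean mask selecting the target object
    instruction: str          # abstract task instruction
    action_type: str          # "pick" or "place"

    def target_points(self) -> np.ndarray:
        """Return only the scan points belonging to the target object."""
        return self.point_cloud[self.target_mask]

# Toy example: a 4-point scan in which two points belong to the target
scan = np.array([[0.0, 0.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [1.0, 1.0, 0.5],
                 [0.0, 1.0, 0.5]])
mask = np.array([False, True, True, False])
ep = EpisodeInput(scan, mask, "Pick that salt shaker", "pick")
print(ep.target_points().shape)  # (2, 3)
```

A policy would consume such an input and emit the continuous base–arm trajectory described above.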

Key objectives:

  • Standardized, physics-grounded evaluation of collision-free, kinematically feasible joint-space trajectories.
  • Testing of generalization: new objects, new scenes, and unseen object–scene combinations.
  • Support for multi-modal embodied-AI research via rich metadata: URDF annotations, natural language instructions, panoptic maps, egocentric videos.

2. Dataset, Coverage, and Scene Statistics

3MDBench (Robotics) comprises:

  • 119 photo-realistic, physics-enabled household scans (from PhyScene), diversifying kitchen, living room, bedroom, and bathroom layouts.
  • 32 object categories (e.g., cups, bottles, books), 588 distinct instances.
  • 30,000 object rearrangement demonstrations: ≈20,000 pick, ≈10,000 place, each with 30 waypoints.
  • Data splits:
    • Base: Train (75%), Validation (5%), Test (20%)
    • Novel-Object: Unseen objects in observed scenes
    • Novel-Scene: Seen objects in held-out scenes
    • Novel-Scenario: Unseen object–scene pairs

Splits allow component-wise assessment of object- and scene-level generalization.
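The split logic above reduces to checking whether an episode's object and scene were observed during Base training. A minimal sketch, in which the function name and set-based bookkeeping are assumptions rather than the benchmark's actual tooling:

```python
# Illustrative assignment of an episode to one of the four splits,
# based on which objects and scenes appear in the Base training set.
def assign_split(obj_id: str, scene_id: str,
                 train_objects: set, train_scenes: set) -> str:
    obj_seen = obj_id in train_objects
    scene_seen = scene_id in train_scenes
    if obj_seen and scene_seen:
        return "base"            # both components observed during training
    if not obj_seen and scene_seen:
        return "novel-object"    # unseen object in an observed scene
    if obj_seen and not scene_seen:
        return "novel-scene"     # seen object in a held-out scene
    return "novel-scenario"      # unseen object-scene pair

# Hypothetical IDs for illustration
train_objects = {"cup_01", "bottle_02"}
train_scenes = {"kitchen_03"}
print(assign_split("cup_01", "kitchen_03", train_objects, train_scenes))   # base
print(assign_split("book_07", "bedroom_11", train_objects, train_scenes))  # novel-scenario
```

This factoring is what makes the object- and scene-level generalization assessment component-wise.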

3. M3BenchMaker: Automatic Trajectory Generation Pipeline

The M3BenchMaker toolkit orchestrates data synthesis through four modular stages:

  1. Task Builder: Instantiates configuration for each manipulation episode, given scene/robot URDFs, target object, initial pose, and action type (pick/place).
  2. Conditional Scene Sampler: Samples plausible object/robot poses by solving constraints on supporting surfaces. The supporting plane \pi_s maximizing \mathrm{Area}(U_s \cap \mathrm{proj}_{o,s}(U_o)) / \mathrm{Area}(U_o) is selected, subject to mean surface-proximity and angular-alignment thresholds.
  3. Goal Configuration Generator: Proposes a set of candidate 6-DOF end-effector poses using an SE(3) energy-based model. Adaptive KD-tree–based sampling iteratively computes “feasibility scores” to locate a pose with a feasible collision-free whole-body plan.
  4. VKC Problem Generator: Treats the robot system as a Virtual Kinematic Chain, optimizing the continuous trajectory q(\cdot) over the time interval [0, T]:

\min J(q(\cdot)) = \int_0^T \left[ w_1 \|\dot{q}(t)\|^2 + w_2 \|p_{ee}(q(t)) - p_{goal}\|^2 \right] dt + w_3 \|q(T) - q_{goal}\|^2

subject to joint limits, collision avoidance (\mathrm{dist}(S_{robot}(q(t)), S_{scene}) \ge \epsilon), and end-effector goal conditions. Sequential convex optimization (TrajOpt-style) yields the final, smooth joint-space waypoints.
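The objective J(q(\cdot)) above can be approximated over discrete waypoints. In this sketch, forward_kinematics stands in for the robot's actual end-effector FK, the time step dt replaces the integral, and all names are illustrative assumptions:

```python
import numpy as np

# Discretized sketch of the VKC trajectory objective:
#   J = ∫ [w1 ||q̇||² + w2 ||p_ee(q) - p_goal||²] dt + w3 ||q(T) - q_goal||²
def trajectory_cost(q, p_goal, q_goal, forward_kinematics,
                    w1=1.0, w2=1.0, w3=1.0, dt=0.1):
    q = np.asarray(q, dtype=float)            # (T, dof) waypoints
    qdot = np.diff(q, axis=0) / dt            # finite-difference velocities
    smoothness = w1 * np.sum(qdot ** 2) * dt  # smoothness term
    p_ee = np.array([forward_kinematics(qt) for qt in q])
    tracking = w2 * np.sum(np.sum((p_ee - p_goal) ** 2, axis=1)) * dt
    terminal = w3 * np.sum((q[-1] - q_goal) ** 2)  # terminal goal term
    return smoothness + tracking + terminal

# Toy 2-DOF "robot" whose end-effector position equals its joint vector
fk = lambda qt: qt
q = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
cost = trajectory_cost(q, p_goal=np.array([1.0, 1.0]),
                       q_goal=np.array([1.0, 1.0]), forward_kinematics=fk)
print(round(cost, 3))  # 10.25
```

An actual solver would minimize this cost subject to the collision and joint-limit constraints rather than merely evaluating it.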

4. Evaluation Metrics and Protocols

All evaluation is conducted in Isaac Sim, which simulates realistic contacts and friction. The following metrics are used:

  • Task Success Rate: The primary metric; requires a successful pick/place and stable holding of the goal pose for ≥2 seconds.
  • Auxiliary metrics:
    • Distance to goal: \min_t \|p_{ee}(q(t)) - p_{goal}\|
    • Collision rates: percentage of waypoints with robot–environment or self-collision.
    • Joint-limit violation: fraction of waypoints with q(t) outside [q_{min}, q_{max}].
    • Solve time: average planning/forward-pass compute time per trajectory.
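The auxiliary metrics above can be sketched over a planned waypoint sequence. Here collision_flags and the joint limits are assumed inputs; the real protocol would query the Isaac Sim scene for contacts:

```python
import numpy as np

# Illustrative computation of the Section 4 auxiliary metrics.
def auxiliary_metrics(q, p_ee, p_goal, q_min, q_max, collision_flags):
    q = np.asarray(q, dtype=float)
    # min_t ||p_ee(q(t)) - p_goal||
    dist_to_goal = np.min(np.linalg.norm(p_ee - p_goal, axis=1))
    # fraction of waypoints flagged as colliding
    collision_rate = float(np.mean(collision_flags))
    # fraction of waypoints with any joint outside [q_min, q_max]
    violations = np.any((q < q_min) | (q > q_max), axis=1)
    joint_limit_rate = float(np.mean(violations))
    return dist_to_goal, collision_rate, joint_limit_rate

# Toy trajectory: the last waypoint exceeds the upper joint limit
q = np.array([[0.0, 0.2], [0.4, 0.6], [1.2, 0.8]])
p_ee = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.9, 0.9, 0.0]])
p_goal = np.array([1.0, 1.0, 0.0])
d, c, j = auxiliary_metrics(q, p_ee, p_goal,
                            q_min=np.array([-1.0, -1.0]),
                            q_max=np.array([1.0, 1.0]),
                            collision_flags=[False, False, True])
```

Task success itself is judged in simulation (stable goal-pose holding), so it has no closed-form counterpart here.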

5. Baseline Models and Performance Gaps

Three baseline approaches are systematically evaluated:

  • M3Bench-Planner ("modmp"): Hybrid of VKC motion planning, SE(3) grasp prediction, and placement heuristics. Achieves ≈20% pick success and ≈2.8% place success (Base Test), degrades modestly on the Novel splits, and incurs zero joint-limit or self-collision violations, but is slow (20–30 s per plan).
  • Motion Policy Net (MPNet↑): Learning-based extension of MPNet with SDF collision losses. Yields ≈0.07% pick and ≈0.8% place success, with high collision and joint-limit violation rates, but fast inference (<1 s).
  • Mobile-Skill Transformer (MPTF): Decision transformer with a PointNet++ scan encoder; nearly 0% pick success, 0.15–0.25% place success, and high environmental collision rates.
  • Failure analysis demonstrates critical inability of learning systems to respect geometric constraints or coordinate base-arm motions. Even advanced hybrid planners exhibit low success and high computational latency, with sensitivity to grasp/placement proposal quality.

6. Open Challenges and Research Directions

The current state of the art reveals significant unsolved problems:

  • Bridging sim-to-real gaps, including sensor noise and actuation uncertainty.
  • Transitioning from discrete pick/place to temporally-extended, multi-object rearrangement.
  • Richer natural language instruction understanding/generation.
  • Data-efficient, tightly scene-aware models integrating perception, affordance reasoning, and continuous control.
  • Dynamic manipulation and soft-contact modeling for heavy/articulated objects.
  • Expanding to diverse robot morphologies and multi-robot scenarios.

By releasing 30,000 demonstrations, full scene annotations, and the M3BenchMaker pipeline, 3MDBench (Robotics) establishes a critical platform for benchmarking and advancing whole-body mobile manipulation in complex, real-world 3D environments (Zhang et al., 2024).


For unrelated instances of “3MDBench” in the literature, including Medical 3MDBench (Sviridov et al., 26 Mar 2025) and M3DBench for multimodal 3D instruction following (Li et al., 2023), see the corresponding literature for domain-specific details and protocols. This entry details the robotics benchmark as per community usage and the cited primary source.
