Functional Manipulation Benchmark (FMB)
- FMB is a standardized, extensible framework for evaluating robotic manipulation systems through contact-rich, compositional tasks using procedurally generated, 3D-printed objects.
- It emphasizes reproducibility, generalization across geometric variations, and scalability by integrating multi-stage pipelines that assess functional grasping, compliant insertion, and multi-part assembly.
- The benchmark employs rigorous evaluation metrics—including success rate, completion time, and compliance control—and offers open-sourced CAD files, a ROS controller stack, and simulation adapters to support broad research applications.
The Functional Manipulation Benchmark (FMB) is a standardized, extensible framework for evaluating robotic manipulation systems on contact-rich, generalizable, and hierarchically compositional tasks. FMB defines a suite of multi-stage, high-precision manipulation challenges using procedurally generated 3D-printed objects, supporting rigorous assessment of skills such as functional grasping, object reorientation, compliant insertion, and multi-part assembly in settings that demand robustness to perceptual uncertainty and physical tolerance (Luo et al., 2024, Burns et al., 2024). The benchmark emphasizes reproducibility, broad object variability, and scalability, enabling comprehensive evaluation of perception, planning, and control pipelines in both learning-based and classical robotics approaches.
1. Design Principles and Objectives
FMB addresses limitations in both trivial pick-and-place tests (which lack functional and contact complexity) and overly specialized industrial insertion tasks (which have limited generality). Its primary objectives are:
- Contact-Rich Interaction: Task structures demand reasoning over force and torque, with emphasis on compliant control strategies.
- Generalization Across Geometry: Procedural object generation introduces systematic variation in shape, size, and appearance, allowing controlled studies of model robustness to novel configurations.
- Reproducibility and Accessibility: All hardware elements (CAD files, robot configuration, fixtures) and software components (data collection, evaluation scripts) are open-sourced, targeting common research platforms (e.g., Franka Panda, RealSense D405).
- Compositionality: FMB comprises not only isolated manipulation primitives but multi-stage pipelines (grasp–reorient–insert–assemble), measuring both atomic skills and their functional integration in long-horizon sequences.
2. Task Suite and Procedural Object Generation
FMB’s task suite comprises single-object multi-stage tasks and multi-object assembly scenarios, leveraging a family of parametric 3D-printed parts:
- Object Set: Nine base shape families (e.g., rectangle, star, hexagon), each with six size variants (controlled via a shape parameter) and eight color options, yielding 54 base parts plus multi-piece interlocking boards.
- Task Primitives: Fundamental skills include functional grasping (selection of grasp pose according to downstream utility), repositioning/regrasp using external fixtures, and compliant insertion into precisely dimensioned slots.
- Assembly Tasks: Boards featuring 4–5 interlocking slots require ordered insertion of shape-mated parts—a controlled testbed for coordination and error compounding in hierarchical policies.
- Procedural Generation: Object placements, board positions, and part orientations are randomized according to bounded uniform distributions (e.g., part positions over centimeter-scale grids and board yaw over a bounded range) (Luo et al., 2024).
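The object set and the randomization step can be sketched as follows. This is a minimal illustration, not the benchmark's generator: the sampling bounds, the six placeholder shape names beyond rectangle, star, and hexagon, and the `Placement` and `sample_scene` names are all assumptions made for the example.

```python
import itertools
import math
import random
from dataclasses import dataclass

# The text names rectangle, star, and hexagon; the other six names are placeholders.
SHAPES = ["rectangle", "star", "hexagon"] + [f"shape_{i}" for i in range(4, 10)]
SIZES = range(1, 7)  # six size variants per shape family

# Assumed sampling bounds; the benchmark's actual ranges are not reproduced here.
X_RANGE_CM = (-10.0, 10.0)
Y_RANGE_CM = (-10.0, 10.0)
BOARD_YAW_RANGE_RAD = (-math.pi / 4, math.pi / 4)


@dataclass(frozen=True)
class Placement:
    shape: str
    size: int
    x_cm: float
    y_cm: float
    yaw_rad: float


def base_parts():
    """Enumerate every (shape, size) pair: 9 families x 6 sizes = 54 parts."""
    return list(itertools.product(SHAPES, SIZES))


def sample_scene(rng, n_parts=4):
    """Draw distinct parts and place each with bounded uniform position and yaw."""
    parts = rng.sample(base_parts(), n_parts)
    placements = [
        Placement(
            shape=shape,
            size=size,
            x_cm=rng.uniform(*X_RANGE_CM),
            y_cm=rng.uniform(*Y_RANGE_CM),
            yaw_rad=rng.uniform(-math.pi, math.pi),
        )
        for shape, size in parts
    ]
    board_yaw_rad = rng.uniform(*BOARD_YAW_RANGE_RAD)
    return placements, board_yaw_rad
```

Passing a seeded `random.Random` instance keeps a sampled scene reproducible across runs, which matters when the same randomized layouts must be replayed for several policies.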
3. Hardware, Perceptual Pipeline, and Data Collection
FMB’s hardware and data infrastructure is specified for rigorous comparison:
- Robot Platform: Experiments are designed for the Franka Emika Panda arm (7 DoF), equipped with a parallel-jaw gripper.
- Sensing: Four RealSense D405 (RGB+D) cameras, two wrist-mounted and two stationary, supply multi-view perception; the robot’s built-in torque estimation provides force/torque feedback.
- Human Demonstrations: Teleoperation (6D twist via SpaceMouse) collects approximately 22,550 segmented trajectories, temporally aligned with primitive labels for skill learning.
- Recording Modalities: Each timestep includes multi-view RGB/D, end-effector pose, velocity, force/torque, and binary gripper state (Luo et al., 2024).
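One way to picture a recorded timestep is as a small typed record. The sketch below is an assumption about layout, not the dataset's actual schema: the field names, the camera keys, and the 7-element pose convention (position plus quaternion) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence

# Hypothetical camera keys for the two wrist-mounted and two stationary cameras.
CAMERAS = ("wrist_1", "wrist_2", "side_1", "side_2")


@dataclass
class Timestep:
    rgb: Dict[str, Any]            # camera name -> RGB frame
    depth: Dict[str, Any]          # camera name -> depth frame
    ee_pose: Sequence[float]       # assumed layout: xyz position + quaternion (7 values)
    ee_velocity: Sequence[float]   # 6D twist: linear + angular velocity
    wrench: Sequence[float]        # 6D force/torque estimate
    gripper_closed: bool           # binary gripper state
    primitive: str                 # primitive label aligned to this timestep

    def validate(self):
        """Raise ValueError if any modality is missing or has the wrong length."""
        for name in CAMERAS:
            if name not in self.rgb or name not in self.depth:
                raise ValueError(f"missing frames for camera {name!r}")
        for field, expected in (("ee_pose", 7), ("ee_velocity", 6), ("wrench", 6)):
            if len(getattr(self, field)) != expected:
                raise ValueError(f"{field} must have {expected} values")
```

Validating each record at load time surfaces dropped camera frames or truncated vectors before they reach a training loop.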
4. Evaluation Metrics and Protocols
FMB uses explicit metrics and standardized empirical procedures:
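- Success Rate: For policy $\pi$ on task $T$, $\mathrm{SR}(\pi, T) = \frac{1}{N}\sum_{i=1}^{N} s_i$, with $s_i \in \{0, 1\}$ indicating the outcome of trial $i$.
- Generalization Gap: $\Delta_{\mathrm{gen}} = \mathrm{SR}_{\mathrm{train}} - \mathrm{SR}_{\mathrm{held\text{-}out}}$, contrasting performance on trained-vs-held-out objects (by shape or size); uniform sampling enables controlled generalization studies (Luo et al., 2024).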
- Completion Time: Mean completion over successful trials, enabling comparison of temporal efficiency.
- Success Tolerances: For insertion, success requires the translation error (in mm) and the orientation error to fall within fixed tolerances before a timeout; separate force thresholds in newtons (e.g., for contact detection, misalignment, and insertion push) parameterize the compliant control APIs (Burns et al., 2024).
- Multi-Stage/Holistic Evaluation: Pipelines integrate multiple primitives; success is only logged if all stages succeed, benchmarking compounding error and closed-loop adaptability.
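The metrics in this list reduce to a few lines of bookkeeping. The sketch below assumes a hypothetical `Trial` record with per-stage outcomes and a duration, and it takes the generalization gap as the success rate on trained objects minus the rate on held-out objects; neither detail is taken from the benchmark's evaluation scripts.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Trial:
    stage_success: Sequence[bool]  # one flag per stage of the pipeline
    duration_s: float              # wall-clock time of the trial

    @property
    def success(self) -> bool:
        # Holistic criterion: a trial counts only if every stage succeeded.
        return all(self.stage_success)


def success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of trials whose stages all succeeded."""
    return sum(t.success for t in trials) / len(trials)


def generalization_gap(train: Sequence[Trial], held_out: Sequence[Trial]) -> float:
    """Success rate on trained objects minus success rate on held-out objects."""
    return success_rate(train) - success_rate(held_out)


def mean_completion_time(trials: Sequence[Trial]) -> Optional[float]:
    """Mean duration over successful trials only; None if nothing succeeded."""
    times = [t.duration_s for t in trials if t.success]
    return mean(times) if times else None
```

Returning `None` when no trial succeeds avoids reporting a completion time of zero for a policy that never finishes the task.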
Table. FMB Peg-In-Hole Subtasks and Baseline Success Rates (10 Trials Each) (Burns et al., 2024)
| Shape | Yaw Sampling | GenCHiP (0-shot) | Scripted | Point-to-Point |
|---|---|---|---|---|
| Circular peg | N/A (cylindrical symmetry) | 100% | 100% | 70% |
| Star peg | Uniform | 80% | 10% | 0% |
| Half-pipe peg | (two-fold) | 50% | 0% | 0% |
5. Sensing, Perception, and Compliance Control
- Pose Estimation: Wrist-mounted RGB and depth imaging, Fast R-CNN with FPN backbone, and 6D keypoint regression yield object pose estimates with up to 4 mm residual error.
- Adaptive Control: A Cartesian admittance interface exposes compliance, parameterized via a per-axis stiffness vector. The control law maps the pose error and the difference between the sensed and target wrench to commanded motion through diagonal gains derived from that stiffness vector. Conditional logic combines pose and force predicates for sub-motion terminations (e.g., "stop on contact," "stop if inserted") (Burns et al., 2024).
- Perception-Action Loop: Policies can invoke pose estimation before each primitive, capturing runtime pose errors.
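To make the compliance interface concrete, the sketch below shows one common first-order admittance form with diagonal gains, plus two termination predicates. It is an illustrative assumption rather than the controller used by the benchmark: the specific law (per-axis velocity equal to stiffness times pose error plus wrench error, divided by damping), the function names, and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def admittance_velocity(
    pose_error: Sequence[float],     # target pose minus current pose, per axis
    wrench_sensed: Sequence[float],  # measured force/torque, per axis
    wrench_target: Sequence[float],  # desired force/torque, per axis
    stiffness: Sequence[float],      # diagonal stiffness gains
    damping: Sequence[float],        # diagonal damping gains (must be > 0)
) -> List[float]:
    """Per-axis velocity command: v = (k * e + (f_sensed - f_target)) / d."""
    return [
        (k * e + (fs - ft)) / d
        for e, fs, ft, k, d in zip(
            pose_error, wrench_sensed, wrench_target, stiffness, damping
        )
    ]


def stop_on_contact(wrench: Sequence[float], threshold_n: float) -> bool:
    """True once the force magnitude (first three components) reaches the threshold."""
    fx, fy, fz = wrench[:3]
    return math.sqrt(fx * fx + fy * fy + fz * fz) >= threshold_n


def stop_if_inserted(z: float, z_goal: float, tolerance: float) -> bool:
    """True once the tool height is at or below the goal depth, within tolerance."""
    return z <= z_goal + tolerance
```

With zero wrench error the command reduces to a proportional pull toward the target pose, and lowering the stiffness on one axis lets measured contact forces dominate the motion along that axis.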
6. Policy Architectures and Learning Baselines
FMB provides benchmarks not only for classical models but also for deep learning systems:
- Baselines: Behavior cloning via ResNet-34 or Transformer architectures, leveraging both vision and force/torque as input modalities. Policies are conditioned on task primitive and object identity (object ID).
- Reported Results: In insertion, force/torque input is crucial (RGB+D+τ: 11/25 vs. RGB-only: 2/25). Conditioning policies by object ID improves shape-wise generalization (e.g., Transformer: 27/45 vs. ResNet: 14/45) (Luo et al., 2024).
- Hierarchical vs. Flat Policies: End-to-end ("flat") policies fail consistently on multi-stage tasks (0/10), while hierarchical pipelines—where primitives are composed based on labeled transitions—exhibit superior robustness (up to 19/30 on single-object, 7/10 on multi-object tasks).
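The hierarchical composition described above can be sketched as a sequential executor over named primitives, alongside a back-of-the-envelope view of how errors compound. Both pieces are illustrative assumptions: the `run_hierarchical` and `chained_success` names are hypothetical, and the product formula holds only if stage outcomes are treated as independent.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A primitive is any callable that acts on shared state and reports success.
Primitive = Callable[[dict], bool]


def run_hierarchical(
    plan: Sequence[str], primitives: Dict[str, Primitive], state: dict
) -> Tuple[bool, List[str]]:
    """Run primitives in order; abort at the first failure and return a log."""
    log: List[str] = []
    for name in plan:
        ok = primitives[name](state)
        log.append(f"{name}:{'ok' if ok else 'fail'}")
        if not ok:
            return False, log
    return True, log


def chained_success(stage_rates: Sequence[float]) -> float:
    """Pipeline success probability if stages succeed independently."""
    return math.prod(stage_rates)
```

Under the independence assumption, four stages that each succeed 90% of the time give `chained_success([0.9] * 4)`, roughly 0.66, which is why long-horizon pipelines are sensitive to small per-stage losses.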
7. Reproducibility and Extension
All FMB resources are open-sourced:
- CAD and Assembly: Downloadable parts for all objects and boards, along with assembly and calibration instructions.
- ROS Controller Stack: Modular codebase supports demonstration capture, primitive segmentation, low-level impedance control, and evaluation protocols.
- Evaluation and Leaderboard: Scripts for generating success metrics, generalization gaps, and summary tables. Leaderboards can be generated from submitted logs for community-wide comparison.
- Simulation Adapters: Wrappers are available for MuJoCo, Drake, Bullet, and other simulation environments, ensuring alignment between real and simulated protocols (Luo et al., 2024, Cruciani et al., 2020).
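As an illustration of how a leaderboard might be built from submitted logs, the sketch below aggregates JSON-lines records into a markdown table. The log schema (`method` and `success` fields) and the `leaderboard` function are assumptions for the example and do not describe the released scripts.

```python
import json
from collections import defaultdict
from typing import Iterable


def leaderboard(log_lines: Iterable[str]) -> str:
    """Aggregate per-trial JSON records into a table sorted by success rate."""
    counts = defaultdict(lambda: [0, 0])  # method -> [successes, trials]
    for line in log_lines:
        record = json.loads(line)
        entry = counts[record["method"]]
        entry[0] += int(record["success"])
        entry[1] += 1

    # Highest success rate first.
    rows = sorted(counts.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)

    out = ["| Method | Successes | Trials | Success rate |", "|---|---|---|---|"]
    for method, (successes, trials) in rows:
        out.append(f"| {method} | {successes} | {trials} | {successes / trials:.0%} |")
    return "\n".join(out)
```

Keeping raw success and trial counts next to the percentage makes it clear how many trials back each reported rate.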
FMB’s extensible software and data structures support rapid expansion to new tasks, hardware classes, perception pipelines, and manipulation modalities, preserving its utility as a foundation for comparative evaluation and ablation of robotic manipulation policies (Luo et al., 2024, Cruciani et al., 2020, Burns et al., 2024).