LabelFusion Pipeline for Robotic Vision
- LabelFusion is a robotic vision pipeline that generates ground-truth per-pixel segmentation masks and precise 6 DOF pose annotations from real RGB-D videos of cluttered scenes.
- It integrates dense 3D reconstruction, efficient human-assisted ICP-based mesh alignment, and automated 2D label reprojection to create massive annotated datasets.
- Empirical results show that training on cluttered scenes captured from diverse viewpoints dramatically improves segmentation accuracy, yielding concrete guidance for data collection strategy.
LabelFusion is a robotic vision pipeline for generating ground-truth, per-pixel object segmentation masks and 6 DOF pose annotations from real RGB-D video of cluttered scenes. It addresses the critical bottleneck of deep robotic perception: scalable, high-quality training data specific to manipulation tasks, which existing public datasets rarely provide. LabelFusion achieves rapid, high-throughput annotation by combining real RGB-D video acquisition, dense 3D reconstruction, efficient human-assisted object mesh alignment via ICP, and massive-scale 2D label generation through mesh reprojection. The pipeline enables the systematic study of segmentation network performance as a function of dataset structure, quantity, and diversity (Marion et al., 2017).
1. Pipeline Architecture and Workflow
LabelFusion comprises four primary stages:
- RGB-D Video Capture: Scenes are captured at 30 Hz VGA RGB-D using sensors such as the Asus Xtion Pro, either hand-held (freeform) or robot-arm–mounted (scripted trajectories), with no hand–eye calibration required. Each 120 s recording yields approximately 3,600 frames.
- Dense 3D Reconstruction: ElasticFusion, a real-time dense RGB-D SLAM system, produces a surfel-based point cloud P in a canonical reconstruction frame R. It computes per-frame camera poses T_R←C_t and fuses frames without needing fiducial markers. Robustness to textureless surfaces is maintained via adaptive viewpoint trajectories.
- Human-Assisted ICP-based Mesh Alignment: For each object j, a known CAD mesh M_j is aligned to the reconstruction using a 3-click initialization (three point correspondences between scene and mesh, solved in closed form via SVD) followed by point-to-point ICP refinement on a cropped region of the cloud. This yields a precise 6 DOF object pose T_R←O_j. The operation is highly efficient: typically 30 s per object per scene.
- Reprojection to 2D Frames: For each frame t and object j, the camera-frame pose T_C_t←O_j = inverse(T_R←C_t) · T_R←O_j is computed, the mesh is projected into the frame using standard pinhole intrinsics, and a z-buffer assigns per-pixel object labels and depth. This automated rendering produces hundreds of thousands of labeled RGB-D images per dataset, each with full segmentation masks and accurate object pose annotations.
Pseudocode Summary
The overall process is encapsulated in the pseudocode below:
```
for each scene log:
    # Dense reconstruction: point cloud P and per-frame camera poses T_R←C_t
    P, {T_R←C_t} = ElasticFusion(rgbd_frames)
    for each object j:
        load M_j
        # 3-click initialization: corresponding scene/mesh points
        s1, s2, s3 = user clicks on P
        m1, m2, m3 = user clicks on M_j
        T_init = landmark_align({m_k}, {s_k})          # closed-form SVD fit
        # Crop the cloud near the initial guess, then refine
        P_crop = {p ∈ P | dist(p, T_init(M_j)) < δ}
        T_j = ICP_refine(M_j, P_crop, T_init)
        store T_R←O_j = T_j
    # Reproject every object into every frame
    for each frame t:
        for each object j:
            T_C_t←O_j = inverse(T_R←C_t) * T_R←O_j
            render {M_j, T_C_t←O_j} → (L_t, D_t)       # label and depth images
```
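The `landmark_align` step above has a compact closed-form solution. Below is a minimal NumPy sketch of the SVD-based (Kabsch) fit between the clicked mesh points and clicked scene points; the function name and 4×4 return convention are illustrative assumptions, not the LabelFusion implementation:

```python
import numpy as np

def landmark_align(mesh_pts, scene_pts):
    """Rigid 4x4 transform mapping mesh_pts onto scene_pts.

    Closed-form SVD (Kabsch) solution for >= 3 point correspondences,
    as used for the 3-click initialization.
    """
    mesh_pts = np.asarray(mesh_pts, dtype=float)    # shape (N, 3)
    scene_pts = np.asarray(scene_pts, dtype=float)

    # Center both point sets on their centroids.
    mu_m, mu_s = mesh_pts.mean(axis=0), scene_pts.mean(axis=0)
    A, B = mesh_pts - mu_m, scene_pts - mu_s

    # Rotation from the SVD of the 3x3 cross-covariance matrix.
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = mu_s - R @ mu_m

    T = np.eye(4)                     # homogeneous 4x4 transform
    T[:3, :3], T[:3, 3] = R, t
    return T
```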
2. Mathematical Formulation
ICP Objective
Alignment seeks the rigid transformation $(R, t)$ minimizing the sum of squared distances between transformed mesh points and their nearest neighbors in the reconstructed point cloud $P$:

$$
(R^*, t^*) = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^3} \sum_i \left\| R\,m_i + t - p_{\mathrm{NN}(i)} \right\|^2
$$

where $m_i$ are points sampled from the object mesh and $p_{\mathrm{NN}(i)} \in P$ is the nearest neighbor of $R\,m_i + t$. This is solved by alternating nearest-neighbor assignment and SVD-based closed-form updates.
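A minimal sketch of this alternation, reusing the `landmark_align` helper from the previous sketch and SciPy's k-d tree for nearest neighbors; the 20-iteration budget and 1e−5 tolerance mirror Section 5, while everything else is an illustrative assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(mesh_pts, cloud_pts, T_init, iters=20, tol=1e-5):
    """Point-to-point ICP: alternate NN matching and SVD pose updates.

    mesh_pts: (N, 3) points sampled from the object mesh.
    cloud_pts: (M, 3) cropped reconstruction point cloud.
    T_init: 4x4 initial guess from the 3-click landmark alignment.
    """
    tree = cKDTree(cloud_pts)          # static scene: build NN index once
    T = T_init.copy()
    prev_err = np.inf
    for _ in range(iters):
        # Transform mesh points by the current pose estimate.
        moved = mesh_pts @ T[:3, :3].T + T[:3, 3]
        # Assign each mesh point its nearest scene neighbor.
        dists, idx = tree.query(moved)
        # Closed-form update via the SVD fit from the previous sketch.
        T = landmark_align(mesh_pts, cloud_pts[idx])
        err = np.mean(dists ** 2)
        if abs(prev_err - err) < tol:  # converged: error change below tol
            break
        prev_err = err
    return T
```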
Camera Projection
A mesh vertex $x_O$, expressed in the object frame $O_j$, is transformed into the camera frame and projected to pixel $(u, v)$ with standard pinhole intrinsics:

$$
\begin{pmatrix} x \\ y \\ z \end{pmatrix} = T_{C_t \leftarrow O_j}\, x_O,
\qquad
u = f_x \frac{x}{z} + c_x, \quad v = f_y \frac{y}{z} + c_y,
$$

with focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$. For each pixel, a z-buffer resolves occlusions and assigns the label of the nearest object surface.
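A simplified sketch of the label rendering as point splatting with a z-buffer; this is a stand-in for the OpenGL or software rasterization mentioned in Section 5, and all names are illustrative:

```python
import numpy as np

def render_labels(objects, K, height, width):
    """Rasterize labeled object points with a z-buffer.

    objects: list of (label_id, verts_cam) pairs, where verts_cam is an
             (N, 3) array already transformed into the camera frame.
    K: 3x3 pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns the label image L_t and depth image D_t.
    """
    labels = np.zeros((height, width), dtype=np.uint8)
    depth = np.full((height, width), np.inf)
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    for label_id, verts in objects:
        z = verts[:, 2]
        front = z > 0                          # keep points in front of camera
        u = np.round(fx * verts[front, 0] / z[front] + cx).astype(int)
        v = np.round(fy * verts[front, 1] / z[front] + cy).astype(int)
        zf = z[front]
        ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        for ui, vi, zi in zip(u[ok], v[ok], zf[ok]):
            if zi < depth[vi, ui]:             # z-buffer occlusion test
                depth[vi, ui] = zi
                labels[vi, ui] = label_id
    return labels, depth
```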
3. Dataset Composition and Statistics
LabelFusion generated a large-scale annotated dataset with the following characteristics:
| Statistic | Value |
|---|---|
| Distinct Object CAD models | 12 |
| Single/double-object scenes | 105 |
| Scenes with ≥6 objects in clutter | 33 |
| Aligned object instances | 339 |
| Typical frames per scene | 3,600 |
| Total labeled frames | 352,000 |
| Total labeled object instances | >1,000,000 |
| Annotation time | ~30 s per object per scene |
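To make the throughput concrete using the table's own figures: a cluttered scene with six objects costs roughly 6 × 30 s = 3 min of human alignment effort plus one 120 s recording, yet its ~3,600 frames each carry at least six labeled instances, i.e. more than 21,000 labeled object instances per scene.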
4. Empirical Insights: Dataset Quantity and Segmentation Accuracy
Extensive experiments were performed using a DeepLab-style ResNet segmentation model to quantify the effects of training set structure:
- Cluttered (multi-object) vs. Single-object Training: On multi-object test scenes, training with cluttered scenes yields ~190% higher IoU than the same number of single-object frames. Beyond 18 cluttered scenes, adding more single-object data offers negligible gains.
- Background Diversity: For robust single-object segmentation in novel backgrounds, approximately 50 background variations are needed to surpass 50% mean IoU.
- View Sampling: For robot-arm scans (slow motion), accuracy gains from denser frame sampling saturate quickly beyond ~0.3 Hz; for faster hand-carried motion, gains accrue up to ~3 Hz. This empirically informs the trade-off between frame sampling rate and dataset diversity (see the sketch below).
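As a quick illustration of what these sampling rates mean for a 30 Hz log, a minimal sketch follows; the helper `subsample_indices` is hypothetical, not part of LabelFusion:

```python
def subsample_indices(n_frames, capture_hz=30.0, target_hz=0.3):
    """Frame indices that downsample a capture_hz log to target_hz."""
    step = max(1, round(capture_hz / target_hz))
    return list(range(0, n_frames, step))

# A 120 s scene at 30 Hz has 3,600 frames; sampling at 0.3 Hz keeps
# every 100th frame (36 frames), while 3 Hz keeps every 10th (360).
assert len(subsample_indices(3600, 30.0, 0.3)) == 36
assert len(subsample_indices(3600, 30.0, 3.0)) == 360
```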
5. Critical Implementation Details
Key system-level and algorithmic choices ensure annotation fidelity and scalability:
- 3D Reconstruction: ElasticFusion, default parameters, GTX 1080 GPU; surfel resolution ≈ 1 cm.
- ICP Parameters: 3-click crop threshold δ = 1 cm; 20 iterations per object alignment; convergence tolerance 1e−5.
- Mesh Generation: Source meshes from Artec Space Spider, Next Engine, YCB dataset, or VTK tools.
- Annotation UI: Built on the Director framework, which streamlines human input for mesh alignment.
- Rendering: OpenGL z-buffer or software rasterization; supports both CPU and GPU parallelism.
- Hardware: Multi-desktop (Intel i7 + GTX 900/1000-series) setups permit real-time throughput matching capture rates.
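These parameters can be gathered into a single configuration record; the sketch below is a hypothetical consolidation, with names and structure that are ours rather than from the LabelFusion codebase:

```python
from dataclasses import dataclass

@dataclass
class LabelFusionConfig:
    # Reconstruction (Section 5: ElasticFusion defaults, ~1 cm surfels)
    surfel_resolution_m: float = 0.01
    # Alignment (Section 5: crop threshold, ICP iterations, tolerance)
    crop_threshold_m: float = 0.01
    icp_iterations: int = 20
    icp_tolerance: float = 1e-5
    # Capture (Section 1: 30 Hz VGA RGB-D, ~120 s per scene)
    capture_hz: float = 30.0
    scene_duration_s: float = 120.0

cfg = LabelFusionConfig()
print(cfg.capture_hz * cfg.scene_duration_s)  # ~3,600 frames per scene
```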
6. Practical Guidelines and Limitations
LabelFusion enables rigorous, quantitative dataset construction and analysis for training deep segmentation and pose architectures:
- Annotation Efficiency: A few minutes of video plus roughly 30 s of human alignment per object yield hundreds of thousands of labeled frames.
- Data Collection Strategy: Empirical calibration curves enable prediction of how many scenes, backgrounds, and viewpoints are required to achieve prescribed accuracy.
- Limiting Factors: Accuracy is sensitive to the diversity of backgrounds and object arrangements in the training set. High-fidelity 3D meshes and careful camera trajectory design are important for robust reconstruction and label transfer. Large homogeneous or textureless surfaces may require slower camera sweeps to keep tracking stable.
LabelFusion has established itself as a representative pipeline in robotic vision data annotation. Its open-source release and published benchmarks have served as a reference point for subsequent methods in RGB-D scene annotation and analysis (Marion et al., 2017).