LabelFusion Pipeline for Robotic Vision
- LabelFusion is a robotic vision pipeline that generates ground-truth per-pixel segmentation masks and precise 6 DOF pose annotations from real RGB-D videos of cluttered scenes.
- It integrates dense 3D reconstruction, efficient human-assisted ICP-based mesh alignment, and automated 2D label reprojection to create massive annotated datasets.
- Empirical results show that training on cluttered scenes captured from diverse viewpoints dramatically improves segmentation accuracy, yielding concrete guidance for data collection strategy.
LabelFusion is a robotic vision pipeline for generating ground-truth, per-pixel object segmentation masks and 6 DOF pose annotations from real RGB-D video of cluttered scenes. It addresses the critical bottleneck of deep robotic perception: scalable, high-quality training data specific to manipulation tasks, which existing public datasets rarely provide. LabelFusion achieves rapid, high-throughput annotation by combining real RGB-D video acquisition, dense 3D reconstruction, efficient human-assisted object mesh alignment via ICP, and massive-scale 2D label generation through mesh reprojection. The pipeline enables the systematic study of segmentation network performance as a function of dataset structure, quantity, and diversity (Marion et al., 2017).
1. Pipeline Architecture and Workflow
LabelFusion comprises four primary stages:
- RGB-D Video Capture: Scenes are captured at 30 Hz VGA RGB-D using sensors such as the Asus Xtion Pro, either hand-held (freeform) or robot-arm–mounted (scripted trajectories), with no hand–eye calibration required. Each 120 s recording yields approximately 3,600 frames.
- Dense 3D Reconstruction: ElasticFusion, a real-time dense RGB-D SLAM system, produces a surfel-based point cloud P in a canonical reconstruction frame R. It computes per-frame camera poses T_R←C_t and fuses frames without needing fiducial markers. Robustness to textureless surfaces is maintained via adaptive viewpoint trajectories.
- Human-Assisted ICP-based Mesh Alignment: For each object j, a known CAD mesh M_j is aligned to the reconstruction using a 3-click initialization (three point correspondences between scene and mesh, solved in closed form via SVD) followed by point-to-point ICP refinement on a cropped region of the cloud. This yields a precise 6 DOF object pose T_R←O_j. The operation is highly efficient: typically 30 s per object per scene.
- Reprojection to 2D Frames: For each frame t and object j, the camera-frame pose T_C_t←O_j = inverse(T_R←C_t) · T_R←O_j is computed, the mesh is projected into the frame using standard pinhole intrinsics, and a z-buffer assigns per-pixel object labels and depth. This automated rendering produces hundreds of thousands of labeled RGB-D images per dataset, each with full segmentation masks and accurate object pose annotations.
Pseudocode Summary
The overall process is encapsulated in the pseudocode below:
```
for each scene log:
    # Dense reconstruction: point cloud P and per-frame camera poses T_R←C_t
    P, {T_R←C_t} = ElasticFusion(rgbd_frames)
    for each object j:
        load M_j
        # 3-click initialization: corresponding scene/mesh points
        s1, s2, s3 = user clicks on P
        m1, m2, m3 = user clicks on M_j
        T_init = landmark_align({m_k}, {s_k})          # closed-form SVD fit
        # Crop the cloud near the initial guess, then refine
        P_crop = {p ∈ P | dist(p, T_init(M_j)) < δ}
        T_j = ICP_refine(M_j, P_crop, T_init)
        store T_R←O_j = T_j
    # Reproject every object into every frame
    for each frame t:
        for each object j:
            T_C_t←O_j = inverse(T_R←C_t) * T_R←O_j
            render {M_j, T_C_t←O_j} → (L_t, D_t)       # label and depth images
```
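The `landmark_align` step above has a compact closed-form solution. Below is a minimal NumPy sketch of the SVD-based (Kabsch) fit between the clicked mesh points and clicked scene points; the function name and 4×4 return convention are illustrative assumptions, not the LabelFusion implementation:

```python
import numpy as np

def landmark_align(mesh_pts, scene_pts):
    """Rigid 4x4 transform mapping mesh_pts onto scene_pts.

    Closed-form SVD (Kabsch) solution for >= 3 point correspondences,
    as used for the 3-click initialization.
    """
    mesh_pts = np.asarray(mesh_pts, dtype=float)    # shape (N, 3)
    scene_pts = np.asarray(scene_pts, dtype=float)

    # Center both point sets on their centroids.
    mu_m, mu_s = mesh_pts.mean(axis=0), scene_pts.mean(axis=0)
    A, B = mesh_pts - mu_m, scene_pts - mu_s

    # Rotation from the SVD of the 3x3 cross-covariance matrix.
    U, _, Vt = np.linalg.svd(A.T @ B)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = mu_s - R @ mu_m

    T = np.eye(4)                     # homogeneous 4x4 transform
    T[:3, :3], T[:3, 3] = R, t
    return T
```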
2. Mathematical Formulation
ICP Objective
Alignment seeks the rigid transformation $(R, t)$ minimizing the sum of squared distances between transformed mesh points and their nearest neighbors in the reconstructed point cloud $P$:

$$
(R^*, t^*) = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^3} \sum_i \left\| R\,m_i + t - p_{\mathrm{NN}(i)} \right\|^2
$$

where $m_i$ are points sampled from the object mesh and $p_{\mathrm{NN}(i)} \in P$ is the nearest neighbor of $R\,m_i + t$. This is solved by alternating nearest-neighbor assignment and SVD-based closed-form updates.
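A minimal sketch of this alternation, reusing the `landmark_align` helper from the previous sketch and SciPy's k-d tree for nearest neighbors; the 20-iteration budget and 1e−5 tolerance mirror Section 5, while everything else is an illustrative assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_refine(mesh_pts, cloud_pts, T_init, iters=20, tol=1e-5):
    """Point-to-point ICP: alternate NN matching and SVD pose updates.

    mesh_pts: (N, 3) points sampled from the object mesh.
    cloud_pts: (M, 3) cropped reconstruction point cloud.
    T_init: 4x4 initial guess from the 3-click landmark alignment.
    """
    tree = cKDTree(cloud_pts)          # static scene: build NN index once
    T = T_init.copy()
    prev_err = np.inf
    for _ in range(iters):
        # Transform mesh points by the current pose estimate.
        moved = mesh_pts @ T[:3, :3].T + T[:3, 3]
        # Assign each mesh point its nearest scene neighbor.
        dists, idx = tree.query(moved)
        # Closed-form update via the SVD fit from the previous sketch.
        T = landmark_align(mesh_pts, cloud_pts[idx])
        err = np.mean(dists ** 2)
        if abs(prev_err - err) < tol:  # converged: error change below tol
            break
        prev_err = err
    return T
```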
Camera Projection
A mesh vertex $x_O$, expressed in the object frame $O_j$, is transformed into the camera frame and projected to pixel $(u, v)$ with standard pinhole intrinsics:

$$
\begin{pmatrix} x \\ y \\ z \end{pmatrix} = T_{C_t \leftarrow O_j}\, x_O,
\qquad
u = f_x \frac{x}{z} + c_x, \quad v = f_y \frac{y}{z} + c_y,
$$

with focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$. For each pixel, a z-buffer resolves occlusions and assigns the label of the nearest object surface.
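A simplified sketch of the label rendering as point splatting with a z-buffer; this is a stand-in for the OpenGL or software rasterization mentioned in Section 5, and all names are illustrative:

```python
import numpy as np

def render_labels(objects, K, height, width):
    """Rasterize labeled object points with a z-buffer.

    objects: list of (label_id, verts_cam) pairs, where verts_cam is an
             (N, 3) array already transformed into the camera frame.
    K: 3x3 pinhole intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    Returns the label image L_t and depth image D_t.
    """
    labels = np.zeros((height, width), dtype=np.uint8)
    depth = np.full((height, width), np.inf)
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    for label_id, verts in objects:
        z = verts[:, 2]
        front = z > 0                          # keep points in front of camera
        u = np.round(fx * verts[front, 0] / z[front] + cx).astype(int)
        v = np.round(fy * verts[front, 1] / z[front] + cy).astype(int)
        zf = z[front]
        ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        for ui, vi, zi in zip(u[ok], v[ok], zf[ok]):
            if zi < depth[vi, ui]:             # z-buffer occlusion test
                depth[vi, ui] = zi
                labels[vi, ui] = label_id
    return labels, depth
```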
3. Dataset Composition and Statistics
LabelFusion generated a large-scale annotated dataset with the following characteristics:
| Statistic | Value |
|---|---|
| Distinct Object CAD models | 12 |
| Single/double-object scenes | 105 |
| Scenes with ≥6 objects in clutter | 33 |
| Aligned object instances | 339 |
| Typical frames per scene | 3,600 |
| Total labeled frames | 352,000 |
| Total labeled object instances | >1,000,000 |
| Annotation time | ~30 s per object per scene |
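To make the throughput concrete using the table's own figures: a cluttered scene with six objects costs roughly 6 × 30 s = 3 min of human alignment effort plus one 120 s recording, yet its ~3,600 frames each carry at least six labeled instances, i.e. more than 21,000 labeled object instances per scene.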
4. Empirical Insights: Dataset Quantity and Segmentation Accuracy
Extensive experiments were performed using a DeepLab-style ResNet segmentation model to quantify the effects of training set structure:
- Cluttered (multi-object) vs. Single-object Training: On multi-object test scenes, training with cluttered scenes yields ~190% higher IoU than the same number of single-object frames. Beyond 18 cluttered scenes, adding more single-object data offers negligible gains.
- Background Diversity: For robust single-object segmentation in novel backgrounds, approximately 50 background variations are needed to surpass 50% mean IoU.
- View Sampling: For robot-arm scans (slow motion), accuracy gains from denser frame sampling saturate quickly beyond ~0.3 Hz; for faster hand-carried motion, gains accrue up to ~3 Hz. This empirically informs the trade-off between frame sampling rate and dataset diversity (see the sketch below).
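As a quick illustration of what these sampling rates mean for a 30 Hz log, a minimal sketch follows; the helper `subsample_indices` is hypothetical, not part of LabelFusion:

```python
def subsample_indices(n_frames, capture_hz=30.0, target_hz=0.3):
    """Frame indices that downsample a capture_hz log to target_hz."""
    step = max(1, round(capture_hz / target_hz))
    return list(range(0, n_frames, step))

# A 120 s scene at 30 Hz has 3,600 frames; sampling at 0.3 Hz keeps
# every 100th frame (36 frames), while 3 Hz keeps every 10th (360).
assert len(subsample_indices(3600, 30.0, 0.3)) == 36
assert len(subsample_indices(3600, 30.0, 3.0)) == 360
```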
5. Critical Implementation Details
Key system-level and algorithmic choices ensure annotation fidelity and scalability:
- 3D Reconstruction: ElasticFusion, default parameters, GTX 1080 GPU; surfel resolution ≈ 1 cm.
- ICP Parameters: 3-click crop threshold δ = 1 cm; 20 iterations per object alignment; convergence tolerance 1e−5.
- Mesh Generation: Source meshes from Artec Space Spider, Next Engine, YCB dataset, or VTK tools.
- Annotation UI: Built on the Director framework, which streamlines human input for mesh alignment.
- Rendering: OpenGL z-buffer or software rasterization; supports both CPU and GPU parallelism.
- Hardware: Multi-desktop (Intel i7 + GTX 900/1000-series) setups permit real-time throughput matching capture rates.
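These parameters can be gathered into a single configuration record; the sketch below is a hypothetical consolidation, with names and structure that are ours rather than from the LabelFusion codebase:

```python
from dataclasses import dataclass

@dataclass
class LabelFusionConfig:
    # Reconstruction (Section 5: ElasticFusion defaults, ~1 cm surfels)
    surfel_resolution_m: float = 0.01
    # Alignment (Section 5: crop threshold, ICP iterations, tolerance)
    crop_threshold_m: float = 0.01
    icp_iterations: int = 20
    icp_tolerance: float = 1e-5
    # Capture (Section 1: 30 Hz VGA RGB-D, ~120 s per scene)
    capture_hz: float = 30.0
    scene_duration_s: float = 120.0

cfg = LabelFusionConfig()
print(cfg.capture_hz * cfg.scene_duration_s)  # ~3,600 frames per scene
```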
6. Practical Guidelines and Limitations
LabelFusion enables rigorous, quantitative dataset construction and analysis for training deep segmentation and pose architectures:
- Annotation Efficiency: A few minutes of video plus roughly 30 s of human alignment per object yield hundreds of thousands of labeled frames.
- Data Collection Strategy: Empirical calibration curves enable prediction of how many scenes, backgrounds, and viewpoints are required to achieve prescribed accuracy.
- Limiting Factors: Accuracy is sensitive to the diversity of backgrounds and object arrangements in the training set. High-fidelity 3D meshes and careful camera trajectory design are important for robust reconstruction and label transfer. Large homogeneous or textureless surfaces may require slower camera sweeps to keep tracking stable.
LabelFusion has established itself as a representative pipeline in robotic vision data annotation. Its open-source release and published benchmarks have served as a reference point for subsequent methods in RGB-D scene annotation and analysis (Marion et al., 2017).