
YCB-Video Dataset for 6D Pose Estimation

Updated 14 January 2026
  • YCB-Video Dataset is a large-scale, annotated video dataset featuring RGB and depth data with precise 6D pose labels for evaluating object pose estimation in cluttered environments.
  • It comprises 92 video sequences with over 133,000 frames, using SDF fitting and global optimization to deliver high-accuracy ground truth across multiple objects per frame.
  • The dataset supports both RGB and RGB-D methods, providing benchmark metrics such as ADD, ADD-S, and AUC, and has driven advances in pose estimation methodologies.

The YCB-Video Dataset is a large-scale, richly annotated video dataset designed for evaluating 6D object pose estimation algorithms in cluttered, real-world settings. Introduced alongside the PoseCNN framework, this dataset consists of color and depth video sequences containing objects from the YCB object set, with each frame precisely labeled with ground truth 6D poses (3D translation and 3D rotation) for multiple objects per scene. The YCB-Video Dataset has become a standard benchmark for 6D pose estimation, supporting advances in RGB- and RGB-D-based methods, and has been used extensively to demonstrate and compare the performance of state-of-the-art systems, notably PoseCNN and its successors (Xiang et al., 2017, Periyasamy et al., 2022).

1. Dataset Composition and Annotation Protocol

The YCB-Video Dataset comprises 21 objects selected from the YCB set for model quality and depth coverage. Object categories range from textured household items to geometric primitives.

  • Videos: 92 total sequences, with 80 for training and 12 held out for testing.
  • Frames: 133,827 total, with 3–9 objects per frame (mean 4.47). The test split contains 2,949 key frames sampled from the test videos.
  • Annotation Workflow:
    • Initial pose alignment in each sequence is performed on the first frame using Signed-Distance Field (SDF) fitting in depth images.
    • Objects remain fixed while the camera moves, enabling accurate tracking of camera poses throughout each sequence.
    • Final annotation relies on global optimization over both camera trajectory and object transforms, ensuring high-precision ground truth.
  • Data Modalities: Each frame provides RGB and depth, with per-pixel object masks and object class labels.

This precise annotation strategy yields ground truth 6D poses for real objects, supporting both RGB and RGB-D algorithm evaluation (Xiang et al., 2017).
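Since every frame pairs RGB with a registered depth image, depth can be back-projected into a camera-frame point cloud using the camera intrinsics. Below is a minimal NumPy sketch of this back-projection; it assumes the YCB-Video convention of storing depth as integers scaled by a `factor_depth` value (commonly 10000) recorded in the per-frame metadata. The function name and defaults are illustrative, not part of any official toolkit.

```python
import numpy as np

def depth_to_points(depth_raw, K, depth_factor=10000.0):
    """Back-project a raw depth image into camera-frame 3D points.

    depth_raw: (H, W) integer depth image (YCB-Video stores depth scaled
               by `factor_depth`, commonly 10000 units per metre).
    K:         (3, 3) camera intrinsic matrix.
    Returns an (H*W, 3) array of points in metres; pixels with zero
    (invalid) depth produce points with z == 0.
    """
    h, w = depth_raw.shape
    z = depth_raw.astype(np.float64) / depth_factor
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - K[0, 2]) * z / K[0, 0]                 # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]                 # (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

The resulting cloud, combined with the per-pixel object masks, yields the observed object surface points used by depth-based fitting and refinement methods.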

2. Benchmarking Protocols and Splits

The standard benchmark protocol involves training on the 80 training videos (with 3–9 objects per frame) and evaluating on the 2,949 key frames from the 12 held-out test videos. This protocol ensures that:

  • All methods are assessed on the same subset of temporally diverse, cluttered scenes.
  • The benchmark includes a variety of challenges, including occlusions and interactions between multiple objects.

Common evaluation metrics applied on the YCB-Video Dataset include:

  • ADD (Average Distance of Model Points): \mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \| (Rx + T) - (\hat{R}x + \hat{T}) \|, where (R, T) is the ground-truth rotation and translation, (\hat{R}, \hat{T}) the estimate, and \mathcal{M} the set of m 3D model points.
  • ADD-S: A closest-point variant of ADD for symmetric objects, averaging the distance from each estimated model point to the nearest ground-truth-transformed model point rather than to its exact counterpart.
  • Reprojection Error: For methods evaluated using RGB only.
  • AUC (Area Under Curve): Plotted over threshold-swept accuracy, commonly reported for both ADD and ADD-S.
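These metrics can be sketched directly in NumPy. The implementation below is illustrative rather than the reference evaluation code: in particular, the AUC here uses a simple Riemann approximation of the accuracy-versus-threshold curve instead of the exact integration performed by the official evaluation scripts.

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under the
    ground-truth and predicted poses."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_metric(pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, match each predicted point to its
    closest ground-truth point before averaging."""
    gt = pts @ R_gt.T + t_gt
    pred = pts @ R_pred.T + t_pred
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    return d.min(axis=1).mean()

def auc(errors, max_threshold=0.1, steps=1000):
    """Area under the accuracy-vs-threshold curve, thresholds swept over
    [0, max_threshold] metres (Riemann approximation)."""
    thresholds = np.linspace(0.0, max_threshold, steps)
    accuracy = (np.asarray(errors)[None, :] < thresholds[:, None]).mean(axis=1)
    return accuracy.mean()
```

By construction ADD-S never exceeds ADD for the same pose pair, which is why symmetric objects are scored with the ADD-S column in published tables.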

These benchmark protocols have made YCB-Video one of the standard datasets for evaluating 6D pose estimation on real cluttered scenes (Xiang et al., 2017, Periyasamy et al., 2022).

3. Role in Pose Estimation Research

The YCB-Video Dataset was introduced to support the development of PoseCNN, an end-to-end convolutional network for 6D pose estimation in cluttered scenes, but its influence extends broadly:

  • Decoupled Translation and Rotation: Methods such as PoseCNN employ center localization and depth regression for translation, and quaternion regression for rotation, leveraging the dense and diverse annotations of YCB-Video for robust training and benchmarking.
  • Handling Occlusion: The dataset's multiple-instance, occlusion-rich scenes have fostered architectural advances such as Hough-voting center localization and robust orientation heads (Xiang et al., 2017).
  • Symmetry Handling: Ground truth 6D poses for symmetric and near-symmetric objects enable effective evaluation of symmetry-aware loss functions, such as ShapeMatch-Loss (SLoss) and pose clustering strategies (Periyasamy et al., 2022).
  • ICP Refinement and RGB-D Benchmarks: The joint RGB and depth data support protocols that assess both color-only and RGB-D methods, including post-hoc ICP refinement.

These attributes have made YCB-Video the foundation for numerous algorithms targeting real-world 6D object pose estimation (Xiang et al., 2017, Periyasamy et al., 2022).
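The ShapeMatch-Loss idea referenced above can be sketched compactly: rather than penalizing the rotation error directly, the loss matches each rotated model point to its nearest point on the ground-truth-rotated model, so all rotations within an object's symmetry group incur (near-)zero loss. The sketch below follows the form of the PoseCNN rotation loss but is a NumPy illustration, not the training implementation.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def shapematch_loss(q_pred, q_gt, pts):
    """Symmetry-aware rotation loss: average (half) squared distance from
    each predicted model point to its nearest ground-truth model point."""
    p = pts @ quat_to_rot(q_pred).T
    g = pts @ quat_to_rot(q_gt).T
    d2 = ((p[:, None, :] - g[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean() / 2.0
```

For an object symmetric under a 180-degree rotation about its z-axis, a prediction rotated by exactly that symmetry yields zero loss even though the quaternions differ, which is the behaviour a naive quaternion regression loss cannot provide.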

4. Impact on Methodological Advances

The comprehensive annotations and challenging scenarios of YCB-Video have directly enabled and motivated key innovations:

  • Fully Convolutional Approaches: ConvPoseCNN 2 and related architectures leverage the dense spatial annotations by predicting dense pose fields, eschewing region-of-interest pooling for sliding-window inference (Periyasamy et al., 2022).
  • Aggregation Strategies for Orientation: The dataset supports studies of different quaternion aggregation methods (e.g., Markley’s weighted average, RANSAC clustering) to handle per-pixel pose hypotheses in dense settings.
  • Iterative Refinement Modules: Advances such as residual-style feature refinement blocks evaluated on YCB-Video have enabled iterative improvement in pose estimates, with documented AUC gains for challenging classes and scenarios.
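Markley's weighted quaternion average, mentioned above as one aggregation strategy for dense per-pixel orientation hypotheses, takes the eigenvector with the largest eigenvalue of the weighted outer-product matrix of the quaternions; unlike a naive component-wise mean, it is insensitive to the sign ambiguity q ~ -q. A minimal sketch:

```python
import numpy as np

def average_quaternions(quats, weights=None):
    """Weighted quaternion average (Markley et al.): the dominant
    eigenvector of M = sum_i w_i q_i q_i^T. Robust to the q ~ -q sign
    ambiguity because q and -q contribute the same outer product."""
    q = np.asarray(quats, dtype=np.float64)
    w = np.ones(len(q)) if weights is None else np.asarray(weights, float)
    M = (w[:, None, None] * q[:, :, None] * q[:, None, :]).sum(axis=0)
    vals, vecs = np.linalg.eigh(M)      # ascending eigenvalues
    avg = vecs[:, -1]                   # eigenvector of largest eigenvalue
    return avg if avg[0] >= 0 else -avg  # canonical sign (w >= 0)
```

In a dense-prediction setting the weights would typically come from per-pixel confidence or segmentation scores; here they are left generic.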

In summary, by combining scale, accuracy, and scenario diversity, YCB-Video continues to be a driving force for architectural and loss function design in pose estimation research (Periyasamy et al., 2022).

5. Quantitative Benchmarks and Comparative Results

Methods evaluated on the YCB-Video Dataset report comprehensive metrics, typically in the form of per-class and mean AUC for ADD and ADD-S, as well as translation error and segmentation IoU. Representative results from PoseCNN and ConvPoseCNN 2 include:

Method           6D AUC (P)   6D AUC (S)   Rot. AUC (P)   Rot. AUC (S)   Trans. err (m)   Segm. IoU
PoseCNN          53.71        76.12        78.87          93.16          0.0520           0.8369
ConvPoseCNN 2    57.42        79.26        74.53          91.56          0.0411           0.8044

(P: scored with the ADD metric; S: scored with the symmetry-aware ADD-S metric.)

Evaluations demonstrate that PoseCNN achieves state-of-the-art results for color-only input, with further gains via ICP-based RGB-D refinement (20–30% AUC increases). ConvPoseCNN 2 yields slightly improved AUC, substantially faster training, and parameter efficiency, highlighting the dataset’s critical role in advancing 6D object pose estimation (Xiang et al., 2017, Periyasamy et al., 2022).
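The ICP-based refinement step referenced above can be illustrated with a minimal point-to-point loop: alternate nearest-neighbour matching with a closed-form (Kabsch/SVD) rigid-transform update. This is a sketch of the generic technique, not the exact refinement used in the PoseCNN experiments; practical pipelines add KD-tree search, outlier rejection, and often point-to-plane error terms.

```python
import numpy as np

def icp_refine(src, dst, R, t, iters=20):
    """Refine an initial pose (R, t) aligning model points `src` to
    observed points `dst` via point-to-point ICP."""
    for _ in range(iters):
        moved = src @ R.T + t
        # brute-force nearest-neighbour correspondences
        d = np.linalg.norm(moved[:, None, :] - dst[None, :, :], axis=2)
        matched = dst[d.argmin(axis=1)]
        # Kabsch: optimal rigid transform from src to matched points
        mu_s, mu_m = src.mean(axis=0), matched.mean(axis=0)
        H = (src - mu_s).T @ (matched - mu_m)
        U, _, Vt = np.linalg.svd(H)
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ S @ U.T  # reflection-corrected rotation
        t = mu_m - R @ mu_s
    return R, t
```

With a reasonable RGB-only initialization (as produced by PoseCNN) and depth-derived observations, such a loop converges in a handful of iterations, which is how the RGB-D refinement numbers are obtained in practice.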

6. Limitations and Points of Ongoing Research

Despite its widespread use, the YCB-Video Dataset has certain limitations:

  • Annotation Procedure: All object poses in a video are initialized via SDF fitting and tracked with static objects and a moving camera. While this ensures consistent ground truth, it means dynamic object motion and object–object interaction between frames are not captured.
  • Domain Specificity: The objects, settings, and background contexts are sampled from a specific subset of environments (e.g., tabletop scenes), which could limit generalization to highly dynamic or non-standard settings.
  • Symmetry in Ground Truth: Some object symmetries are only approximately handled, requiring algorithmic awareness (e.g., explicit symmetric loss terms or clustering).

A plausible implication is that methods overfitted to YCB-Video test sets may not fully generalize to more diverse real-world pose estimation scenarios. Ongoing research leverages the dataset for ablation studies on loss functions, center regression strategies, and ICP initialization performance, continually shaping the next generation of robust pose estimation pipelines (Xiang et al., 2017, Periyasamy et al., 2022).

References

  • Xiang et al. (2017). PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv:1711.00199.
  • Periyasamy et al. (2022). ConvPoseCNN2: Prediction and Refinement of Dense 6D Object Poses.
