
Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation

Published 25 Jan 2020 in cs.CV | (2001.09322v3)

Abstract: We present a novel approach to category-level 6D object pose and size estimation. To tackle intra-class shape variations, we learn canonical shape space (CASS), a unified representation for a large variety of instances of a certain object category. In particular, CASS is modeled as the latent space of a deep generative model of canonical 3D shapes with normalized pose. We train a variational auto-encoder (VAE) for generating 3D point clouds in the canonical space from an RGBD image. The VAE is trained in a cross-category fashion, exploiting the publicly available large 3D shape repositories. Since the 3D point cloud is generated in normalized pose (with actual size), the encoder of the VAE learns view-factorized RGBD embedding. It maps an RGBD image in arbitrary view into a pose-independent 3D shape representation. Object pose is then estimated via contrasting it with a pose-dependent feature of the input RGBD extracted with a separate deep neural network. We integrate the learning of CASS and pose and size estimation into an end-to-end trainable network, achieving state-of-the-art performance.

Citations (167)

Summary

  • The paper introduces Canonical Shape Space (CASS) and a deep variational auto-encoder to estimate category-level 6D object pose and size from RGBD images.
  • Evaluations show this approach achieves state-of-the-art performance, outperforming prior methods such as the NOCS framework in 6D pose accuracy on public datasets.
  • This method addresses challenges like intra-class variance and lack of CAD models, enabling better object manipulation in robotics and advancing pose estimation from visual data.

Category-Level 6D Object Pose and Size Estimation: Learning Canonical Shape Space

The paper "Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation" by Dengsheng Chen et al. presents a sophisticated approach for estimating 6D object pose and size at the category level. This research tackles the challenge of intra-class shape variations by introducing a novel unified representation termed Canonical Shape Space (CASS), which serves as the latent space in a deep generative model for canonical 3D shapes.

Key Contributions and Methodology

The primary contribution of this study is the development of a deep variational auto-encoder (VAE) for generating 3D point clouds from RGBD images, thereby facilitating the estimation of object pose and size. The VAE is trained across multiple categories, utilizing extensive 3D shape repositories that are publicly available. This cross-category training strategy exploits the vast shape and pose variations present in real-world data without the need for instance-specific CAD models.
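
The pipeline can be pictured as a standard VAE whose decoder emits a point cloud in the canonical frame. The sketch below is a toy, assumption-laden illustration of that data flow (linear maps stand in for the deep encoder and decoder, and all dimensions are invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper):
FEAT_DIM, LATENT_DIM, NUM_POINTS = 128, 64, 1024

# Toy linear weights standing in for the deep encoder and decoder networks.
W_mu = rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.01
W_logvar = rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.01
W_dec = rng.standard_normal((LATENT_DIM, NUM_POINTS * 3)) * 0.01

def encode(rgbd_feature):
    """Map an RGBD feature to the parameters of a latent distribution (the CASS code)."""
    return rgbd_feature @ W_mu, rgbd_feature @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Generate a point cloud (N, 3) in canonical, pose-normalized coordinates."""
    return (z @ W_dec).reshape(NUM_POINTS, 3)

rgbd_feature = rng.standard_normal(FEAT_DIM)  # stand-in for a learned RGBD embedding
mu, logvar = encode(rgbd_feature)
cloud = decode(reparameterize(mu, logvar))
print(cloud.shape)  # (1024, 3)
```

Because the decoder always outputs shapes in the normalized canonical frame, the encoder is pushed to discard viewpoint, which is what makes the embedding view-factorized.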

The encoder in the VAE achieves view-factorization by transforming an RGBD image, captured from any arbitrary viewpoint, into a pose-independent 3D shape representation. Object pose is determined by contrasting this transformation with a pose-dependent feature derived from input RGBD data via separate deep neural networks.
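
The paper recovers pose by regression from these contrasted learned features. As a non-learned point of reference only (not the authors' method), the same quantities, rotation, translation, and scale, can be recovered in closed form when point correspondences between the canonical shape and the observation are available, via an Umeyama-style similarity alignment:

```python
import numpy as np

def similarity_align(canonical, observed):
    """Closed-form s, R, t with observed ~= s * R @ canonical + t,
    assuming point-to-point correspondences (Umeyama's method)."""
    mu_c, mu_o = canonical.mean(0), observed.mean(0)
    X, Y = canonical - mu_c, observed - mu_o
    U, S, Vt = np.linalg.svd(Y.T @ X)         # cross-covariance of the two clouds
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(U @ Vt))  # guard against reflections
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (X ** 2).sum()
    t = mu_o - s * R @ mu_c
    return s, R, t

# Synthetic check: transform a canonical cloud by a known pose, then recover it.
rng = np.random.default_rng(1)
canonical = rng.standard_normal((500, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
t_true = np.array([0.1, -0.2, 0.3])
observed = 1.5 * canonical @ R_true.T + t_true
s, R, t = similarity_align(canonical, observed)
print(round(s, 3))  # ~1.5
```

The learned regression avoids the explicit correspondence requirement this classical route depends on, which is precisely why the feature-contrasting formulation is attractive for partial, noisy RGBD observations.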

The integration of CASS learning and the pose and size estimation process into a single end-to-end trainable network demonstrates state-of-the-art performance. This integration addresses two major challenges in category-level 6D object pose estimation: significant intra-class variance and the absence of precise CAD models for the target objects.

Numerical Results and Comparison

Evaluations on public datasets reveal that this approach outperforms existing methods like the NOCS framework, especially on metrics that assess precision in pose estimation, such as the 5° 5cm metric. The quantitative results consistently show superior pose accuracy compared to baseline approaches, although size estimation leaves some room for improvement, particularly when post-processing such as Iterative Closest Point (ICP) refinement is not applied.
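
The 5° 5cm metric counts a prediction as correct when the rotation error is under 5 degrees and the translation error under 5 cm. A minimal sketch of that check (translations assumed to be in metres; symmetric-object handling, which benchmarks typically add, is omitted):

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_within_5deg_5cm(R_pred, t_pred, R_gt, t_gt):
    """True if rotation error < 5 degrees and translation error < 5 cm."""
    return (rotation_error_deg(R_pred, R_gt) < 5.0
            and np.linalg.norm(t_pred - t_gt) < 0.05)

# Example: a 3-degree rotation about z plus a 2 cm translation error passes.
theta = np.radians(3.0)
R_pred = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
print(pose_within_5deg_5cm(R_pred, np.array([0.02, 0.0, 0.0]),
                           np.eye(3), np.zeros(3)))  # True
```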

Implications and Future Directions

The research implications are multifaceted. Practically, it paves the way for improvements in object manipulation and navigation in robotics by enabling robots to interact with a diverse set of objects without requiring exact CAD models. Theoretically, it presents a paradigm shift in how pose information is encoded and estimated from visual data, particularly in leveraging generative models.

Future research directions could expand on several fronts. One potential avenue is enhancing the model's ability to handle objects with very complex or high-genus geometries, possibly through volumetric representation techniques. Additionally, incorporating reconstructed shape geometry into the feedback loop for pose estimation could yield an unsupervised or self-supervised learning framework. Further developments might also focus on real-time applications, possibly extending this framework to online tracking and pose estimation in dynamic environments.

Overall, this paper makes a significant contribution to the object pose estimation literature and points toward further exploration of multi-category and real-time applications in AI and robotics.

