OVE6D: Universal 6D Pose Estimation
- OVE6D is a universal framework for 6D object pose estimation that decomposes pose into out-of-plane rotation, in-plane rotation, and translation.
- It leverages a modular architecture with lightweight, specialized heads and a viewpoint codebook to achieve robust performance on challenging benchmarks.
- The design uses fewer than 4 million parameters and purely synthetic training data, handling object symmetries and occlusions efficiently while achieving state-of-the-art accuracy.
OVE6D is a universal deep learning-based framework for model-based 6D object pose estimation from a single depth image and a target object mask. The method is notable for its decomposition of the 6D pose into viewpoint (out-of-plane rotation), in-plane rotation about the camera optical axis, and translation, and its modular architecture comprising lightweight specialized heads. OVE6D is trained solely on synthetic data rendered from the ShapeNet repository and, unlike most existing approaches, achieves robust generalization to unseen real-world objects without the need for dataset-specific fine-tuning. The core architecture contains fewer than 4 million parameters and utilizes a viewpoint codebook enabling efficient pose hypothesis retrieval and refinement. The method attains state-of-the-art results on challenging benchmarks such as T-LESS and Occluded LINEMOD using only synthetic training data, and demonstrates strong robustness to object symmetries and occlusions (Cai et al., 2022).
1. Architecture Overview
The OVE6D architecture operates on pre-processed 128×128 depth crops with an applied object mask, a known object ID, and a mesh model available in a codebook. The central feature extractor is a shared 2D convolutional backbone whose output is a feature map F ∈ ℝ^(C×H×W), with C = 128 channels at a reduced spatial resolution H×W. From this shared backbone, three lightweight heads branch out:
- Object Viewpoint Encoder (OVE): Retrieves the out-of-plane rotation R_v via embedding matching against the precomputed codebook.
- In-Plane Orientation Regression (IOR): Receives both the real instance feature and the synthetic codebook feature corresponding to R_v, and regresses the in-plane rotation R_i (an angle γ about the optical axis).
- Orientation Consistency Verification (OCV): Scores each hypothesized rotation R = R_i · R_v, ranking candidates for final selection.
Translation is estimated separately: an initial estimate is computed from the mask bounding-box center and the median depth inside the mask, and is subsequently refined using the centroid shift between the rendered synthetic view and the observed depth crop. Final pose hypothesis selection uses a depth-difference ratio metric (Eq. 9) and, optionally, ICP refinement.
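The initial translation step above can be sketched in NumPy; the function name and intrinsics handling are illustrative, not taken from the OVE6D reference code:

```python
import numpy as np

def init_translation(depth, mask, K):
    # Pixels belonging to the object mask.
    ys, xs = np.nonzero(mask)
    # Bounding-box center of the mask in pixel coordinates.
    u = (xs.min() + xs.max()) / 2.0
    v = (ys.min() + ys.max()) / 2.0
    # Median depth over the masked pixels is robust to outliers.
    z = float(np.median(depth[mask]))
    # Back-project the pixel center with the pinhole model.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])
```

This estimate is then refined by registering the centroid of a rendered synthetic view against the observed crop.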
2. Layer-by-Layer Specification
The layerwise structure adheres strictly to a modular design:
A. Shared Backbone
- Convolutional stages with 3×3 kernels, batch normalization, and ReLU activations.
- Strided convolutions progressively downsample spatial dimensions while increasing channel count.
- Skip connections link layers at the same spatial resolution, enabling residual learning.
- Approximate backbone parameter count: 2.5 million.
B. Object Viewpoint Encoder (OVE) Head
- Receives the backbone feature map; applies a convolution (128→64 channels), global average pooling, and a linear layer yielding a 64-D code.
- Output serves as the embedding for codebook retrieval.
- Parameter count: ≈0.05 million.
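A minimal NumPy sketch of the OVE head, assuming a 16×16 spatial feature map and using a 1×1-style channel mix in place of whatever kernel size the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def ove_head(feat, w_conv, w_fc):
    # Channel mixing (o,c) x (c,h,w): equivalent to a 1x1 conv, 128 -> 64.
    mixed = np.einsum('oc,chw->ohw', w_conv, feat)
    mixed = np.maximum(mixed, 0.0)        # ReLU
    pooled = mixed.mean(axis=(1, 2))      # global average pooling -> 64-D
    code = w_fc @ pooled                  # linear layer -> 64-D code
    return code / np.linalg.norm(code)    # unit norm for cosine matching

feat = rng.standard_normal((128, 16, 16))       # assumed spatial size
w_conv = rng.standard_normal((64, 128)) * 0.1   # illustrative weights
w_fc = rng.standard_normal((64, 64)) * 0.1
code = ove_head(feat, w_conv, w_fc)
```

The unit-normalized output makes codebook retrieval a pure cosine-similarity lookup.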
C. In-Plane Orientation Regression (IOR) Head
- Stacks real and synthetic features along the channel axis.
- Applies a convolution (256→64 channels), flattening, and fully-connected layers outputting a 2-D vector encoding the in-plane angle γ.
- Constructs R_i from this vector as per Eq. 12.
- Parameter count: ≈0.3 million.
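A plausible construction of R_i from the regressed 2-D vector, assuming a (cos γ, sin γ) parameterization; the paper's Eq. 12 may order or normalize terms differently:

```python
import numpy as np

def inplane_rotation(v):
    # Normalise the regressed 2-D vector so it lies on the unit circle,
    # giving (cos g, sin g) for the in-plane angle g.
    c, s = v / np.linalg.norm(v)
    # Embed as a rotation about the camera optical (z) axis.
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])
```

Normalizing before embedding guarantees a valid rotation matrix regardless of the raw regression magnitude.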
D. Orientation Consistency Verification (OCV) Head
- Receives stacked real and rotated synthetic features (256 channels).
- A convolution reduces the 256 channels to 64, then to 16, followed by global average pooling and a scalar output.
- Parameter count: ≈0.1 million.
E. Translation Module
- No learned parameters: initial translation from mask/depth statistics, refinement via synthetic centroid registration.
Total learnable parameters: ≈3.45 million.
3. Mathematical Formulation and Loss Functions
The rotation component of the pose is represented as the composition R = R_i · R_v:
- R_v: out-of-plane rotation (viewpoint, retrieved from the codebook).
- R_i: in-plane rotation about the camera optical (z) axis (regressed by IOR).
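The decomposition can be illustrated with explicit rotation matrices; the choice of an x-axis tilt for R_v below is arbitrary, for demonstration only:

```python
import numpy as np

def rot_z(g):
    # In-plane rotation R_i(g) about the camera optical (z) axis.
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    # A stand-in out-of-plane rotation R_v (here a tilt about the x-axis).
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Full rotation: apply the codebook viewpoint first, then the in-plane spin.
R = rot_z(0.3) @ rot_x(0.7)
```

Any composition of two valid rotations is itself a valid rotation, which is what lets the two heads be trained and evaluated separately.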
Losses:
- Viewpoint Ranking Loss (Eq. 1): Hinge loss on the cosine similarities among triplet embeddings (anchor view, in-plane rotated positive, out-of-plane rotated negative), enforcing invariance to in-plane rotation and distinctiveness to out-of-plane change, with a fixed margin.
- In-Plane Regression Loss (Eq. 2): Negative log of the cosine similarity between in-plane rotated features, encouraging accurate angle regression.
- Consistency Verification Loss (Eq. 3): Hinge loss ensuring higher consistency scores for correct pose pairs, again with a fixed margin.
- Total Loss (Eq. 6): A weighted sum of the three losses above.
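The triplet ranking loss can be sketched as follows; the margin value 0.1 is a placeholder, not the paper's setting:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def viewpoint_ranking_loss(anchor, positive, negative, margin=0.1):
    # positive: embedding of an in-plane rotated view (should match anchor).
    # negative: embedding of an out-of-plane rotated view (should differ).
    # Hinge: penalise whenever the negative is not at least `margin`
    # less similar to the anchor than the positive is.
    return max(0.0, margin + cosine(anchor, negative) - cosine(anchor, positive))
```

Driving this loss to zero is exactly what makes the 64-D code invariant to in-plane rotation but discriminative across viewpoints.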
4. Handling Object Symmetries and Ambiguities
OVE6D encodes invariances and resolves ambiguities as follows:
- The viewpoint embedding is explicitly invariant to in-plane rotations, enforced via the triplet ranking loss, so that all in-plane rotated views of a given viewpoint project to a single 64-dimensional code.
- Symmetric objects with multiple indistinguishable out-of-plane orientations have those orientations occupy separate positions in the codebook. At inference, retrieving top- candidates covers symmetry-equivalent poses.
- IOR and OCV heads then resolve in-plane ambiguities and validate the predicted pose with the actual depth crop.
- For fully symmetric objects, multiple viewpoint candidates receive identical scores; the final selection rule and, optionally, ICP refinement resolve residual ties.
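Top-k retrieval against the codebook reduces to a cosine-similarity search; this sketch assumes row-wise codes and is not the reference implementation:

```python
import numpy as np

def retrieve_topk(query, codebook, k=3):
    # Normalise so the dot product is cosine similarity.
    q = query / np.linalg.norm(query)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    sims = cb @ q
    # Indices of the k most similar viewpoint codes; for symmetric objects
    # several entries score near-identically, so all survive into the
    # hypothesis set for IOR/OCV to disambiguate.
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]
```

Keeping k > 1 is what lets symmetry-equivalent viewpoints remain in play until the verification head scores them against the real depth crop.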
5. Training Regimen and Data Augmentation
OVE6D is trained exclusively on synthetic data rendered from approximately 19,000 ShapeNet meshes:
- For each object per batch: 16 anchor viewpoints are sampled uniformly over the upper hemisphere.
- For each anchor: generate triplets , , with randomly selected in-plane/out-of-plane rotations.
- Each batch: 128 samples (8 objects × 16 triplets).
- Depth rendering utilizes Pyrender for noise-free depth images, constructing both the training data and the viewpoint codebook.
- Data augmentation includes random rescaling (0.2–0.8), additive Laplace noise, random cutout (1–10% of the area), Gaussian blur, and random occlusions covering 20% of the area.
- Optimization: Adam with a cosine-annealed learning-rate schedule and weight decay, for a total of 50 epochs (≈3 days on an RTX 3090).
- No real images are used for training; the method generalizes zero-shot to real datasets such as LINEMOD, Occluded LINEMOD (LMO), and T-LESS.
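Two of the augmentations (Laplace noise and random cutout) can be sketched as follows; the noise scale and cutout fraction here are placeholders, since the paper's exact values are not reproduced in this summary:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_depth(depth, cutout_frac=0.05, noise_scale=0.002):
    # Additive Laplace noise simulating depth-sensor noise.
    out = depth + rng.laplace(scale=noise_scale, size=depth.shape)
    # Random rectangular cutout simulating missing depth / occlusion.
    h, w = depth.shape
    ch = max(1, int(h * np.sqrt(cutout_frac)))
    cw = max(1, int(w * np.sqrt(cutout_frac)))
    y = int(rng.integers(0, h - ch + 1))
    x = int(rng.integers(0, w - cw + 1))
    out[y:y + ch, x:x + cw] = 0.0
    return out
```

Such corruptions narrow the synthetic-to-real gap, which is what allows training on noise-free Pyrender output to transfer zero-shot to real sensors.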
6. Parameter Budget, Inference, and Benchmark Performance
OVE6D achieves high computational efficiency and performance:
- Total learnable parameter count is approximately 3.5 million (<4 million).
- Inference takes approximately 50 ms per object on an Nvidia RTX 3090 GPU with an AMD Ryzen 3970X CPU (excluding ICP refinement).
- Codebook generation from an arbitrary mesh takes about 30 s per object.
Benchmark results on real datasets demonstrate strong performance, both with ground-truth (GT) masks and masks from deep segmentation models (Mask-RCNN). Notable recall rates are achieved on T-LESS, Occluded LINEMOD, and LINEMOD datasets, as shown below:
| Dataset | Input | Mask Source | ICP | Metric | Score |
|---|---|---|---|---|---|
| T-LESS | D | GT | no | VSD recall | 85.1% |
| T-LESS | D | GT | yes | VSD recall | 91.0% |
| T-LESS | D | Mask-RCNN | no | VSD recall | 69.4% |
| T-LESS | D | Mask-RCNN | yes | VSD recall | 74.8% |
| Occluded LM | D | GT | no | ADD(-S) recall | 70.9% |
| Occluded LM | D | GT | yes | ADD(-S) recall | 82.5% |
| Occluded LM | D | Mask-RCNN | no | ADD(-S) recall | 56.1% |
| Occluded LM | D | Mask-RCNN | yes | ADD(-S) recall | 72.8% |
| LINEMOD | D | GT | no | ADD(-S) recall | 96.4% |
| LINEMOD | D | GT | yes | ADD(-S) recall | 98.7% |
| LINEMOD | D | Mask-RCNN | no | ADD(-S) recall | 86.1% |
| LINEMOD | D | Mask-RCNN | yes | ADD(-S) recall | 92.4% |
The architecture enables practitioners to re-implement OVE6D, construct viewpoint codebooks for arbitrary meshes, and replicate the cited experimental results or extend the network for further research (Cai et al., 2022).