
OVE6D: Universal 6D Pose Estimation

Updated 27 December 2025
  • OVE6D is a universal framework for 6D object pose estimation that decomposes pose into out-of-plane rotation, in-plane rotation, and translation.
  • It leverages a modular architecture with lightweight, specialized heads and a viewpoint codebook to achieve robust performance on challenging benchmarks.
  • The design uses fewer than 4 million parameters and only synthetic training data, yet handles object symmetries and occlusions while achieving state-of-the-art accuracy.

OVE6D is a universal deep learning-based framework for model-based 6D object pose estimation from a single depth image and a target object mask. The method is notable for its decomposition of the 6D pose into viewpoint (out-of-plane rotation), in-plane rotation about the camera optical axis, and translation, and its modular architecture comprising lightweight specialized heads. OVE6D is trained solely on synthetic data rendered from the ShapeNet repository and, unlike most existing approaches, achieves robust generalization to unseen real-world objects without the need for dataset-specific fine-tuning. The core architecture contains fewer than 4 million parameters and utilizes a viewpoint codebook enabling efficient pose hypothesis retrieval and refinement. The method attains state-of-the-art results on challenging benchmarks such as T-LESS and Occluded LINEMOD using only synthetic training data, and demonstrates strong robustness to object symmetries and occlusions (Cai et al., 2022).

1. Architecture Overview

The OVE6D architecture operates on pre-processed 128×128 depth crops with an applied object mask, a known object ID, and a mesh model available in a codebook. The central feature extractor is a shared 2D convolutional backbone whose output is a feature map $z \in \mathbb{R}^{C \times H \times W}$, where typically $C=128$ and $H=W=8$. From this shared backbone, three lightweight heads branch out:

  • Object Viewpoint Encoder (OVE): Retrieves the out-of-plane rotation $R_\gamma$ via embedding matching against the precomputed codebook.
  • In-Plane Orientation Regression (IOR): Receives both the real instance feature $z$ and the synthetic codebook feature $z_\gamma$ corresponding to $R_\gamma$, and regresses the in-plane rotation $R_\theta \in SO(2)$.
  • Orientation Consistency Verification (OCV): Scores each hypothesized rotation $R_{\text{est}} = R_\theta \cdot R_\gamma$, ranking candidates for final selection.

Translation is estimated separately: an initial estimate is computed from the mask bounding-box center and median depth, and subsequently refined using the centroid shift between the rendered synthetic view and the observed depth crop. Final pose hypothesis selection uses a depth-difference ratio metric ($q_p$, Eq. 9) and, optionally, an ICP refinement.
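The retrieve-regress-verify flow can be summarized in code. Below is a minimal, runnable sketch of the rotation pipeline; the four module functions are hypothetical stand-ins that return dummy values, with the real components specified in Section 2.

```python
# Control-flow sketch of OVE6D rotation estimation. All module functions
# are stand-ins (random/identity outputs) so the snippet runs as-is.
import torch
import torch.nn.functional as F

backbone = lambda crop: torch.randn(128, 8, 8)            # shared feature map z
ove_head = lambda z: F.normalize(torch.randn(64), dim=0)  # 64-D viewpoint code
ior_head = lambda z, zg: torch.eye(3)                     # in-plane rotation R_theta
ocv_head = lambda z, zg: torch.rand(()).item()            # consistency score

def estimate_rotation(depth_crop, cb_codes, cb_feats, cb_rots, k=5):
    z = backbone(depth_crop)
    v = ove_head(z)
    topk = torch.topk(cb_codes @ v, k).indices            # candidate viewpoints R_gamma
    scored = []
    for i in topk:
        R_theta = ior_head(z, cb_feats[i])                # regress in-plane rotation
        scored.append((ocv_head(z, cb_feats[i]), R_theta @ cb_rots[i]))
    return max(scored, key=lambda t: t[0])[1]             # best R_est = R_theta . R_gamma

# toy codebook with 100 viewpoints
cb_codes = F.normalize(torch.randn(100, 64), dim=1)
cb_feats = torch.randn(100, 128, 8, 8)
cb_rots = torch.stack([torch.eye(3)] * 100)
print(estimate_rotation(torch.randn(1, 128, 128), cb_codes, cb_feats, cb_rots).shape)
```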

2. Layer-by-Layer Specification

The layerwise structure adheres strictly to a modular design:

A. Shared Backbone

  • Convolutional stages with 3×3 kernels, batch normalization, and ReLU activations.
  • Strided convolutions progressively downsample spatial dimensions while increasing channel count.
  • Skip connections connect each pair of layers sharing the same spatial resolution for residual learning.
  • Approximate backbone parameter count: 2.5 million.
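A minimal PyTorch sketch of such a backbone follows. The stage count and channel widths are assumptions chosen to reproduce the stated 128×128 → 128×8×8 mapping, not the paper's reference configuration (which totals ≈2.5M parameters).

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv + BN + ReLU, with a residual skip at equal resolution."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.skip = (c_in == c_out and stride == 1)
    def forward(self, x):
        y = torch.relu(self.bn(self.conv(x)))
        return x + y if self.skip else y

class Backbone(nn.Module):
    """Map a 1x128x128 depth crop to a 128x8x8 feature map z."""
    def __init__(self):
        super().__init__()
        stages = [(1, 32, 2), (32, 32, 1), (32, 64, 2), (64, 64, 1),
                  (64, 128, 2), (128, 128, 1), (128, 128, 2), (128, 128, 1)]
        self.net = nn.Sequential(*[ConvBlock(a, b, s) for a, b, s in stages])
    def forward(self, x):
        return self.net(x)

z = Backbone()(torch.randn(1, 1, 128, 128))
print(z.shape)  # torch.Size([1, 128, 8, 8])
```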

B. Object Viewpoint Encoder (OVE) Head

  • Receives $z$; applies a $3\times3$ conv (128→64), global average pooling, and a linear layer yielding a 64D code.
  • Output serves as the embedding for codebook retrieval.
  • Parameter count: ≈0.05 million.
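A sketch of this head under the stated specification; the ReLU placement and the unit normalization of the output code (convenient for cosine-similarity retrieval) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OVEHead(nn.Module):
    """3x3 conv (128->64) -> global average pool -> linear -> 64-D code."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(128, 64, 3, padding=1)
        self.fc = nn.Linear(64, 64)
    def forward(self, z):                              # z: (B, 128, 8, 8)
        h = torch.relu(self.conv(z)).mean(dim=(2, 3))  # pool to (B, 64)
        return F.normalize(self.fc(h), dim=1)          # unit-norm retrieval code

print(OVEHead()(torch.randn(2, 128, 8, 8)).shape)  # torch.Size([2, 64])
```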

C. In-Plane Orientation Regression (IOR) Head

  • Stacks real and synthetic features along the channel axis.
  • Applies a $3\times3$ conv (256→64), flattening, and fully connected layers (output: a 2D vector $\Theta = (\vartheta_1, \vartheta_2)$).
  • Constructs $R_\theta$ as per Eq. 12.
  • Parameter count: ≈0.3 million.
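A sketch of this head; the hidden width of the fully connected stack is an assumption.

```python
import torch
import torch.nn as nn

class IORHead(nn.Module):
    """Concat real/synthetic features -> 3x3 conv (256->64) -> flatten -> FC -> 2-vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(256, 64, 3, padding=1)
        self.fc = nn.Sequential(nn.Linear(64 * 8 * 8, 64), nn.ReLU(), nn.Linear(64, 2))
    def forward(self, z, z_gamma):                     # each: (B, 128, 8, 8)
        h = torch.relu(self.conv(torch.cat([z, z_gamma], dim=1)))
        return self.fc(h.flatten(1))                   # (B, 2): unnormalized (theta_1, theta_2)

theta = IORHead()(torch.randn(2, 128, 8, 8), torch.randn(2, 128, 8, 8))
print(theta.shape)  # torch.Size([2, 2])
```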

D. Orientation Consistency Verification (OCV) Head

  • Receives stacked real and rotated synthetic features (256 channels).
  • A $3\times3$ conv to 64, then to 16 channels, followed by global average pooling and a scalar output.
  • Parameter count: ≈0.1 million.
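A sketch of this head; producing the scalar with a final linear layer after pooling is an assumption.

```python
import torch
import torch.nn as nn

class OCVHead(nn.Module):
    """3x3 convs 256->64->16 -> global average pool -> scalar consistency score."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(256, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 16, 3, padding=1)
        self.fc = nn.Linear(16, 1)
    def forward(self, z, z_rot):                       # real + rotated synthetic features
        h = torch.relu(self.conv1(torch.cat([z, z_rot], dim=1)))
        h = torch.relu(self.conv2(h)).mean(dim=(2, 3))
        return self.fc(h).squeeze(1)                   # (B,) scores, one per hypothesis

print(OCVHead()(torch.randn(2, 128, 8, 8), torch.randn(2, 128, 8, 8)).shape)
```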

E. Translation Module

  • No learned parameters: initial translation from mask/depth statistics, refinement via synthetic centroid registration.
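A sketch of the initialization step, assuming a pinhole camera model; the intrinsics `K` shown are illustrative, and the centroid-based refinement is omitted.

```python
import numpy as np

def initial_translation(depth, mask, K):
    """Back-project the mask bounding-box center at the masked median depth."""
    ys, xs = np.nonzero(mask)
    u, v = 0.5 * (xs.min() + xs.max()), 0.5 * (ys.min() + ys.max())
    z = np.median(depth[mask > 0])                 # median depth inside the mask
    x = (u - K[0, 2]) * z / K[0, 0]                # pinhole back-projection
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

depth = np.full((128, 128), 0.8)                   # toy depth map (meters)
mask = np.zeros((128, 128)); mask[40:90, 50:100] = 1
K = np.array([[572.4, 0., 64.], [0., 573.6, 64.], [0., 0., 1.]])  # illustrative intrinsics
print(initial_translation(depth, mask, K))
```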

Total learnable parameters: ≈3.45 million.

3. Mathematical Formulation and Loss Functions

The pose $R \in SO(3)$ is represented as the composition $R = R_\theta \cdot R_\gamma$:

  • $R_\gamma$: out-of-plane rotation (retrieved from the codebook).
  • $R_\theta$: in-plane rotation about the camera $z$-axis (regressed by IOR).
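The paper's Eq. 12 is not reproduced here, but a standard construction consistent with the $SO(2)$ parameterization of Section 2C normalizes the regressed 2-vector onto the unit circle and embeds it as a rotation about the camera $z$-axis:

```python
import numpy as np

def in_plane_rotation(theta):
    """Map a regressed 2-vector onto SO(2), embedded as a rotation about z."""
    c, s = theta / np.linalg.norm(theta)           # project onto the unit circle
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

R_theta = in_plane_rotation(np.array([0.6, 0.8]))
print(np.allclose(R_theta @ R_theta.T, np.eye(3)))  # True: a valid rotation
```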

Losses:

  • Viewpoint Ranking Loss ($\ell^{vp}$, Eq. 1): Hinge loss on cosine similarities among the triplet embeddings $(v, v_\theta, v_\gamma)$, enforcing invariance to in-plane rotation and distinctiveness to out-of-plane change. Margin $m^{vp} = 0.1$.
  • In-Plane Regression Loss ($\ell^{\theta}$, Eq. 2): Negative log of the cosine similarity between in-plane rotated features, encouraging accurate regression.
  • Consistency Verification Loss ($\ell^{css}$, Eq. 3): Hinge loss ensuring higher consistency scores for correct pose pairs. Margin $m^{css} = 0.1$.
  • Total Loss (Eq. 6): $L = \frac{1}{bs} \sum_i \left[ \lambda_1 \ell^{vp} + \lambda_2 \ell^{css} + \lambda_3 \ell^{\theta} \right]$ with $\lambda_1 = 100$, $\lambda_2 = 10$, $\lambda_3 = 1$.
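A sketch of these losses under the stated margins and weights; the exact reductions follow the descriptions above rather than the paper's Eqs. 1–6 verbatim.

```python
import torch
import torch.nn.functional as F

def total_loss(v, v_theta, v_gamma, f_pred, f_target, s_pos, s_neg,
               m_vp=0.1, m_css=0.1, w=(100.0, 10.0, 1.0)):
    # Eq. 1 (sketch): in-plane rotated views must embed closer than out-of-plane views
    l_vp = F.relu(F.cosine_similarity(v, v_gamma, dim=1)
                  - F.cosine_similarity(v, v_theta, dim=1) + m_vp).mean()
    # Eq. 2 (sketch): negative log cosine similarity of in-plane rotated features
    l_theta = -torch.log(F.cosine_similarity(f_pred, f_target, dim=1).clamp_min(1e-6)).mean()
    # Eq. 3 (sketch): correct pose pairs must score above incorrect ones
    l_css = F.relu(s_neg - s_pos + m_css).mean()
    return w[0] * l_vp + w[1] * l_css + w[2] * l_theta

B, D = 8, 64
embeds = [torch.randn(B, D) for _ in range(5)]  # v, v_theta, v_gamma, f_pred, f_target
print(total_loss(*embeds, torch.rand(B), torch.rand(B)))
```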

4. Handling Object Symmetries and Ambiguities

OVE6D encodes invariances and resolves ambiguities as follows:

  • The viewpoint embedding $v(\cdot)$ is explicitly invariant to all in-plane rotations $R_\theta$, enforced via the triplet ranking loss, so all in-plane rotated views project to a single 64-dimensional code.
  • Symmetric objects with multiple indistinguishable out-of-plane orientations have those orientations occupy separate positions in the codebook. At inference, retrieving the top-$K$ candidates covers symmetry-equivalent poses, as the toy example after this list illustrates.
  • The IOR and OCV heads then resolve in-plane ambiguities and validate the predicted pose against the actual depth crop.
  • For fully symmetric objects, multiple viewpoint candidates receive identical scores; the final $q_p$ selection rule and, optionally, ICP refinement resolve residual ties.
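A toy illustration of this retrieval behavior: two codebook entries standing in for indistinguishable views of a symmetric object receive identical similarities, so top-$K$ retrieval keeps both and downstream scoring breaks the tie.

```python
import torch
import torch.nn.functional as F

# Entries 0 and 1 represent symmetry-equivalent out-of-plane views;
# entry 2 is a distinct view.
codebook = F.normalize(torch.tensor([[1.0, 0.0, 0.0],    # view A
                                     [1.0, 0.0, 0.0],    # symmetric twin of A
                                     [0.0, 1.0, 0.0]]), dim=1)
v = F.normalize(torch.tensor([0.9, 0.1, 0.0]), dim=0)    # observed embedding
sims = codebook @ v
print(torch.topk(sims, k=2).indices.tolist())  # [0, 1]: both symmetric views retained
```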

5. Training Regimen and Data Augmentation

OVE6D is trained exclusively on synthetic data rendered from approximately 19,000 ShapeNet meshes:

  • For each object per batch: 16 anchor viewpoints $R_i \in SO(3)$ are sampled uniformly over the upper hemisphere.
  • For each anchor: generate triplets $V$, $V_\theta = R_\theta \cdot V$, $V_\gamma = R_\gamma \cdot V$ with randomly selected in-plane/out-of-plane rotations.
  • Each batch: 128 samples (8 objects × 16 triplets).
  • Depth rendering utilizes Pyrender for noise-free depth images, constructing both the training data and the viewpoint codebook.
  • Data augmentation includes random rescaling (0.2–0.8), Laplace noise ($\sigma \in [0, 0.01]$), random cutout (1–10% of the area), Gaussian blur ($\sigma \in [0, 1.5]$), and random occlusions covering 20% of the area.
  • Optimization: Adam optimizer, cosine-annealed learning rate from $10^{-3}$ to $10^{-5}$, weight decay $10^{-5}$, 50 epochs in total (~3 days on an RTX 3090).
  • No real images are used for training; the method generalizes zero-shot to real datasets such as LINEMOD, Occluded LINEMOD (LMO), and T-LESS.
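A sketch of the listed depth augmentations; the composition order is an assumption, and the rescale step is omitted since its exact semantics are not specified here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter  # requires SciPy

def augment_depth(d, rng):
    d = d + rng.laplace(0.0, rng.uniform(0.0, 0.01), d.shape)  # Laplace noise
    d = gaussian_filter(d, sigma=rng.uniform(0.0, 1.5))        # random Gaussian blur
    h, w = d.shape
    side = int(np.sqrt(rng.uniform(0.01, 0.10) * h * w))       # cutout: 1-10% of area
    y, x = rng.integers(0, h - side), rng.integers(0, w - side)
    d[y:y + side, x:x + side] = 0.0                            # zero out the patch
    return d

rng = np.random.default_rng(0)
print(augment_depth(np.ones((128, 128)), rng).shape)  # (128, 128)
```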

6. Parameter Budget, Inference, and Benchmark Performance

OVE6D achieves high computational efficiency and performance:

  • Total learnable parameter count is approximately 3.5 million (<4 million).
  • Inference takes approximately 50 ms per object on an Nvidia RTX 3090 GPU with an AMD Ryzen 3970X CPU (excluding ICP refinement).
  • Codebook generation from an arbitrary mesh requires about 30 s per object.
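Offline codebook construction reduces to rendering each sampled viewpoint and encoding it, as in this sketch; `render_depth` and `encode` are hypothetical stand-ins for the Pyrender-based renderer and the backbone + OVE encoder.

```python
import torch
import torch.nn.functional as F

def build_codebook(rotations, render_depth, encode):
    """One unit-norm 64-D code per sampled viewpoint rotation."""
    codes = torch.stack([encode(render_depth(R)) for R in rotations])
    return F.normalize(codes, dim=1)               # (N, 64)

# toy usage with random stand-ins for the renderer and encoder
rots = [torch.eye(3) for _ in range(500)]
cb = build_codebook(rots,
                    render_depth=lambda R: torch.randn(1, 1, 128, 128),
                    encode=lambda d: torch.randn(64))
print(cb.shape)  # torch.Size([500, 64])
```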

Benchmark results on real datasets demonstrate strong performance, both with ground-truth (GT) masks and with masks from a deep segmentation model (Mask R-CNN). Notable recall rates are achieved on T-LESS, Occluded LINEMOD, and LINEMOD, as shown below (input "D" denotes depth-only):

Dataset           Input  Mask Source  ICP  Metric          Score
T-LESS            D      GT           no   VSD recall      85.1%
T-LESS            D      GT           yes  VSD recall      91.0%
T-LESS            D      Mask R-CNN   no   VSD recall      69.4%
T-LESS            D      Mask R-CNN   yes  VSD recall      74.8%
Occluded LINEMOD  D      GT           no   ADD(-S) recall  70.9%
Occluded LINEMOD  D      GT           yes  ADD(-S) recall  82.5%
Occluded LINEMOD  D      Mask R-CNN   no   ADD(-S) recall  56.1%
Occluded LINEMOD  D      Mask R-CNN   yes  ADD(-S) recall  72.8%
LINEMOD           D      GT           no   ADD(-S) recall  96.4%
LINEMOD           D      GT           yes  ADD(-S) recall  98.7%
LINEMOD           D      Mask R-CNN   no   ADD(-S) recall  86.1%
LINEMOD           D      Mask R-CNN   yes  ADD(-S) recall  92.4%

The architecture enables practitioners to re-implement OVE6D, construct viewpoint codebooks for arbitrary meshes, and replicate the cited experimental results or extend the network for further research (Cai et al., 2022).

References

Cai, D., Heikkilä, J., & Rahtu, E. (2022). OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
