1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
|
class BPnP(torch.autograd.Function):
    """Backpropagatable Perspective-n-Point layer (Chen et al., "BPnP").

    ``forward`` solves the (non-differentiable) PnP problem with an external
    iterative solver; ``backward`` is intended to recover gradients w.r.t. the
    2D points, 3D points, and intrinsics via the Implicit Function Theorem
    (IFT) applied to the stationarity condition of the reprojection objective.

    NOTE(review): the backward pass below is still a documented placeholder
    that raises ``NotImplementedError`` — see the authors' reference
    implementation at http://github.com/BoChenYS/BPnP for the full version.
    """

    @staticmethod
    def forward(ctx, x_2d, z_3d, K, initial_pose_guess=None):
        """Solve PnP and return the camera pose as a tensor.

        Args:
            x_2d: (N, 2) tensor of 2D image points.
            z_3d: (N, 3) tensor of 3D object points.
            K: (3, 3) camera intrinsic matrix.
            initial_pose_guess: optional initialization for the iterative
                solver (passed through unchanged; format is solver-defined).

        Returns:
            Pose tensor on the same device/dtype as ``x_2d``.
        """
        # The PnP solve itself (e.g. Levenberg-Marquardt, or OpenCV's
        # solvePnP) is not differentiable in its standard form, so we run it
        # outside the autograd graph on detached numpy copies.
        # .detach().cpu() is required: .numpy() raises on CUDA tensors and on
        # tensors that require grad.
        y_np = solve_pnp_iterative(
            x_2d.detach().cpu().numpy(),
            z_3d.detach().cpu().numpy(),
            K.detach().cpu().numpy(),
            initial_pose_guess,
        )
        # Convert the solver's result back to a torch tensor:
        # ctx.save_for_backward accepts tensors only, and downstream autograd
        # needs the pose on the caller's device/dtype.
        y_pose = torch.as_tensor(y_np, dtype=x_2d.dtype, device=x_2d.device)
        # Save inputs and the solved pose for the IFT-based backward pass.
        ctx.save_for_backward(x_2d, z_3d, K, y_pose)
        return y_pose

    @staticmethod
    def backward(ctx, grad_output_y):
        """Backward pass via the Implicit Function Theorem (placeholder).

        ``grad_output_y`` is dL/dy, the gradient of the loss w.r.t. the
        output pose y.

        Intended derivation (paper Sections 3.2-3.3, equations (10)-(14)):
        let o(x, z, K, y) be the PnP objective (sum of squared reprojection
        errors r_i = x_i - pi(z_i | y, K)) and define the constraint

            f(x, z, K, y) = grad_y o(x, z, K, y)  (= 0 at the PnP optimum).

        By the IFT, with df_dy = d^2 o / dy^2 (m x m for an m-parameter pose)
        and df_dx, df_dz, df_dK the mixed second derivatives — all computable
        by re-evaluating the reprojection error at y_pose under autograd:

            J_yx = -inv(df_dy) @ df_dx      # dy/dx_2d
            J_yz = -inv(df_dy) @ df_dz      # dy/dz_3d
            J_yK = -inv(df_dy) @ df_dK      # dy/dK (if K is parameterized)

        and the returned gradients (guarded by ctx.needs_input_grad[i]) are

            grad_x2d = J_yx.T @ grad_output_y
            grad_z3d = J_yz.T @ grad_output_y
            grad_K   = J_yK.T @ grad_output_y

        with None for ``initial_pose_guess``.
        """
        x_2d, z_3d, K, y_pose = ctx.saved_tensors
        raise NotImplementedError(
            "Actual backward pass requires full IFT Jacobian computation as in the paper's code."
        )
|