- The paper introduces a novel framework that employs heatmap-based attention and a skeleton-based graph neural network for accurate multi-person root joint localization.
- It reformulates depth estimation as a classification problem using soft-argmax, achieving high-resolution depth estimates with an MRPE of 77.6 mm on Human3.6M.
- The framework demonstrates superior performance on both Human3.6M and MuPoTS-3D, highlighting its potential for AR, surveillance, and human-computer interaction applications.
Overview of HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization
The paper presents the Human Depth Estimation Network (HDNet), a novel end-to-end framework designed for absolute localization of root joints in multi-person scenes within camera coordinate space. Traditional methods predominantly estimate 3D joint locations relative to the root joint, which inherently overlooks each person's absolute position with respect to the camera. HDNet closes this gap by directly estimating the camera-space depth of each person's root joint.
Contributions
HDNet introduces several innovative components, notably:
- Heatmap-Based Attention Mechanism: The network predicts the 2D pose through joint heatmaps that function as attention masks. These masks facilitate the pooling of features from the specific image regions pertinent to each individual, enhancing the subsequent depth estimation task.
- Graph Neural Network (GNN) for Feature Propagation: HDNet employs a skeleton-based GNN to propagate features among the joints. This approach allows for effective information exchange, contributing to a more robust depth estimation framework.
- Depth Estimation as a Classification Problem: Rather than regressing depth directly, HDNet discretizes the depth range into bins and treats depth estimation as bin-index classification. The per-bin probabilities are then combined via a soft-argmax operation, recovering a continuous, high-resolution depth estimate.
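The heatmap-based attention pooling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature shapes, the normalization with a small epsilon, and the function name are all assumptions; the idea is simply that each joint heatmap, normalized to sum to one, acts as a spatial attention mask over the backbone feature map.

```python
import numpy as np

def heatmap_attention_pool(features, heatmaps):
    """Pool backbone features using per-joint heatmaps as attention masks.

    features: (C, H, W) feature map from the backbone (shapes assumed).
    heatmaps: (J, H, W) non-negative predicted joint heatmaps.
    Returns a (J, C) matrix of per-joint pooled features.
    """
    C = features.shape[0]
    J = heatmaps.shape[0]
    flat_h = heatmaps.reshape(J, -1)
    # Normalize each heatmap into a spatial attention mask (sums to 1).
    masks = flat_h / (flat_h.sum(axis=1, keepdims=True) + 1e-8)
    flat_f = features.reshape(C, -1)
    # Weighted spatial average of features per joint: (J, H*W) @ (H*W, C).
    return masks @ flat_f.T
```

With a uniform heatmap, this reduces to global average pooling of each channel, which makes the attention interpretation easy to check.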
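The skeleton-based feature propagation can likewise be illustrated with a simplified, non-learned stand-in: mean aggregation over a row-normalized adjacency matrix of the human skeleton. The paper's GNN uses learned message functions; the edge list, number of steps, and aggregation rule here are assumptions chosen only to show how information flows between connected joints.

```python
import numpy as np

def gnn_propagate(node_feats, edges, steps=2):
    """Propagate per-joint features over the skeleton graph.

    node_feats: (J, C) per-joint feature vectors.
    edges: list of (i, j) bone connections (undirected).
    Each step replaces a node's feature with the mean of itself and its
    neighbors -- a simplified stand-in for learned message passing.
    """
    J = node_feats.shape[0]
    A = np.eye(J)                     # self-loops keep each node's own feature
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    A = A / A.sum(axis=1, keepdims=True)  # row-normalize: mean aggregation
    x = node_feats
    for _ in range(steps):
        x = A @ x
    return x
```

Even this crude averaging shows the intended effect: features of joints that are far apart in the image but adjacent in the skeleton exchange information, which is what makes the downstream depth estimate more robust.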
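The classification-plus-soft-argmax formulation can be sketched in a few lines. The depth range, number of bins, and uniform bin centers are hypothetical; the paper's point is only that taking the softmax expectation over bin centers turns discrete bin scores into a continuous, sub-bin-resolution depth.

```python
import numpy as np

def soft_argmax_depth(logits, d_min=1.0, d_max=8.0):
    """Convert per-bin classification logits to a continuous depth.

    logits: (K,) scores over K depth bins spanning [d_min, d_max]
    (range in metres is an assumption for illustration).
    Softmax yields a distribution over bins; the expectation over the
    bin centers is the soft-argmax depth estimate.
    """
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    centers = np.linspace(d_min, d_max, logits.shape[0])
    return float(probs @ centers)
```

Unlike a hard argmax, the estimate is differentiable and not quantized to the bin width, which is what enables high-resolution depth from a classification head.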
Experimental Evaluation
The performance evaluation of HDNet was conducted on two prominent datasets: Human3.6M and MuPoTS-3D. The findings demonstrate HDNet's superiority over previous state-of-the-art systems, yielding more accurate root joint localization.
- Human3.6M Dataset: HDNet achieved a mean root position error (MRPE) of 77.6 mm, with a significant improvement in the estimation of the depth component, which is crucial for camera-space positioning.
- MuPoTS-3D Dataset: The framework's efficacy is further substantiated through its performance on multi-person root joint localization tasks, achieving higher average precision and recall rates under stringent thresholds compared to competing methods.
Implications and Future Work
The potential applications of HDNet span augmented reality, surveillance, and human-computer interaction, where knowing absolute human positions is essential. Using the predicted heatmaps to guide the subsequent depth estimation stage offers a scalable design with potential for real-time use.
The integration of a GNN for joint feature propagation is a novel approach in the context of pose estimation, suggesting that further exploration in human-specific graph-based methods could yield even more precise localization in cluttered environments.
While HDNet offers significant advancements, challenges remain. Specifically, overlapping bounding boxes and variable human sizes pose issues for generalization. Future research could explore incorporating explicit segmentation techniques or developing methodologies to adaptively adjust for different human scales, enhancing the robustness of the framework across diverse scenarios.
In conclusion, HDNet is a significant step forward in multi-person 3D pose estimation, paving the way for more refined, application-ready solutions. Its distinctive formulation of depth estimation underscores the value of treating pose information as an intrinsic component of broader vision models.