- The paper introduces a unified single-stage framework that represents each person as a point in spatial-depth space, streamlining multi-person 3D mesh recovery.
- It employs an inter-instance ordinal depth loss and keypoint-aware augmentation to ensure consistent depth estimation and robust occlusion handling.
- Experimental results on benchmarks like Panoptic, MuPoTS-3D, and 3DPW demonstrate improved efficiency and accuracy compared to traditional two-stage approaches.
Overview of "Body Meshes as Points"
The paper "Body Meshes as Points" by Jianfeng Zhang et al. addresses the problem of multi-person 3D body mesh estimation from a single image. Traditional methods typically employ two-stage pipelines: the first stage localizes each person, and the second estimates a body mesh for each detected instance individually. These pipelines are inefficient, particularly in crowded scenes with occlusion, because of redundant per-person processing and elevated computational cost.
The proposed Body Meshes as Points (BMP) model is a single-stage architecture designed to improve both computational efficiency and accuracy. BMP represents each person in an image as a single point in a spatial-depth space; the location and 3D body mesh of every person are then predicted jointly from the features at that point, unifying localization and mesh estimation in one pass.
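As a rough illustration of this representation, the sketch below maps a person's 2D center and scene depth to a cell in a discretized spatial-depth grid. This is a minimal sketch: the grid size, depth range, and function name are illustrative assumptions, not the paper's implementation.

```python
def point_index(cx, cy, depth, img_w, img_h, grid=7, depth_bins=3, max_depth=10.0):
    """Map a person's 2D center (cx, cy) and depth to a cell in a
    discretized spatial-depth grid (illustrative sketch)."""
    # Spatial cell: which of the grid x grid image cells contains the center.
    i = min(int(cx / img_w * grid), grid - 1)
    j = min(int(cy / img_h * grid), grid - 1)
    # Depth level: which of the depth_bins coarse depth slices the person falls in.
    k = min(int(depth / max_depth * depth_bins), depth_bins - 1)
    return (i, j, k)

# A person centered at (320, 240) in a 640x480 image, about 4 m away,
# lands in the middle spatial cell at the middle depth level.
cell = point_index(320, 240, 4.0, 640, 480)  # -> (3, 3, 1)
```

Each occupied cell then carries both a detection confidence and the mesh parameters for the person it represents, which is what makes the single-stage formulation possible.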
Key Contributions
- Unified Single-Stage Framework: BMP introduces a novel single-stage approach for multi-person mesh recovery by representing instances as points. This contrasts with conventional multi-stage approaches and allows for simultaneous person localization and mesh recovery, significantly simplifying the pipeline.
- Depth Ordering with Ordinal Loss: To maintain consistent depth ordering among overlapping instances, BMP employs an inter-instance ordinal depth loss that penalizes predictions contradicting the known front-to-back ordering of people in the scene. This is critical when multiple persons occupy different depth planes.
- Robustness to Occlusion: The model incorporates a keypoint-aware augmentation strategy designed to mitigate occlusion challenges. This augmentation helps the model to focus on structural cues, thereby enhancing robustness under occluded conditions.
- Experimental Validation: BMP is extensively validated against state-of-the-art methods on benchmarks such as Panoptic, MuPoTS-3D, and 3DPW. The model consistently achieves superior results, both in terms of efficiency and accuracy, demonstrating its efficacy over traditional two-stage methodologies.
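The inter-instance ordinal depth loss can be sketched as a pairwise ranking penalty over predicted depths: for each pair where one person is known to be closer, the loss grows when the prediction reverses that ordering. This is an illustrative simplification; the exact formulation in the paper may differ.

```python
import math

def ordinal_depth_loss(pred_depths, pairs):
    """Pairwise ordinal depth penalty (illustrative sketch).

    pred_depths: list of predicted depths, one per person instance.
    pairs: (i, j) tuples meaning person i is known to be closer than
           person j (e.g. from pseudo ordinal annotations).
    """
    loss = 0.0
    for i, j in pairs:
        # Log-loss on the depth margin: small when i is predicted in
        # front of j, large when the ordering is violated.
        loss += math.log(1.0 + math.exp(pred_depths[i] - pred_depths[j]))
    return loss / max(len(pairs), 1)

# Person 0 is known to be closer than person 1:
good = ordinal_depth_loss([2.0, 5.0], [(0, 1)])  # ordering respected, small loss
bad = ordinal_depth_loss([5.0, 2.0], [(0, 1)])   # ordering violated, large loss
```

Because the penalty depends only on relative order, it can be supervised with pseudo ordinal relations even when no metric depth annotations exist.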
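One common form of keypoint-aware occlusion augmentation masks out patches around body keypoints during training, forcing the model to rely on the remaining structural cues. The sketch below illustrates that idea; the patch size, probability, and function name are assumptions, not the paper's exact procedure.

```python
import random

def keypoint_occlusion_aug(image, keypoints, patch=20, p=0.5, seed=None):
    """Randomly black out square patches around body keypoints
    (illustrative sketch of keypoint-aware occlusion augmentation).

    image: 2D list of pixel values (grayscale for simplicity).
    keypoints: list of (x, y) body keypoint coordinates.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    for kx, ky in keypoints:
        if rng.random() > p:
            continue  # leave this keypoint unoccluded
        x0 = max(0, int(kx) - patch // 2)
        y0 = max(0, int(ky) - patch // 2)
        for y in range(y0, min(h, y0 + patch)):
            for x in range(x0, min(w, x0 + patch)):
                image[y][x] = 0  # occlude this pixel
    return image
```

Training with such synthetic occlusions exposes the model to partially hidden bodies far more often than natural data would, which is why it improves robustness in crowded scenes.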
Methodological Details
BMP's architecture relies on a Feature Pyramid Network (FPN) to extract multi-scale features, allowing person instances at different scales and depths to be handled at appropriate feature levels. The image plane and depth range are discretized into a 3D grid of cells, so that instance representation, localization, and mesh regression proceed in parallel across cells. Because in-the-wild images lack explicit 3D depth annotations, training additionally uses pseudo ordinal depth relations derived from models pre-trained on 3D datasets, fostering improved generalization to in-the-wild scenarios.
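The parallel per-cell prediction can be sketched as a simple decoding step: every spatial-depth cell outputs a confidence score and a mesh parameter vector, and confident cells are kept as person instances. The head structure, threshold, and dictionary layout here are illustrative assumptions (BMP regresses parametric body-model coefficients per instance).

```python
def decode_predictions(scores, params, thresh=0.3):
    """Decode per-cell network outputs into person instances (sketch).

    scores: dict mapping (i, j, k) spatial-depth cell -> confidence.
    params: dict mapping (i, j, k) -> mesh parameter vector
            (e.g. body-model pose and shape coefficients).
    """
    instances = []
    for cell, conf in scores.items():
        if conf >= thresh:  # keep cells confident enough to hold a person
            instances.append({"cell": cell, "conf": conf, "mesh": params[cell]})
    # Most confident detections first.
    instances.sort(key=lambda d: -d["conf"])
    return instances
```

Because all cells are decoded in one pass, there is no per-person cropping stage, which is the source of the efficiency gain over two-stage pipelines.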
Implications and Future Directions
The implications of BMP are profound in fields requiring accurate human mesh estimations, such as virtual reality (VR), animation, and human-computer interaction (HCI). Its efficiency and robustness in handling occlusions and crowded environments make it particularly suitable for real-time applications.
For future research, integrating temporal data to capture motion dynamics could further enhance BMP's capabilities. Extending the framework to model interactions among multiple persons, or to incorporate environmental context, could yield further gains in more complex scenes.
In summary, the BMP model offers a significant step forward for single-stage multi-person 3D body mesh recovery, paving the way for more efficient and comprehensive human pose and shape estimation from monocular imagery.