- The paper introduces an end-to-end one-stage framework that directly regresses 3D human meshes from single images, eliminating the need for bounding box detections.
- The paper employs a Collision-Aware Representation to distinctly differentiate overlapping body centers, significantly improving accuracy in dense scenes.
- The paper demonstrates robust performance on occluded scenes and real-time speed of over 30 FPS, outperforming state-of-the-art methods.
Overview of "Monocular, One-stage, Regression of Multiple 3D People"
This paper presents a significant advance in 3D human pose estimation from single RGB images by introducing ROMP, an approach for Monocular, One-stage, Regression of Multiple 3D People. Unlike traditional multi-stage pipelines in which each stage addresses a specific task (e.g., detection followed by regression), ROMP unifies the entire process into a one-stage, end-to-end framework.
Key Contributions
- End-to-End One-Stage Framework: ROMP eliminates the bounding-box detection stage that previous approaches typically require. It formulates the problem as a per-pixel representation-learning task, regressing 3D human meshes from an image in a single forward pass. Concretely, the network predicts a Body Center heatmap alongside a Mesh Parameter map, enabling mesh regression at the pixel level.
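To make the per-pixel formulation concrete, the sketch below shows how people can be read out of the two predicted maps: local maxima of the center heatmap give person locations, and the parameter vector for each person is sampled from the parameter map at that location. This is an illustrative simplification, not the paper's code; the function name, the 3x3 local-maximum test, and the confidence threshold are assumptions for the sketch.

```python
import numpy as np

def extract_people(center_heatmap, param_map, conf_thresh=0.25):
    """Pick local maxima in the Body Center heatmap, then sample the
    Mesh Parameter map at those pixels (simplified sketch).

    center_heatmap: (H, W) confidence map
    param_map:      (C, H, W) per-pixel mesh parameter vectors
    """
    H, W = center_heatmap.shape
    people = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            c = center_heatmap[y, x]
            if c < conf_thresh:
                continue
            # 3x3 local-maximum test stands in for heatmap NMS
            if c >= center_heatmap[y - 1:y + 2, x - 1:x + 2].max():
                # one parameter vector per detected person
                people.append((c, param_map[:, y, x]))
    return people
```

Because every pixel carries a full parameter vector, detection and regression collapse into a single map lookup, which is what removes the separate bounding-box stage.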
- Collision-Aware Representation (CAR): To distinguish overlapping human figures in an image, ROMP introduces the CAR. By modeling body centers as mutually repulsive charges, the approach mitigates center collisions in dense scenes, keeping the center of each individual in view distinct and non-overlapping. This notably improves the model's performance in crowded and occlusion-prone scenes.
- Robustness to Occlusion: The proposed method demonstrates significant resilience to various occlusion types, highlighted through its performance on benchmarks featuring intricate occlusions, such as the 3DPW and CMU Panoptic datasets. ROMP's pixel-level representation is particularly effective against both person-person and environmental occlusions.
The experimental results show that ROMP outperforms state-of-the-art methods such as CRMH and VIBE, achieving lower mean per joint position error (MPJPE) and per-vertex error across several benchmarks. This indicates that the proposed holistic representation is better suited to real-world scenes, including those with occlusion and truncation. Moreover, ROMP runs in real time at over 30 FPS on a standard GPU, underscoring its practicality for operational deployment.
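For reference, the MPJPE metric mentioned above is simply the mean Euclidean distance between predicted and ground-truth joints after aligning the root joint. The sketch below assumes joint 0 is the root (pelvis) and coordinates in millimetres, which are common conventions but assumptions here.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error after root alignment.

    pred, gt: (J, 3) joint positions in millimetres; joint 0 is
    assumed to be the root (pelvis) in this sketch.
    """
    pred = pred - pred[0]  # root-align the prediction
    gt = gt - gt[0]        # root-align the ground truth
    return np.linalg.norm(pred - gt, axis=1).mean()
```

A lower MPJPE means the predicted skeleton is, on average, closer to the ground truth at every joint.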
Implications for Future Research
The integration of a one-stage network architecture in monocular 3D mesh regression opens several avenues for future exploration. Potential directions include improving the interpretability of per-pixel estimations and extending the CAR concept to differentiate even more complex human interaction scenarios. Furthermore, extending such a framework to video data or integrating temporal coherency could enhance motion understanding and tracking in dynamic environments.
This work encourages a rethinking of the existing methodologies for 3D human mesh regression, prompting a shift towards simpler and more scalable solutions. The public release of ROMP's implementation also sets a precedent for transparency and reproducibility in complex computer vision tasks, allowing other researchers to build upon this framework to address related challenges in human pose estimation and beyond.