- The paper introduces BLADE's novel depth estimation approach using a trained Pelvis Depth Estimator to accurately predict T_z and enhance mesh recovery.
- The paper integrates T_z-aware pose estimation with differentiable rasterization to recover true camera parameters and correct perspective distortions.
- The paper demonstrates state-of-the-art performance on challenging datasets while leveraging a synthetic dataset to improve generalization in close-range images.
Overview of BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation
The paper "BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation" addresses the problem of reconstructing 3D human poses from single images, a task traditionally hampered by perspective distortion, especially in close-proximity imagery. The proposed method, BLADE (Body mesh Learning through Accurate Depth Estimation), seeks to improve the fidelity of human mesh recovery by accurately estimating depth and camera parameters, rather than relying on the heuristics and orthographic approximations embedded in previous models.
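A minimal sketch (not from the paper; all numbers are illustrative) of why the depth T_z governs perspective distortion: under a pinhole camera, a point at lateral offset X and depth Z projects to u = f * X / Z, so body parts at slightly different depths are magnified very differently when the subject is close to the camera.

```python
def project_u(f, x, z):
    # Pinhole projection of an x-coordinate onto the image plane (pixels).
    return f * x / z

F = 500.0          # focal length in pixels (illustrative)
HALF_WIDTH = 0.25  # shoulder half-width in metres (illustrative)

def scale_ratio(t_z):
    # Ratio of projected sizes for shoulders at the pelvis depth t_z
    # versus shoulders 0.2 m further from the camera.
    near = project_u(F, HALF_WIDTH, t_z)
    far = project_u(F, HALF_WIDTH, t_z + 0.2)
    return near / far

close_ratio = scale_ratio(0.6)    # selfie-like distance: strong distortion
distant_ratio = scale_ratio(5.0)  # full-body distance: nearly orthographic
```

At 0.6 m the nearer shoulders appear roughly a third larger than the farther ones, while at 5 m the difference is only a few percent, which is why an accurate T_z estimate matters most for close-range images.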
Contributions
- Accurate Depth Estimation Approach: Central to the method is precise estimation of the Z-translation (T_z) of the person's body from the camera, which governs the degree of perspective distortion. In contrast to many existing methods that assume orthographic projection or approximate perspective parameters through heuristics, BLADE employs a trained Pelvis Depth Estimator that robustly predicts T_z from image features, building on established depth-prediction techniques so the model can excel where perspective distortion is strongest.
- T_z-aware Pose Estimation: BLADE conditions its pose estimator on the estimated T_z, allowing the mesh reconstruction process to account for perspective distortion in the captured image. A ControlNet-style mechanism injects T_z awareness into the pose estimation network, improving accuracy over the generic transformers used in other methods.
- True Perspective Projection Parameter Recovery: Given the estimated mesh and T_z, BLADE recovers the focal length and the remaining translation components through an optimization that uses differentiable rasterization to align the rendered mesh projection with the person's segmentation mask. BLADE thus recovers a full set of perspective camera parameters, a non-trivial feat in monocular setups.
- Synthetic Data Generation for Robust Generalization: Recognizing gaps in available datasets, particularly for close-range captures, the authors contribute Bedlam-cc—a comprehensive synthetic dataset rich in variation and close-to-camera perspectives. This dataset empowers the model to generalize across diverse image conditions and depth ranges, improving depth prediction accuracy compared to real-world or existing synthetic datasets.
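The T_z-aware conditioning described above can be sketched in miniature. This is an assumed simplification of the ControlNet idea, not the paper's architecture: a trainable copy of a block processes the depth-conditioned features, and its output is merged through a zero-initialized projection, so the network behaves exactly like the base model at initialization and gradually learns depth awareness during training.

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(8, 8))  # frozen base-block weights (illustrative)
W_ctrl = W_base.copy()            # trainable copy, sees the T_z signal
W_zero = np.zeros((8, 8))         # zero-init merge: no effect at init

def block(feat, t_z_embed):
    # Base path is untouched; control path adds the depth embedding.
    base = W_base @ feat
    ctrl = W_ctrl @ (feat + t_z_embed)
    # Until W_zero is trained away from zero, output == base output.
    return base + W_zero @ ctrl

feat = rng.normal(size=8)
t_z_embed = rng.normal(size=8)
out = block(feat, t_z_embed)
```

The zero-initialized merge is what makes this style of conditioning safe to bolt onto a pretrained pose network.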
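The final camera-recovery step can also be illustrated with a toy analogue (an assumed simplification, with synthetic numbers): with T_z fixed by the depth estimator, the pinhole projection u = f * (x + t_x) / t_z is linear in a = f/t_z and b = f*t_x/t_z, so fitting observed silhouette x-coordinates by least squares recovers f and t_x. The actual method instead optimizes an alignment loss against the segmentation mask through a differentiable rasterizer.

```python
import numpy as np

verts_x = np.array([-0.3, -0.1, 0.2, 0.35])  # mesh x-coords in metres (illustrative)
t_z = 2.0                                    # depth predicted upstream (illustrative)

# Observed 2D x-positions relative to the principal point, synthesized
# here from a ground-truth f = 1000 px and t_x = 0.1 m.
target_u = 1000.0 * (verts_x + 0.1) / t_z

# Solve u = a * x + b for a = f/t_z and b = f*t_x/t_z.
A = np.stack([verts_x, np.ones_like(verts_x)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, target_u, rcond=None)

f_est = a * t_z    # recovered focal length (pixels)
t_x_est = b / a    # recovered x-translation (metres)
```

The toy problem admits an exact solution; the paper's rasterization-based optimization plays the same role when the target is a segmentation mask rather than point correspondences.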
Results and Implications
Experimental results reveal BLADE's superior performance across numerous metrics like mean per-vertex error (PVE), 2D alignment measured through Intersection over Union (IoU), and accurate recovery of focal lengths and camera translation parameters. Notably, BLADE attains state-of-the-art accuracy on challenging datasets, such as SPEC-MTP and PDHuman, confirming its robustness in managing strong perspective distortions.
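For concreteness, the two main reported metrics can be sketched as follows (a hedged sketch with synthetic inputs, not the paper's evaluation code; the 6890-vertex count follows the standard SMPL mesh):

```python
import numpy as np

def pve_mm(pred, gt):
    # Per-vertex error: mean Euclidean distance between corresponding
    # vertices, reported in millimetres.
    return np.linalg.norm(pred - gt, axis=1).mean() * 1000.0

def mask_iou(pred_mask, gt_mask):
    # Intersection over Union of two boolean silhouette masks.
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union

gt_verts = np.zeros((6890, 3))                     # SMPL meshes have 6890 vertices
pred_verts = gt_verts + np.array([0.03, 0.0, 0.04])  # constant 5 cm offset

m_pred = np.zeros((4, 4), dtype=bool); m_pred[:2] = True
m_gt = np.zeros((4, 4), dtype=bool); m_gt[1:3] = True
```

Here the constant 5 cm offset yields a PVE of 50 mm, and the half-overlapping toy masks yield an IoU of 1/3.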
Theoretical and Practical Implications:
By moving beyond traditional orthographic assumptions, BLADE represents a step toward more realistic human mesh recovery, with relevance to animation, gaming, healthcare, remote interaction, and online virtual try-on services. The method also points to future research directions in which accurate depth estimation could strengthen other real-world machine-vision tasks, such as autonomous driving and augmented reality.
Future Directions
In future work, the authors could extend the setting to multi-person scenes and video input, where sequential frames offer temporal consistency cues. Adapting the model to camera models beyond the pinhole assumption, and to environments with complex lighting, could further broaden its applicability.
The paper exemplifies a substantial step forward in balancing 3D understanding with computational efficiency under visually challenging conditions, reflecting ongoing advances in human-centric computer vision.