- The paper introduces dense semantic representations derived from SMPL parameters to guide 3D reconstruction of invisible regions and surface details.
- It proposes a network architecture that integrates multi-scale image features into 3D space through volumetric feature transformation to improve surface geometry recovery.
- Experiments on the THuman dataset show higher Intersection-over-Union scores than competing methods, demonstrating robustness to complex poses and clothing variations.
Overview of "DeepHuman: 3D Human Reconstruction from a Single Image"
The paper "DeepHuman: 3D Human Reconstruction from a Single Image" introduces a novel approach for generating 3D human models from a single RGB image. This approach, named DeepHuman, utilizes a convolutional neural network (CNN) architecture to perform volume-to-volume translation, effectively reconstructing 3D human geometry. The network is augmented with dense semantic representations derived from the Skinned Multi-Person Linear (SMPL) model to address the challenges associated with reconstructing invisible areas and surface details.
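To make the idea of a dense semantic representation concrete, here is a minimal sketch of one way such a volume could be built: each body-model vertex carries a fixed semantic code (e.g., its canonical-pose coordinates), which is splatted into a voxel grid so the network receives per-voxel body semantics alongside the image. The function name, code choice, and averaging scheme are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def semantic_volume(vertices, codes, resolution=64):
    """Splat per-vertex semantic codes into a voxel grid (illustrative sketch).

    vertices: (N, 3) body-model vertex positions, assumed normalized to [0, 1).
    codes:    (N, C) per-vertex semantic code (e.g., canonical-pose coordinates).
    Returns a (resolution, resolution, resolution, C) volume; empty voxels stay zero.
    """
    n, c = codes.shape
    vol = np.zeros((resolution,) * 3 + (c,))
    count = np.zeros((resolution,) * 3 + (1,))
    # Map each normalized vertex position to a voxel index.
    idx = np.clip((vertices * resolution).astype(int), 0, resolution - 1)
    for (x, y, z), code in zip(idx, codes):
        vol[x, y, z] += code
        count[x, y, z] += 1
    # Average codes where several vertices fall into the same voxel.
    vol /= np.maximum(count, 1)
    return vol
```

A real pipeline would derive `vertices` from a fitted SMPL model and likely use a smoother splatting kernel; the point is that the volume encodes *which body part* occupies each region of space, which is what lets the network hallucinate plausible geometry for invisible areas.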
Key Contributions
- Dense Semantic Representation: The study proposes using dense semantic representations as a complementary input to traditional image data. This representation is generated by projecting SMPL model parameters into both 2D and 3D spaces to guide the reconstruction process.
- Network Architecture: DeepHuman introduces a network design that fuses multi-scale image features into 3D space via volumetric feature transformation. This enables more accurate surface geometry recovery, particularly in regions where conventional methods struggle due to limited visibility.
- Normal Refinement Network: To enhance visible surface details, the authors introduce a normal refinement network. This component utilizes a volumetric normal projection layer to sharpen surface features such as clothing wrinkles and hairstyles.
- THuman Dataset: The paper also introduces the THuman dataset, consisting of approximately 7,000 3D human models. This dataset is instrumental in training the network, providing diverse examples that improve model generalization to real-world scenarios.
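One simple way to read the "volumetric feature transformation" mentioned above is as lifting a 2D feature map into a 3D volume so that 3D convolutions can consume image evidence: under an orthographic camera, every voxel along a viewing ray receives the feature of the pixel it projects to. The sketch below shows that interpretation only; the function name and the orthographic assumption are ours, not taken from the paper.

```python
import numpy as np

def unproject_features(feat_2d, depth_bins):
    """Lift a 2D feature map into a 3D volume (orthographic assumption).

    feat_2d:    (H, W, C) image features from a 2D encoder.
    depth_bins: number of slices along the viewing (depth) axis.
    Every depth slice shares the same 2D feature map, so voxels along
    one viewing ray all see the same image evidence.
    """
    h, w, c = feat_2d.shape
    # Broadcast the image features along a new leading depth axis;
    # copy() because broadcast_to returns a read-only view.
    return np.broadcast_to(feat_2d, (depth_bins, h, w, c)).copy()
```

Subsequent 3D convolutions can then disambiguate depth using the semantic volume, which is roughly why combining the two inputs helps where a single image is ambiguous.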
Numerical Results and Implications
DeepHuman demonstrates superior performance over state-of-the-art methods, notably achieving higher accuracy in 3D reconstruction as evidenced by favorable Intersection-over-Union (IoU) metrics. These improvements indicate enhanced robustness and fidelity in capturing complex human poses and clothing variations. The results from experiments on both synthetic and real-world images suggest that the model effectively handles various challenges posed by single-image input, including depth ambiguities and occlusions.
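For readers unfamiliar with the IoU metric on volumetric reconstructions, it is the ratio of overlapping occupied voxels to the union of occupied voxels between prediction and ground truth; a short sketch (the empty-volume convention here is our assumption):

```python
import numpy as np

def voxel_iou(pred, gt):
    """Intersection-over-Union between two binary occupancy volumes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty volumes count as a perfect match.
    return inter / union if union else 1.0
```

Higher values mean the reconstructed volume overlaps the ground-truth body more tightly, which is why IoU is a natural headline metric for voxel-based human reconstruction.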
Implications and Future Directions
The implications of this research are significant for fields requiring realistic human models, such as virtual reality (VR), augmented reality (AR), and digital content creation. By reducing the input requirement to a single image while maintaining high reconstruction fidelity, this method paves the way for more accessible and versatile applications.
Theoretically, the integration of semantic representation and volumetric transformations offers a promising avenue for further exploration in 3D computer vision. Future work may focus on enhancing the synthesis of fine details, incorporating temporal consistency for video-based inputs, and expanding the versatility of the model to handle more diverse clothing and accessory conditions.
In conclusion, DeepHuman represents a step forward in 3D human reconstruction from monocular images. Its methodological innovations and the introduction of the THuman dataset provide a valuable contribution to ongoing research efforts in artificial intelligence and computer vision domains.