AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos

Published 4 Aug 2024 in cs.CV (arXiv:2408.02110v2)

Abstract: Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.

Summary

  • The paper introduces a framework that employs personalized neural avatars to overcome occlusions and depth ambiguities in multi-human interactions.
  • It leverages layered volume rendering and a composite loss function to significantly reduce MPJPE and boost PCP3D across multiple benchmarks.
  • The method enhances applications in VR, biomechanics, and HCI by enabling precise motion capture in close-contact, interaction-heavy scenarios.

An Expert Overview of "AvatarPose: Avatar-Guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-View Videos"

"AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos" by Feichi Lu et al. addresses the enduring challenge of accurately estimating the 3D pose and shape of multiple closely interacting humans using consumer-grade sensor setups. The complexity inherent in such scenarios arises from significant occlusions and depth ambiguities, which traditional methods relying on 2D joint estimations cannot reliably overcome.

Methodological Approach

The authors propose AvatarPose, a novel framework that utilizes personalized implicit neural avatars as priors to achieve robust and precise 3D pose estimation from sparse multi-view video inputs. The methodology comprises two primary phases:

  1. Avatar Reconstruction: The process begins with the efficient creation of avatars for each individual in the scene using a neural radiance field variant, integrating an SMPL-based deformation module for articulation. Layered volume rendering is then employed to jointly optimize the avatars through a collective rendering loss, ensuring coherent multi-person scene modeling.
  2. Pose Optimization: Leveraging the avatars as priors, the method alternates between pose refinement and avatar adjustment. Poses are optimized by minimizing a composite objective consisting of color and silhouette rendering losses, while a collision loss penalizes interpenetration between avatars. This alternating optimization enhances the robustness and accuracy of the estimation, particularly in high-contact scenarios.
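The structure of the composite pose-optimization objective can be sketched as follows. This is an illustrative toy, not the authors' implementation: the loss terms mirror the color, silhouette, and collision components described above, but the occupancy representation and the weights `w_rgb`, `w_sil`, and `w_col` are hypothetical stand-ins.

```python
import numpy as np

def color_loss(rendered_rgb, target_rgb):
    """L1 photometric error between rendered and observed pixel colors."""
    return np.abs(rendered_rgb - target_rgb).mean()

def silhouette_loss(rendered_mask, target_mask):
    """Squared error between rendered and observed person silhouettes."""
    return ((rendered_mask - target_mask) ** 2).mean()

def collision_loss(occ_a, occ_b):
    """Penalize 3D sample points that both avatars claim as occupied.

    occ_a, occ_b: soft occupancy values in [0, 1] for two avatars,
    evaluated at a shared set of 3D sample points.
    """
    return (occ_a * occ_b).mean()

def composite_loss(rendered_rgb, rendered_mask, target_rgb, target_mask,
                   occ_a, occ_b, w_rgb=1.0, w_sil=0.1, w_col=0.01):
    """Weighted sum of the rendering and collision terms (weights assumed)."""
    return (w_rgb * color_loss(rendered_rgb, target_rgb)
            + w_sil * silhouette_loss(rendered_mask, target_mask)
            + w_col * collision_loss(occ_a, occ_b))
```

In the paper, pose parameters and avatar parameters are updated in alternation against this kind of objective; the loss is shown here in isolation for clarity.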

Experimental Setup

The efficacy of the proposed method is validated against state-of-the-art approaches on several public datasets, notably Hi4D, CHI3D, Shelf, and MultiHuman. The comparison metrics include MPJPE, PCP3D, AP_K, and recall, with AvatarPose demonstrating superior performance across these benchmarks.
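As a reminder of what the headline metric measures, MPJPE (Mean Per-Joint Position Error) is the average Euclidean distance between predicted and ground-truth 3D joint positions, typically reported in millimetres. A minimal NumPy sketch (array shapes are an assumption, not tied to any particular dataset's joint layout):

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance over joints.

    pred_joints, gt_joints: arrays of shape (num_joints, 3) in the same
    units (e.g. millimetres).
    """
    per_joint_error = np.linalg.norm(pred_joints - gt_joints, axis=-1)
    return per_joint_error.mean()
```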

Results

Quantitative results indicate a significant reduction in MPJPE and an increase in PCP3D and AP_K compared to existing methods such as Graph, MvP, Faster VoxelPose, MVPose, and 4DAssociation. For instance, on the Hi4D dataset, AvatarPose achieved an MPJPE of 32.10 mm and a PCP3D of 96.90%, surpassing the closest competitor, 4DAssociation, which recorded 41.29 mm and 88.62%, respectively.

Qualitatively, AvatarPose is shown to handle occlusions and close interactions more effectively, producing visually consistent 3D reconstructions even in complex scenarios involving hugging or intertwined limbs. The incorporation of rendering-based loss functions and collision avoidance proves crucial in mitigating the errors typically propagated by noisy 2D detections.

Implications and Future Directions

The introduction of personalized avatars as priors marks a substantial shift from conventional SMPL-based or entirely learning-based approaches. By harnessing fine-grained color and silhouette data for pose optimization, the method establishes a new benchmark for 3D human pose estimation in closely interacting contexts.

From a practical standpoint, AvatarPose can be transformative for applications requiring precise motion capture in interaction-heavy environments, such as virtual reality, biomechanics, and advanced human-computer interaction systems. Furthermore, the modular design of the method allows potential extension into related domains, including gesture recognition and real-time performance capture.

Theoretically, the integration of personalized avatars with joint optimization schemes opens new avenues for research in avatar-based modeling and multi-person interaction analysis. Future developments might focus on refining pose initialization strategies to prevent local minima and extending the scope to hand and facial expressions by incorporating models like MANO or SMPL-X.

Conclusion

"AvatarPose" presents a robust and innovative solution to the problem of 3D pose estimation in multi-human interactions, demonstrating significant improvements over existing methods through the use of neural avatars as priors. The approach not only enhances pose accuracy but also introduces a scalable framework adaptable to various interaction-rich scenarios.
