
VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild

Published 5 Aug 2021 in cs.CV | (2108.02452v1)

Abstract: We present VoxelTrack for multi-person 3D pose estimation and tracking from a few cameras which are separated by wide baselines. It employs a multi-branch network to jointly estimate 3D poses and re-identification (Re-ID) features for all people in the environment. In contrast to previous efforts, which require establishing cross-view correspondence based on noisy 2D pose estimates, it directly estimates and tracks 3D poses from a 3D voxel-based representation constructed from multi-view images. We first discretize the 3D space by regular voxels and compute a feature vector for each voxel by averaging the body joint heatmaps that are inversely projected from all views. We estimate 3D poses from the voxel representation by predicting whether each voxel contains a particular body joint. Similarly, a Re-ID feature is computed for each voxel which is used to track the estimated 3D poses over time. The main advantage of the approach is that it avoids making any hard decisions based on individual images. The approach can robustly estimate and track 3D poses even when people are severely occluded in some cameras. It outperforms the state-of-the-art methods by a large margin on three public datasets including Shelf, Campus and CMU Panoptic.

Citations (64)

Summary

  • The paper presents a novel voxel-based method that directly estimates 3D human poses from multiple camera views.
  • It integrates re-identification features to achieve robust multi-person tracking across frames in dynamic scenes.
  • Empirical results demonstrate significant performance improvements over state-of-the-art methods on diverse datasets.


Introduction

The paper "VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild" introduces an approach for 3D human pose estimation and tracking from multiple camera views. VoxelTrack is designed for environments where cameras are widely separated, enabling multi-person tracking without needing to establish cross-view correspondence from noisy 2D poses. The paper describes a voxel-based methodology that estimates poses directly from a 3D representation, avoiding the error accumulation inherent in traditional 2D pose aggregation pipelines.

Voxel-Based Representation

VoxelTrack discretizes the 3D space into a regular voxel grid, which is fundamental to its approach. For each voxel, a feature vector is derived by averaging the body joint heatmaps back-projected from the individual camera views. From this voxel representation, 3D poses are estimated by predicting, for each voxel, whether it contains a particular body joint. The strength of this representation is that spatial features are derived directly in 3D rather than committed to per-image, which improves robustness, especially under occlusion.
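The construction described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: the camera models are represented here as hypothetical projection functions, and the joint estimate is taken as a simple argmax over the averaged heatmap responses.

```python
import numpy as np

def build_voxel_features(heatmaps, projections, grid):
    """Average per-view joint heatmaps at each voxel's projected location.

    heatmaps:    (V, J, H, W) joint heatmaps for V views and J joints
    projections: list of V functions mapping (N, 3) world points -> (N, 2) pixels
    grid:        (N, 3) voxel-center coordinates in world space
    returns:     (N, J) averaged heatmap response per voxel and joint
    """
    V, J, H, W = heatmaps.shape
    feats = np.zeros((grid.shape[0], J))
    for v in range(V):
        uv = np.round(projections[v](grid)).astype(int)   # project voxel centers
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & \
                 (uv[:, 1] >= 0) & (uv[:, 1] < H)         # voxels visible in view v
        for j in range(J):
            feats[inside, j] += heatmaps[v, j, uv[inside, 1], uv[inside, 0]]
    return feats / V                                      # average over views

def estimate_joints(feats, grid):
    """Pick, for each joint, the voxel with the highest averaged response."""
    return grid[np.argmax(feats, axis=0)]                 # (J, 3) joint positions
```

In the actual method the per-voxel prediction is made by a learned network branch rather than a raw argmax, but the averaging of back-projected heatmaps over views follows this pattern.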

Re-Identification and Tracking

In addition to pose estimation, VoxelTrack computes a re-identification (Re-ID) feature for each voxel and uses it to track people across frames. By leveraging these Re-ID features, the system tracks the estimated 3D poses over time, integrating tracking seamlessly with pose estimation. Because the association operates on features aggregated in 3D rather than on decisions made from single camera views, tracking remains reliable under occlusion in dynamic scenes.
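One simple way such Re-ID-based association can work is greedy matching on embedding similarity. The sketch below is an assumption for illustration, not the authors' exact tracker: it matches current-frame detections to active tracks by cosine similarity of their Re-ID embeddings, repeatedly taking the best pair above a threshold.

```python
import numpy as np

def associate(track_feats, det_feats, threshold=0.5):
    """Greedily match detections to tracks by Re-ID cosine similarity.

    track_feats: (T, D) embeddings of active tracks
    det_feats:   (N, D) embeddings of current-frame detections
    returns:     list of (track_idx, det_idx) matches
    """
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    sim = t @ d.T                            # cosine similarity matrix
    matches = []
    while sim.size and sim.max() > threshold:
        ti, di = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((int(ti), int(di)))
        sim[ti, :] = -np.inf                 # track ti is now taken
        sim[:, di] = -np.inf                 # detection di is now taken
    return matches
```

Unmatched detections would spawn new tracks and unmatched tracks would eventually be terminated; a production tracker might instead solve the assignment optimally (e.g., with the Hungarian algorithm) rather than greedily.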

Performance and Comparison

The paper demonstrates VoxelTrack's superior performance against existing state-of-the-art methods on three public datasets: Shelf, Campus, and CMU Panoptic. The significant margin by which VoxelTrack outperforms previous techniques underlines the efficacy of the voxel representation and the integrated approach to pose estimation and tracking. The empirical results further show robust tracking even in complex scenarios involving occlusions and varied camera configurations.

Implications and Future Work

The implications of VoxelTrack are profound in contexts requiring accurate human pose tracking such as surveillance, sports analysis, and interactive systems within augmented and virtual reality environments. The voxel-based approach presents a paradigm shift towards leveraging 3D spatial information more effectively, potentially propelling future advances in human motion analysis and multi-camera vision systems.

Future developments might focus on improving the computational efficiency and scalability of the voxel-based representation, as well as refining the learned components to further improve pose accuracy. Additionally, exploring real-time deployment of VoxelTrack in commercial applications could substantially broaden its impact.

Conclusion

VoxelTrack represents a significant advancement in the domain of 3D human pose estimation and tracking. By harnessing voxel-based 3D representations and integrating re-identification features for tracking, it addresses substantial challenges posed by traditional 2D-based methods. The improvements in tracking robustness and accuracy, particularly under occlusion, offer promising opportunities for deployment in a variety of complex, real-world environments. Future research and development can potentially extend its application scope while refining its computational demands for broader and more efficient use.
