TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity, we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.
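The abstract describes TALS only qualitatively: large 2D and p-GT errors are penalized, while errors below a threshold are not. The following is a minimal sketch of that idea, not the authors' formulation. The function name `threshold_scaled_loss`, the hard cutoff, the `small_scale` weight, and the use of a mean absolute residual are all assumptions made for illustration.

```python
def threshold_scaled_loss(residuals, threshold, small_scale=0.0):
    """Mean absolute residual with small errors down-weighted.

    Hypothetical illustration of a threshold-adaptive loss:
    residuals whose magnitude is at or below `threshold` are multiplied
    by `small_scale` (0.0 ignores them entirely); larger residuals keep
    full weight. This is an assumed form, not the paper's definition.
    """
    if not residuals:
        raise ValueError("residuals must be non-empty")
    if threshold < 0:
        raise ValueError("threshold must be non-negative")

    total = 0.0
    for r in residuals:
        magnitude = abs(r)
        # Full weight only for gross errors beyond the threshold.
        weight = 1.0 if magnitude > threshold else small_scale
        total += weight * magnitude
    return total / len(residuals)


# Example: per-keypoint 2D errors in pixels (made-up numbers).
errors = [1.0, 10.0]
loss_ignore_small = threshold_scaled_loss(errors, threshold=5.0)
loss_soft = threshold_scaled_loss(errors, threshold=5.0, small_scale=0.5)
```

With `small_scale=0.0`, the 1-pixel error contributes nothing and only the 10-pixel error is penalized, so many poses with small 2D error are scored identically. That is the ambiguity the abstract says the tokenized pose prior is meant to resolve.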