TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Published 25 Apr 2024 in cs.CV | (2404.16752v1)

Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de.

Summary

The paper introduces TokenHMR, which uses Threshold-Adaptive Loss Scaling (TALS) to counteract camera projection biases and improve 3D pose accuracy.
It employs a tokenized pose representation with a Vector Quantized-VAE that confines predictions to a finite set of valid human poses, enhancing robustness.
Experimental results on benchmarks like EMDB and 3DPW show that TokenHMR significantly reduces 3D errors compared to previous state-of-the-art models.

Advancing Human Mesh Recovery with Tokenized Pose Representation

Introduction

The central challenge in estimating 3D human pose and shape (HPS) from a single image is achieving high accuracy in the 3D pose while keeping it aligned with the 2D image evidence. Recent methods face a paradox: as the accuracy of the 2D keypoint fit increases, the accuracy of the 3D pose declines. The paper attributes this to two causes: biases inherent in pseudo-ground-truth (p-GT) data and the errors introduced by approximate camera projection models. To mitigate these issues, it presents TokenHMR, which combines a new loss, Threshold-Adaptive Loss Scaling (TALS), with a tokenized representation of human pose, and reports improved 3D accuracy over the state of the art.

Analysis of Key Challenges

Existing 3D HPS methods typically minimize a 2D keypoint reprojection loss, but because the camera model is only approximate, driving this loss down can degrade the 3D pose. The paper demonstrates this on the BEDLAM dataset: projecting the ground-truth 3D poses with the assumed (incorrect) camera parameters yields significant 2D projection errors. In other words, even a perfect 3D pose does not reproduce the 2D keypoints under the approximate camera, so forcing a close 2D fit pushes the 3D pose away from the truth.
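The effect can be illustrated with a toy pinhole camera. The sketch below is not the paper's analysis: the joint coordinates, focal lengths, and the function names `project` and `reprojection_gap` are all hypothetical. It assumes the approximate camera compensates for its wrong focal length by moving the body to a depth that preserves overall image scale, which leaves a residual 2D error for joints that are nearer or farther than the root.

```python
import math

def project(joints, focal, root_depth):
    """Pinhole projection of root-relative joints (x, y, dz) placed at root_depth."""
    return [(focal * x / (root_depth + dz), focal * y / (root_depth + dz))
            for x, y, dz in joints]

def mean_pixel_error(a, b):
    """Mean Euclidean distance between two lists of 2D points."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def reprojection_gap(joints, true_focal, assumed_focal, root_depth):
    """2D error of the *correct* 3D pose when projected with an assumed focal length."""
    # Assumption: the approximate camera picks the depth that matches overall scale.
    assumed_depth = root_depth * assumed_focal / true_focal
    observed = project(joints, true_focal, root_depth)
    reprojected = project(joints, assumed_focal, assumed_depth)
    return mean_pixel_error(observed, reprojected)

# Hypothetical joints in metres: (x, y, depth offset from the root).
JOINTS = [(0.0, -0.6, 0.0), (0.4, -0.2, -0.5), (-0.3, 0.7, 0.4)]

gap = reprojection_gap(JOINTS, true_focal=1000.0, assumed_focal=5000.0, root_depth=3.0)
print(f"mean 2D gap with the correct 3D pose: {gap:.1f} px")
```

With matching focal lengths the gap is zero; with mismatched ones, only a distorted 3D pose could close it, which is the failure mode described above.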

Innovations in TokenHMR

Threshold-Adaptive Loss Scaling (TALS)

TALS addresses the observation that, because of camera and p-GT biases, reducing the 2D keypoint and p-GT losses below a certain threshold does not improve 3D pose accuracy and can degrade it. TALS therefore scales each loss term according to whether it exceeds a threshold, which is derived from the baseline errors measured on ground-truth data: errors above the threshold are penalized, while smaller ones are scaled down.
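A minimal sketch of such a thresholded loss follows. This summary does not give the exact functional form used in the paper, so the piecewise rule, the `below_scale` parameter, and the function names are assumptions made for illustration only.

```python
def tals_scale(loss, threshold, below_scale=0.0):
    """Keep a loss term at full weight above the threshold, down-weight it below.

    below_scale is a hypothetical factor: 0.0 ignores small errors entirely,
    a small positive value keeps a weak penalty on them.
    """
    return loss if loss > threshold else below_scale * loss

def tals_total(losses, thresholds, below_scale=0.0):
    """Sum per-term losses after threshold-dependent scaling."""
    return sum(tals_scale(l, t, below_scale) for l, t in zip(losses, thresholds))

# Hypothetical per-joint 2D errors (pixels) and per-joint thresholds.
errors = [12.0, 1.5, 0.4]
thresholds = [3.0, 3.0, 3.0]
total = tals_total(errors, thresholds, below_scale=0.1)
print(total)
```

Here only the first joint, whose error exceeds its threshold, contributes at full weight; the other two are scaled by `below_scale` so the optimizer has little incentive to overfit them.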

Tokenization of Human Pose

To further reduce the ambiguity in predicting 3D pose from 2D keypoints, TokenHMR introduces a token-based pose representation. A Vector Quantized Variational Autoencoder (VQ-VAE), trained on extensive motion capture data, discretizes human poses into a sequence of tokens. Pose estimation then becomes token prediction, which restricts the model's outputs to the space of valid poses and improves both accuracy and robustness to occlusion or partial visibility.
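The quantization step at the heart of a VQ-VAE can be sketched as a nearest-neighbour lookup in a codebook. The example below is illustrative only: the two-dimensional codebook, the latent vectors, and the function names are hypothetical, and a real model would learn the codebook together with an encoder and decoder rather than hard-code it.

```python
import math

def quantize(latent, codebook):
    """Return (token index, code vector) of the codebook entry nearest to latent."""
    index = min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))
    return index, codebook[index]

def tokenize(latents, codebook):
    """Map each continuous latent vector to a discrete token index."""
    return [quantize(z, codebook)[0] for z in latents]

def detokenize(tokens, codebook):
    """Map token indices back to their code vectors (input to a decoder)."""
    return [codebook[t] for t in tokens]

# Hypothetical learned codebook and encoder outputs.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4)]

tokens = tokenize(latents, CODEBOOK)
print(tokens)
print(detokenize(tokens, CODEBOOK))
```

Because every prediction is an index into the codebook, the decoded output is always composed of entries learned from real motion data, which is how a discrete representation can keep predictions within the space of plausible poses.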

Experimental Results

Experiments on the EMDB and 3DPW benchmarks show that TokenHMR reduces 3D error by substantial margins relative to existing state-of-the-art models, including HMR2.0. The evaluation indicates that both the tokenized pose representation and TALS contribute to more accurate 3D human pose estimates.

Implications and Future Directions

TokenHMR not only improves the accuracy of 3D HPS models but also opens avenues for future research. The tokenization of human poses offers an interesting parallel to large language models (LLMs), where a limited vocabulary of tokens can represent a vast space of information (here, human poses). Exploring more refined tokenization techniques and applying TALS to other model architectures could yield further insights and improvements. Integrating more accurate camera models, or models that adapt to the input data, could also improve the performance of 3D HPS systems.

In conclusion, the paper addresses a significant challenge in 3D human pose estimation and introduces methods that mitigate the biases induced by 2D projection errors and p-GT data, moving the field toward more accurate and robust HPS models.