
Pose Priors from Language Models

Published 6 May 2024 in cs.CV and cs.CL (arXiv:2405.03689v2)

Abstract: Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses, offering a scalable alternative to traditional methods that rely on human annotations or motion capture data. Our approach extracts contact-relevant descriptors from an LMM and translates them into tractable losses to constrain 3D human pose optimization. Despite its simplicity, our method produces compelling reconstructions for both two-person interactions and self-contact scenarios, accurately capturing the semantics of physical and social interactions. Our results demonstrate that LMMs can serve as powerful tools for contact prediction and pose estimation, offering an alternative to costly manual human annotations or motion capture data. Our code is publicly available at https://prosepose.github.io.


Summary

  • The paper introduces ProsePose, a zero-shot framework that uses large multimodal models to enforce physical contact constraints in 3D human pose estimation.
  • It converts language model–derived contact constraints into loss functions, reducing reliance on costly motion capture and manual annotations.
  • Experimental results show significant improvements in joint error reduction and contact point accuracy across datasets like Hi4D, FlickrCI3D, and CHI3D.

Pose Priors from Language Models

Overview

The paper "Pose Priors from Language Models" introduces ProsePose, a zero-shot pose optimization framework that leverages large multimodal models (LMMs) to enforce physical contact constraints in 3D human pose estimation. The key insight is that LMMs, pretrained on extensive image-text data, encode a semantic prior over human pose interactions. Exploiting this prior circumvents the expensive training data, such as motion capture recordings or manually annotated contact points, that state-of-the-art methods typically require.

Methodology

ProsePose operates in three stages:

  1. Pose Initialization: An initial estimate of the 3D pose is obtained using a regression-based model.
  2. Constraint Generation with LMM: An LMM generates contact constraints by analyzing the input image and outputting plausible physical contact points between different body parts. These constraints are then converted into loss functions.
  3. Constrained Pose Optimization: The generated loss functions, along with additional predefined losses, are used to refine the initial pose estimates to accurately reflect physical contact constraints.
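The second and third stages can be sketched with a toy loss term. This is an illustrative assumption, not the authors' implementation: `CONTACT_PAIRS` stands in for body-part pairs an LMM might name for an image, and `PART_TO_JOINT` is a hypothetical mapping from part names to joint indices (SMPL-style body models provide such mappings, but these indices are made up). The loss simply pulls the named parts together, and would be added to the optimizer's objective alongside reprojection and pose-prior terms.

```python
import numpy as np

# Hypothetical LMM output for an image of a hug: each entry names a body
# part on person 1 and the part on person 2 it appears to touch.
CONTACT_PAIRS = [("right_hand", "back"), ("left_hand", "back")]

# Illustrative lookup from part name to a joint index in a (24, 3) joint
# array; real body models define their own part-to-vertex/joint mappings.
PART_TO_JOINT = {"right_hand": 23, "left_hand": 22, "back": 9}

def contact_loss(joints1, joints2, pairs, part_to_joint):
    """Sum of squared 3D distances between parts declared to be in contact.

    joints1, joints2: (J, 3) arrays of joint positions for the two people.
    Minimizing this term during pose optimization drives the named body
    parts toward each other, enforcing the LMM's contact constraints.
    """
    loss = 0.0
    for part_a, part_b in pairs:
        diff = joints1[part_to_joint[part_a]] - joints2[part_to_joint[part_b]]
        loss += float(np.sum(diff ** 2))
    return loss

# Toy poses: person 1 at the origin, person 2 offset by one unit per axis.
j1 = np.zeros((24, 3))
j2 = np.ones((24, 3))
print(contact_loss(j1, j2, CONTACT_PAIRS, PART_TO_JOINT))  # 6.0
```

In the full pipeline this term would be differentiable (e.g. written with an autodiff framework) so that gradient-based refinement of the initial pose estimates can satisfy the constraints while staying close to the image evidence.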

Experimental Results

The authors validated ProsePose on several datasets, including Hi4D, FlickrCI3D, and CHI3D for two-person interactions, and MOYO for single-person complex yoga poses. The results demonstrate that ProsePose significantly improves over existing zero-shot baselines, reducing errors (PA-MPJPE) and increasing the percentage of correct contact points (PCC).

  • For Hi4D, ProsePose reduced the joint PA-MPJPE to 93mm from the heuristic baseline's 116mm.
  • On the FlickrCI3D dataset, ProsePose achieved a joint PA-MPJPE of 58mm and an average PCC of 79.9%, outperforming the heuristic baseline's 67mm and 77.8%, respectively.
  • On the CHI3D dataset, ProsePose achieved an average PCC of 75.8%, showing improvement over the heuristic baseline's 74.1%.
  • For MOYO, ProsePose maintained a comparable PA-MPJPE to the HMR2+opt baseline but significantly improved the PCC, indicating better recognition of self-contact points.

Implications

ProsePose demonstrates that LMMs can be effectively used to guide 3D human pose optimization, leveraging the semantic understanding embedded within these models. This approach can be applied without additional training, making it a practical solution for scenarios with limited access to annotated data.

Theoretically, this work highlights the potential of LMMs to reason about physical interactions from their pretraining data. Practically, it provides a flexible framework for improving pose estimation in applications such as human-computer interaction, animation, and robotics, where precise capture of human poses and contacts is crucial.

Future Directions

While ProsePose has shown promising results, the reliance on LMMs introduces potential issues such as hallucination and bias toward commonly represented poses in the training data. Future developments could explore:

  • Fine-tuning LMMs specifically for pose estimation tasks.
  • Integrating additional priors or constraints to mitigate hallucination effects.
  • Extending the method to more complex interactions involving more than two individuals.

Overall, this approach opens new avenues for enhancing pose estimation frameworks by incorporating rich semantic priors from language models, suggesting a broader utility of LMMs in computer vision tasks.
