- The paper introduces CUPS, a framework that integrates conformal prediction-based uncertainty quantification into 3D human pose and shape estimation.
- It employs a transformer-based architecture with ensemble augmentation and adversarial training to enhance mesh plausibility.
- Empirical results demonstrate superior accuracy and certified uncertainty bounds on standard benchmarks compared to prior methods.
The paper introduces CUPS, a novel framework for 3D human pose and shape estimation from monocular RGB video that integrates uncertainty quantification within the model using conformal prediction. The work addresses a critical gap in the field: while parametric human models and transformer-based estimators have enabled accurate mesh recovery from RGB video, robust, formalized uncertainty quantification—particularly under distribution shift and when data are non-exchangeable—remains underexplored.
Methodological Contributions
The CUPS framework builds upon a transformer-based human mesh reconstructor, adopting the GLoT (Global-to-Local Transformer) architecture. The core innovation lies in the introduction and end-to-end training of a Deep Uncertainty Function (DUF), which scores the plausibility of predicted 3D meshes. Several key components distinguish the proposed method:
- Ensemble Augmentation During Training: Leveraging intrinsic randomness in the transformer encoder (random frame masking), the model generates several mesh proposals for each input, enabling the DUF to learn to rank hypothesis quality.
- Adversarial Training for the Uncertainty Score: The DUF is optimized both to discriminate ground-truth versus predicted SMPL parameters and to adversarially influence the pose-shape network, encouraging generation of more plausible meshes.
- Non-Exchangeable Conformal Prediction: Recognizing that video data are not exchangeable (e.g., due to temporal correlations), the method adopts recent conformal prediction advances that calibrate with weighted quantiles. The calibrated DUF output sets a conformity threshold, enabling per-prediction uncertainty sets.
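The weighted-quantile calibration step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the geometric decay weights, the decay rate `rho`, and the stand-in `scores` array (in place of actual DUF outputs) are all assumptions.

```python
import numpy as np

def weighted_conformal_threshold(scores, alpha=0.1, rho=0.99):
    """Decay-weighted (1 - alpha) quantile of calibration nonconformity scores.

    Recent calibration points receive larger weights (rho^0 for the newest),
    one simple choice for temporally correlated video data.
    NOTE: a sketch of non-exchangeable conformal calibration, not the
    paper's exact procedure; `scores` stands in for DUF outputs.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    # Newest point (index n-1) gets weight rho^0 = 1; older points decay.
    w = rho ** np.arange(n - 1, -1, -1)
    # Normalize, reserving one unit of mass for the future test point.
    w_tilde = w / (w.sum() + 1.0)
    # Weighted quantile: smallest score whose cumulative weight >= 1 - alpha.
    order = np.argsort(scores)
    cum = np.cumsum(w_tilde[order])
    idx = np.searchsorted(cum, 1.0 - alpha)
    if idx >= n:
        return np.inf  # too little calibration mass: trivial threshold
    return scores[order][idx]

# Toy usage: calibration scores from a slowly drifting stream.
rng = np.random.default_rng(0)
cal_scores = rng.normal(loc=np.linspace(0.0, 1.0, 200), scale=0.2)
tau = weighted_conformal_threshold(cal_scores, alpha=0.1, rho=0.97)
```

At inference, any hypothesis whose score falls at or below `tau` is admitted to the prediction set.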
Theoretical Analysis
The paper develops a theoretical framework supporting the coverage guarantees of conformal calibration with non-exchangeable data. Two practical bounds on the miscoverage gap are derived:
- Periodic Distribution Change Bound: Under assumptions of periodic data shifts (e.g., subjects or actions in video datasets), a decay-weighted quantile threshold yields provably small coverage gaps as long as enough recent calibration data follows the latest changepoint.
- Beta Distribution Bound: By modeling the DUF output as a beta distribution, the miscoverage gap is bounded in terms of the change in proportions of conforming examples, further tightening theoretical guarantees in realistic settings.
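The calibration underlying both bounds can be written in the style of non-exchangeable conformal prediction (Barber et al., 2023). The notation below is illustrative rather than the paper's own: $s(\cdot,\cdot)$ is the nonconformity (DUF) score, $w_i = \rho^{\,n-i}$ are assumed decay weights, and the final inequality is a simplified form of the general coverage guarantee.

```latex
% Decay-weighted quantile threshold over calibration scores s(X_i, Y_i):
\hat{q} = \inf\Big\{ q \;:\; \sum_{i=1}^{n} \tilde{w}_i \,
           \mathbf{1}\{ s(X_i, Y_i) \le q \} \ge 1 - \alpha \Big\},
\qquad
\tilde{w}_i = \frac{w_i}{1 + \sum_{j=1}^{n} w_j}, \quad w_i = \rho^{\,n-i}.

% Coverage then degrades gracefully with the weighted distributional drift:
\mathbb{P}\big[ s(X_{n+1}, Y_{n+1}) \le \hat{q} \big]
  \;\ge\; 1 - \alpha \;-\; \sum_{i=1}^{n} \tilde{w}_i \, d_{\mathrm{TV}}(Z_i, Z_{n+1}).
```

The decay weights concentrate calibration mass on recent data, which is exactly what the periodic-change bound exploits after a changepoint.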
Empirical Evaluation
Comprehensive experiments are conducted on standard benchmarks: 3DPW, Human3.6M, and MPI-INF-3DHP. CUPS achieves the best or competitive scores across all evaluated metrics:
| Method | 3DPW PA-MPJPE (mm, ↓) | 3DPW MPJPE (mm, ↓) | 3DPW MPVPE (mm, ↓) | Accel (↓) | Human3.6M PA-MPJPE (mm, ↓) |
| --- | --- | --- | --- | --- | --- |
| VIBE | 57.6 | 91.9 | - | 25.4 | 53.3 |
| TCMR | 52.7 | 86.5 | 102.9 | 7.1 | 52.0 |
| GLoT | 50.6 | 80.7 | 96.3 | 6.6 | 46.3 |
| CUPS | 48.7 | 76.2 | 91.7 | 6.9 | 44.0 |
- On 3DPW, CUPS outperforms the prior best (GLoT) by 1.9mm PA-MPJPE, 4.5mm MPJPE, and 4.6mm MPVPE.
- When 3DPW is excluded from the training data, the improvement remains consistent, supporting the method's generalizability.
Ablation studies confirm the positive impact of training-time ensembling, the adversarial DUF loss, and careful tuning of the uncertainty loss hyperparameter. The empirical coverage of the uncertainty sets is within two percentage points of the target level (1−α), and the weighted-quantile calibration notably outperforms vanilla conformal prediction.
Practical Implementation and System Considerations
The model is trained using 16-frame sequences. Due to ensemble augmentation and adversarial scoring, resource requirements are nontrivial: with 20 ensemble proposals per data point, a V100 GPU with 20GB memory and a high-memory CPU workstation are necessary. Training proceeds for 100 epochs with a moderate initial learning rate and cosine decay.
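The learning-rate schedule can be expressed as a small function. The paper states only "a moderate initial learning rate with cosine decay," so the specific `base_lr` and `min_lr` values below are assumptions for illustration.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=1e-6):
    """Cosine-decayed learning rate over the full training run.

    NOTE: illustrative values; the paper does not report the exact
    learning rate, so base_lr and min_lr here are assumptions.
    """
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```

With 100 epochs, `total_steps` would be the number of optimizer steps across all epochs; the rate starts at `base_lr` and decays smoothly to `min_lr`.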
For inference-time uncertainty quantification, Monte Carlo Dropout is used: sampled outputs are scored by the DUF, and conformal prediction sets are constructed by thresholding these scores against the calibrated conformity threshold. In practice, this enables multi-hypothesis, uncertainty-aware mesh recovery with explicit statistical coverage guarantees.
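The inference-time set construction reduces to filtering sampled hypotheses by their scores. The sketch below uses a mock scoring function in place of the learned DUF, random vectors in place of dropout-sampled SMPL parameters, and an assumed threshold `tau`; none of these are the authors' actual components.

```python
import numpy as np

def conformal_mesh_set(hypotheses, duf_score, tau):
    """Keep sampled hypotheses whose nonconformity score is below tau.

    `hypotheses` holds K candidate pose/shape parameter vectors (e.g., from
    stochastic forward passes with dropout enabled); `duf_score` stands in
    for the learned Deep Uncertainty Function.
    NOTE: a sketch of the inference-time procedure, not the paper's code.
    """
    scores = np.array([duf_score(h) for h in hypotheses])
    keep = scores <= tau
    return hypotheses[keep], scores

# Toy usage: 20 sampled "meshes" as 72-D SMPL-like pose vectors.
rng = np.random.default_rng(1)
samples = rng.normal(size=(20, 72))
# Mock DUF: normalized distance from a nominal "plausible" pose at zero.
duf = lambda h: float(np.linalg.norm(h) / np.sqrt(h.size))
tau = 1.1  # assumed threshold from the calibration step
kept, scores = conformal_mesh_set(samples, duf, tau)
```

Every hypothesis in `kept` is, by construction, inside the conformal prediction set at the calibrated level.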
Broader Implications and Future Directions
The paper's central claim is strong: CUPS provides per-sample, theoretically certified uncertainty sets for human mesh predictions, even when the training and test distributions differ and video-based data are highly dependent. This is highly relevant for safety-critical domains such as robotics, AR/VR, and autonomous vehicles, where it is necessary to know when predictions can be trusted.
The methodological framework should generalize to other structured prediction settings with non-exchangeable data, provided a suitable nonconformity score can be learned and diversity in outputs is available. The reliance on multiple proposals per input could be mitigated by sampling techniques or architecture modifications in future work. Additionally, integrating joint-level uncertainties or combining with physics-based or anatomical constraints could further enhance applicability, particularly for out-of-distribution robustness.
Conclusion
CUPS advances the field by integrating calibrated, distribution-free uncertainty quantification tightly with deep sequence-to-sequence pose-shape estimators. Through rigorous empirical validation and theoretical analysis, the work establishes new performance and reliability standards for 3D human mesh recovery from monocular video. This approach sets a precedent for reliable deployment of deep geometric perception modules in real-world, safety-sensitive applications.