
CameraHMR: Perspective-Aware 3D Recovery

Updated 2 February 2026
  • CameraHMR is a framework that leverages full perspective projection to accurately model foreshortening, wide-angle distortion, and off-axis subjects in 3D human recovery.
  • It integrates modules such as HumanFoV and dense keypoint detection to regress camera intrinsics and enhance SMPL fitting, significantly improving 2D alignment and mesh accuracy.
  • Variants like SynCHMR and W-HMR extend the approach by combining SLAM, multi-region analysis, and denoising to achieve robust, metric-scale reconstructions in dynamic environments.

CameraHMR is a framework and family of methods designed for accurate 3D human pose, shape, and camera parameter recovery from monocular images and videos, with particular emphasis on resolving depth ambiguities, realistic perspective projection, and tight camera-human-scene coupling. The term “CameraHMR” refers both to a specific model family (notably, CameraHMR in “Aligning People with Perspective” (Patel et al., 2024)) and a paradigm shift where camera intrinsics and extrinsics are jointly estimated and directly incorporated into the HMR pipeline. CameraHMR methods leverage perspective-aware architectures, field-of-view regression, dense surface keypoints, multi-stage calibration, and global-space optimization to address limitations of prior weak-perspective and isolated HMR approaches.

1. Methodological Foundations and Camera Modeling

CameraHMR replaces the weak-perspective camera with a full perspective pinhole projection, allowing precise modeling of foreshortening, wide-angle distortion, and off-axis subjects. In CameraHMR (Patel et al., 2024), the camera intrinsics (focal length f, principal point (c_x, c_y)) are regressed via a HumanFoV module, an HRNet-based network that predicts the vertical field of view ν from images, yielding

f_y = \frac{H}{2\tan(\nu/2)},

where H is the image height. A bounding-box token T_{\mathrm{bbox}} = (c_x/f, c_y/f, s/f) encapsulates spatial context. Predicted intrinsics are injected into both model inference and SMPLify fitting, enabling perspective reprojection:

\begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{Z}\begin{pmatrix} fX + c_x Z \\ fY + c_y Z \end{pmatrix}.

This modeling contrasts with weak-perspective approaches that condense camera effects into a scale and translation, sacrificing realism for computational convenience.
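
The focal-from-FoV conversion and the full-perspective projection above can be sketched as follows; the function names and the example numbers are illustrative, not taken from the paper's code:

```python
import numpy as np

def focal_from_fov(fov_v_rad: float, image_height: int) -> float:
    """Vertical focal length in pixels from a vertical field of view.

    Implements f_y = H / (2 * tan(nu / 2)) from the text.
    """
    return image_height / (2.0 * np.tan(fov_v_rad / 2.0))

def project_perspective(points_3d: np.ndarray, f: float,
                        cx: float, cy: float) -> np.ndarray:
    """Full-perspective pinhole projection of Nx3 camera-frame points.

    u = f*X/Z + cx, v = f*Y/Z + cy, equivalent to the matrix form above.
    """
    X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    u = f * X / Z + cx
    v = f * Y / Z + cy
    return np.stack([u, v], axis=-1)

# Example: a point 1 m off-axis at 5 m depth, 60-degree vertical FoV, 720p frame.
f_y = focal_from_fov(np.deg2rad(60.0), 720)   # about 623.5 px
uv = project_perspective(np.array([[1.0, 0.0, 5.0]]), f_y, 640.0, 360.0)
```

Unlike a weak-perspective model, the projected offset from the principal point here shrinks with depth Z, which is exactly the foreshortening effect the text describes.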

2. Field-of-View Regression and Intrinsic Estimation

CameraHMR integrates HumanFoV, a dedicated module that estimates the vertical field of view from "in-the-wild" person images. HumanFoV is trained on 500k Flickr images tagged with EXIF FocalLengthIn35mmFormat, using an asymmetric squared error that penalizes overestimation more heavily. Evaluation across datasets (SPEC, 3DPW, EMDB, BEDLAM-Z) demonstrates HumanFoV's superior generalization, with FoV prediction error ranging from 5.0° to 7.9°, outperforming prior camera-calibration networks. Using the regressed intrinsics in fitting and inference yields substantial improvements in 2D alignment, MPJPE, and PVE compared to fixed or default focal lengths.
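
The asymmetric squared error can be sketched as a branch-weighted loss; the exact weighting form and the `over_weight` factor below are assumptions for illustration, not values from the paper:

```python
import numpy as np

def asymmetric_sq_error(pred_fov: np.ndarray, gt_fov: np.ndarray,
                        over_weight: float = 2.0) -> float:
    """Squared error that penalizes FoV overestimation more heavily.

    `over_weight` (> 1) up-weights positive errors; its value here is
    an illustrative placeholder.
    """
    err = pred_fov - gt_fov
    w = np.where(err > 0, over_weight, 1.0)  # heavier penalty when pred > gt
    return float(np.mean(w * err ** 2))
```

With `over_weight=2.0`, overshooting the ground truth by 1° costs twice as much as undershooting it by the same amount.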

3. SMPLify Enhancement with Perspective and Dense Keypoints

CameraHMR modifies SMPLify by incorporating both regressed camera intrinsics and a dense set of surface keypoints detected by a ViTPose-based module trained on BEDLAM. The fitting energy minimized over SMPL parameters (\theta, \beta) and translation t^{\mathrm{full}} is

E(\beta,\theta,t^{\mathrm{full}}) = \lambda_J \left\| \Pi_K\!\left(J_{3d}(\theta,\beta) + t^{\mathrm{full}}\right) - J_{2d}^{\mathrm{gt}} \right\|_2^2 + \lambda_S \left\| \Pi_K\!\left(S_{3d}(\theta,\beta) + t^{\mathrm{full}}\right) - S_{2d}^{\mathrm{gt}} \right\|_2^2 + \lambda_\beta \|\beta\|_2^2 + \lambda_{\mathrm{int}} \|V(\theta,\beta) - V_{\mathrm{init}}\|_2^2

Dense surface keypoints (138 per crop, chosen for curvature and coverage) provide richer geometric constraints than sparse joints, addressing the "average-looking body" pathology. Bootstrapping progressive iterations of pseudo-ground-truth (pGT) with improved camera and keypoint estimation yields continual gains in model realism and quantitative metrics.
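
The fitting energy above can be sketched in code. Here `project` stands in for the perspective projection Π_K with the regressed intrinsics, and the `lambdas` weights are illustrative placeholders (the paper's actual values are not given in this summary):

```python
import numpy as np

def fitting_energy(J3d, S3d, t_full, J2d_gt, S2d_gt, beta, V, V_init,
                   project, lambdas=(1.0, 1.0, 1e-3, 1.0)) -> float:
    """Sketch of the CameraHMR SMPLify objective from the text.

    J3d/S3d: (N,3) joints / dense surface points, t_full: (3,) translation,
    J2d_gt/S2d_gt: (N,2) detected 2D targets, V/V_init: mesh vertices.
    """
    lam_J, lam_S, lam_beta, lam_int = lambdas
    E_J = np.sum((project(J3d + t_full) - J2d_gt) ** 2)    # sparse joint reprojection
    E_S = np.sum((project(S3d + t_full) - S2d_gt) ** 2)    # dense surface keypoints
    E_beta = np.sum(beta ** 2)                             # shape regularizer
    E_int = np.sum((V - V_init) ** 2)                      # stay near network initialization
    return float(lam_J * E_J + lam_S * E_S
                 + lam_beta * E_beta + lam_int * E_int)
```

In practice this objective is minimized with a gradient-based optimizer over (θ, β, t); the dense-keypoint term E_S is what distinguishes this fit from classic sparse-joint SMPLify.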

4. CameraHMR Variants and Multi-Component Pipelines

Other CameraHMR-related approaches include:

  • SynCHMR (Zhao et al., 2024): A two-stage system coupling Human-aware Metric SLAM (using camera-frame HMR meshes as priors for SLAM scale, depth calibration, and masking) and a Scene-aware SMPL Denoiser (Transformer denoiser conditioned on spatio-temporal scene encoding). This framework solves scale and dynamic ambiguities in SLAM and yields metric-scale reconstructions of cameras, dense 3D point clouds, and human meshes in a unified global frame.
  • W-HMR (Yao et al., 2023): Integrates a weakly-supervised calibration branch that regresses focal length from body-distortion cues (eliminating the need for precise focal-length labels), plus an OrientCorrect module that decouples pose and orientation learning, ensuring world-space mesh recovery is robust to camera-calibration errors.
  • Multi-RoI HMR (Nie et al., 2024): Estimates multiple local camera parameters from diverse region crops and imposes analytic camera consistency and contrastive losses, reducing camera-mesh ambiguity and improving mesh accuracy for each crop.
  • DiffOpt (Heo et al., 2024): Applies a motion diffusion prior within a multi-stage optimization loop to disentangle human and camera motion on dynamic video sequences, using SLAM for camera motion initialization and neural MLP fields for temporally smooth latent trajectories.

5. Optimization, Training, and Implementation Practices

CameraHMR and derivatives employ multi-stage training curricula. Initial models are trained on synthetic (AGORA) and multi-view datasets (BEDLAM), then iteratively improved by pseudo-label bootstrapping. Typical architectures include ViTPose or HRNet backbones (often pretrained on COCO), Transformer decoders for joint token fusion, and MLP regression for camera parameters. SMPL parameter heads output 6D rotations for pose, 10D shape, and optionally 3D joint or mesh offsets.
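The 6D pose representation mentioned above is conventionally mapped to a rotation matrix via Gram-Schmidt orthogonalization (Zhou et al., 2019); a minimal sketch of that standard conversion:

```python
import numpy as np

def rot6d_to_matrix(x6: np.ndarray) -> np.ndarray:
    """Convert a 6D rotation representation to a 3x3 rotation matrix.

    Gram-Schmidt construction commonly used for SMPL pose heads:
    the two 3-vectors are orthonormalized and completed by a cross product.
    """
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)           # first column: normalized a1
    b2 = a2 - np.dot(b1, a2) * b1          # remove the b1 component from a2
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                  # third column completes a right-handed frame
    return np.stack([b1, b2, b3], axis=-1)
```

This representation is continuous, which is why regression heads prefer it over axis-angle or quaternions; the output is always a valid rotation regardless of the raw 6D values.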

Losses combine perspective 2D reprojection, dense keypoint alignment, SMPL regularization, orientation correction (in decoupled fashion, e.g., in W-HMR), and analytic camera consistency (in Multi-RoI). Optimizers span Adam and AdamW with batch sizes of 16–128, and training durations range from 5 days (CameraHMR on 8 × V100 GPUs) to 1 week for larger curricula.

6. Evaluation, Benchmarks, and Quantitative Comparisons

Benchmarking is performed on datasets such as COCO (PCK alignment), 3DPW (MPJPE, PA-MPJPE, PVE), EMDB, SPEC-SYN, RICH, and SSP-3D (shape accuracy). CameraHMR achieves best-in-class metrics:

| Method | 3DPW PA-MPJPE (mm) | EMDB MPJPE (mm) | SPEC-SYN PVE (mm) |
|---|---|---|---|
| CLIFF [25] | 43.0 | 103.5 | 139.0 |
| HMR2.0a [24] | 44.4 | 97.8 | 133.3 |
| TokenHMR [26] | 43.8 | 88.1 | 110.5 |
| CameraHMR | 38.7 | 73.2 | 79.1 |

In SSP-3D shape evaluation, CameraHMR (BEDLAM+4DH) achieves a PVE-T-SC of 11.6 mm, outperforming SHAPY (19.2 mm) and STRAPS (15.9 mm). Perspective-aware methods yield large gains for wide-angle and heavily distorted images, with HumanFoV-predicted focal lengths improving PVE by up to 37 mm on SPEC-SYN.

7. Synergistic Human-Mesh/Camera/Scene Recovery and Applications

CameraHMR frameworks such as SynCHMR demonstrate tight coupling between metric SLAM, human mesh recovery, and scene reconstruction. The system calibrates monocular depths using camera-frame human priors, then masks dynamic regions in bundle adjustment, and finally learns world-frame SMPL meshes via scene-conditioned denoising. The resultant pipeline reconstructs consistent camera trajectories, human motion, and point clouds in a unified metric world frame, enabling robust downstream AR/VR, sports, and clinical applications.

CameraHMR enables recovery from monocular inputs in uncontrolled settings, supporting dynamic camera motion (DiffOpt (Heo et al., 2024), SynCHMR (Zhao et al., 2024)), automatic camera calibration (HumanFoV, W-HMR), and multi-view simulation via dense keypoints. These advances drive the field toward in-the-wild, perspective-correct human mesh recovery, consistently outperforming weak-perspective and camera-agnostic baselines.
