CameraHMR: Monocular Human Mesh Recovery
- CameraHMR is a family of techniques for monocular human mesh recovery that jointly optimizes human pose and camera parameters for metric-scale 3D reconstructions.
- It integrates full-perspective models through direct intrinsics estimation, weak supervision, and dense ray-map encoding, often combined with SLAM/VO for scale recovery.
- Empirical results demonstrate enhanced mesh realism, faster inference, and robust performance under wide-field, off-center, and moving-camera conditions.
CameraHMR designates a family of approaches for monocular human mesh recovery (HMR) that explicitly model camera intrinsics and extrinsics, align reconstructed people with the true imaging geometry, and yield accurate metric-scale 3D pose and shape in camera or world coordinates. Unlike weak-perspective or scale-agnostic HMR, CameraHMR approaches use learned or inferred full-perspective camera models and jointly optimize (or regress) both person- and camera-related parameters, often in conjunction with SLAM or VO modules to recover absolute scene scale. These models deliver improved mesh realism, global trajectory fidelity, and more robust 3D understanding, especially under wide-field-of-view, off-center, or moving-camera scenarios.
1. CameraHMR Architectural Paradigms
CameraHMR architectures are unified by their explicit modeling of the camera, but implementations differ in their backbone, degree of end-to-end learning, and integration of SLAM/VO. Notable paradigms include:
- Full-perspective Intrinsics Integration: CameraHMR (Patel et al., 2024) augments a ViT-based HMR2.0 network with direct field-of-view (FoV) estimation (HumanFoV), encoding per-image focal length as a learnable quantity, and introduces perspective projection throughout the pseudo-ground-truth (pGT) and model training/fitting pipeline.
- Weakly-supervised Intrinsics Calibration: In W-HMR (Yao et al., 2023), camera intrinsics (focal length, rotation) are regressed from body distortion features via a weakly-supervised head, leveraging MSE losses across full/crop joint reprojection to supply geometric signals.
- Ray-map Encoding: MetricHMR (Zhang et al., 11 Jun 2025) encodes camera intrinsics and bbox/crop geometry as a dense ray-map, feeding both image and ray-map tokens to a joint transformer regressor.
- Camera-Motion Disentanglement and SLAM/VO Fusion: Multiple CameraHMR instances (e.g., HAC (Yang et al., 2024), DiffOpt (Heo et al., 2024), WHAC (Yin et al., 2024), SynCHMR (Zhao et al., 2024)) employ a two-stage or modular workflow. An HMR network recovers camera-frame SMPL pose/shape, while an independent monocular SLAM/VO provides up-to-scale camera poses, which are then scale-calibrated using human mesh outputs as virtual calibration targets.
Key elements are summarized in the following table:
| Key Component | Example Models | Distinctive Mechanism |
|---|---|---|
| Intrinsics Est. | CameraHMR (Patel et al., 2024), W-HMR | Direct FoV/focal regressor; weak/full supervision |
| Perspective Fusion | MetricHMR (Zhang et al., 11 Jun 2025) | Ray-map encodes intrinsics and crop geometry for transformer input |
| SLAM/VO Fusion | HAC, WHAC, SynCHMR | Mesh–SLAM scale alignment; human-aware metric SLAM |
| Motion Prior | DiffOpt (Heo et al., 2024) | Test-time diffusion optimization, neural-motion-fields |
2. Camera Modeling and Calibration
CameraHMR models operate under the full-perspective projection model: $\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} f_x \frac{X}{Z} + c_x \\ f_y \frac{Y}{Z} + c_y \end{pmatrix}, \quad K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$
Precise estimation of the pixel focal lengths $f_x, f_y$ is essential for metric-accurate 3D recovery. CameraHMR (Patel et al., 2024) employs HumanFoV, an HRNet-MLP regressor trained on 500k EXIF-annotated images, to predict the vertical field of view $\theta_{\text{FoV}}$, converting it to focal length via $f = \frac{H}{2\tan(\theta_{\text{FoV}}/2)}$, where $H$ is the image height in pixels.
Camera tokens (bounding box geometry, focal length) are fused as additional transformer tokens for enhanced perspective awareness.
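The full-perspective model and the FoV-to-focal conversion above can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not from any CameraHMR codebase):

```python
import numpy as np

def fov_to_focal(fov_v_deg: float, img_h: int) -> float:
    """Convert a vertical field of view (degrees) to a pixel focal
    length: f = H / (2 * tan(FoV / 2))."""
    return img_h / (2.0 * np.tan(np.radians(fov_v_deg) / 2.0))

def project(points_3d: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Full-perspective projection of Nx3 camera-frame points with
    intrinsics K; returns Nx2 pixel coordinates."""
    uvw = points_3d @ K.T              # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth Z

# Example: 60-degree vertical FoV on a 1280x720 image
f = fov_to_focal(60.0, 720)
K = np.array([[f, 0.0, 640.0],
              [0.0, f, 360.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.0, 0.0, 2.0]])      # a point 2 m straight ahead
uv = project(pts, K)                   # lands on the principal point
```

A point on the optical axis projects to the principal point $(c_x, c_y)$ regardless of depth, which is a quick sanity check for any perspective-projection implementation.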
W-HMR (Yao et al., 2023) introduces a weakly-supervised calibration head that estimates the focal length based on body shape distortion, optimizing reprojection error simultaneously on tightly-cropped and full images rather than relying on labeled intrinsics.
MetricHMR (Zhang et al., 11 Jun 2025) further demonstrates that without correct perspective modeling, 2D–3D ambiguities cannot be resolved up to global translation or scale.
3. Joint Human–Camera Parameter Recovery and Disentanglement
CameraHMR variants jointly recover human mesh and camera extrinsics/intrinsics, employing techniques such as:
- Camera–Body Decoupling: Several architectures decouple human mesh (SMPL or SMPL-X) regression from camera parameter estimation, enabling stable training and error correction. For instance, W-HMR uses a two-stage decoupling with an OrientCorrect MLP to refine the global orientation after body and camera extraction (Yao et al., 2023).
- Dense Keypoints and SMPLify Extensions: Dense surface keypoint detection, as introduced in CameraHMR (Patel et al., 2024), supplies richer surface and pose constraints, reducing average-shape bias. CameraHMR modifies SMPLify (“CamSMPLify”) to fit both dense and sparse 2D constraints under the predicted perspective intrinsics.
- Neural Motion Fields and Temporal Optimization: DiffOpt (Heo et al., 2024) realizes temporal consistency and camera–human disentanglement by representing pose, orientation, and translation as neural functions of normalized time, optimizing them under both a diffusion prior and a robust reprojection loss.
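As a concrete illustration of camera-aware fitting, the sketch below recovers a global body translation by minimizing perspective reprojection error with Gauss-Newton. This covers only the translation step of a SMPLify-style objective (pose, shape, priors, and robust losses are omitted), and all names are illustrative rather than taken from any published codebase:

```python
import numpy as np

def project(points, K):
    """Perspective projection of Nx3 camera-frame points to Nx2 pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def fit_translation(joints_3d, joints_2d, K, iters=30):
    """Gauss-Newton fit of a global translation t so the perspectively
    projected 3D joints match the 2D detections."""
    fx, fy = K[0, 0], K[1, 1]
    t = np.array([0.0, 0.0, 4.0])                # init in front of camera
    for _ in range(iters):
        p = joints_3d + t                        # translated joints
        r = (project(p, K) - joints_2d).ravel()  # 2N residuals [du, dv, ...]
        J = np.zeros((2 * len(p), 3))            # Jacobian of residuals wrt t
        Z = p[:, 2]
        J[0::2, 0] = fx / Z                      # du/dtx
        J[0::2, 2] = -fx * p[:, 0] / Z**2        # du/dtz
        J[1::2, 1] = fy / Z                      # dv/dty
        J[1::2, 2] = -fy * p[:, 1] / Z**2        # dv/dtz
        t = t - np.linalg.solve(J.T @ J, J.T @ r)
    return t
```

Under a full-perspective model, the depth component of the translation is observable from the reprojection error alone, which is exactly what weak-perspective formulations give up.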
4. Scale Recovery, SLAM/VO Fusion, and Global Trajectories
A fundamental challenge in monocular settings is recovery of absolute metric scale and the disambiguation of human and camera motion. CameraHMR methods embrace several synergistic techniques:
- HAC: Human-as-Checkerboard Scale Calibration (Yang et al., 2024) calibrates monocular SLAM trajectories via the ratio of SMPL-inferred foot-joint depth (absolute) to the SLAM’s nearest depth estimate, robustly aggregated over time and both feet. This corrects the SLAM scale, aligns camera and human pose, and enables world-coordinate mesh extraction through simple coordinate manipulations.
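A minimal sketch of this kind of scale calibration, assuming a median over frames and feet as the robust aggregator (the function name, array layout, and choice of median are illustrative, not HAC's exact formulation):

```python
import numpy as np

def calibrate_slam_scale(smpl_foot_depths, slam_foot_depths):
    """Use the human body as a metric reference: take the per-frame,
    per-foot ratio of SMPL-inferred (metric) depth to the up-to-scale
    SLAM depth, and aggregate robustly with a median."""
    ratios = np.asarray(smpl_foot_depths) / np.asarray(slam_foot_depths)
    return float(np.median(ratios))

# SLAM depths uniformly half the true metric depths -> scale factor 2
metric = np.array([[2.0, 2.1], [2.2, 2.3], [2.4, 2.5]])  # (frame, foot)
slam = 0.5 * metric
scale = calibrate_slam_scale(metric, slam)   # → 2.0
```

Multiplying the SLAM trajectory and map by this scalar puts camera and human motion in a common metric frame.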
- WHAC: Visual Odometry and Human Motion Fusion (Yin et al., 2024) regresses metric root-velocity in the canonical space via MotionVelocimeter, and fuses the VO trajectory (Umeyama alignment) and the human mesh-based estimate to resolve both camera and human world trajectories.
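The Umeyama alignment used to fuse trajectories has a standard closed form; below is a compact NumPy version (the fusion logic around it in WHAC is more involved, and variable names here are ours):

```python
import numpy as np

def umeyama(src, dst):
    """Closed-form similarity alignment (Umeyama, 1991): returns scale s,
    rotation R, translation t minimizing ||dst_i - (s * R @ src_i + t)||.
    Here src would be an up-to-scale VO trajectory and dst the metric
    trajectory implied by the human motion estimate."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

The recovered scale s is precisely the quantity a monocular VO system cannot observe on its own, which is why a metric human-motion estimate can serve as the calibration signal.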
- SynCHMR: Human-Aware Metric SLAM (Zhao et al., 2024) uses person-centric depth/size priors from camera-frame SMPL to metric-calibrate depth for SLAM, masks out dynamic human pixels, and then denoises world-frame SMPL sequences via a scene-aware transformer conditioned on a dense point cloud.
All of these methods achieve strong global accuracy, substantially outperform “local-to-global” optimization methods in both efficiency and error, and remain stable across a wide range of camera and scene geometries.
5. Training Strategies, Loss Functions, and Datasets
CameraHMR models employ multi-stage and/or multi-modal training protocols:
- Iterative pseudo-GT refinement: CameraHMR (Patel et al., 2024) alternates between CamSMPLify-based pGT generation under predicted intrinsics and iterative re-training on expanded datasets including BEDLAM, AGORA, and 4DHumans.
- Losses: Integrated losses span 2D joint/vertex reprojection (with perspective geometry), 3D joint/vertex L2, SMPL parameter regression, camera parameter regression, and motion prior/temporal smoothness (DiffOpt (Heo et al., 2024)). Weak or loose supervision terms control camera-parameter head initialization.
Evaluation is conducted on benchmarks such as 3DPW, EMDB, SPEC-SYN, Human3.6M, MPI-INF-3DHP, and synthetic datasets (WHAC-A-Mole, SSP-3D), with MPJPE/PA-MPJPE, PVE, and global alignment metrics (W-MPJPE, WA-MPJPE, G-MPJPE) as core metrics.
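The core error metrics can be stated precisely in code; below is a minimal NumPy version of MPJPE and Procrustes-aligned PA-MPJPE for a single frame with joints as rows:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (same units as the inputs)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: optimally align the prediction to the
    ground truth with a similarity transform (scale, rotation,
    translation) before measuring joint error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    xp, xg = pred - mu_p, gt - mu_g
    U, D, Vt = np.linalg.svd(xg.T @ xp)         # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # avoid reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (xp ** 2).sum()
    aligned = s * xp @ R.T + mu_g
    return mpjpe(aligned, gt)
```

Because PA-MPJPE factors out global scale and orientation, it isolates articulated pose quality, whereas the global metrics (W-MPJPE, WA-MPJPE, G-MPJPE) additionally penalize trajectory and scale errors that CameraHMR-style methods are designed to reduce.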
Empirical highlights:
| Model | 3DPW PA-MPJPE (mm) | Other benchmark (mm) |
|---|---|---|
| CameraHMR (Patel et al., 2024) | 38.5 | 43.7 (EMDB PA-MPJPE) |
| MetricHMR (Zhang et al., 11 Jun 2025) | 37.5 | 52.4 (EMDB PA-MPJPE) |
| W-HMR (Yao et al., 2023) | — | 78.3 (AGORA NMVE) |
| HAC/CameraHMR (Yang et al., 2024) | — | 53.7 (EMDB PA-MPJPE) |
6. Temporal Coherency and Scene-aware Regularization
Advanced CameraHMR frameworks integrate temporal and scene constraints for improved human motion recovery:
- Diffusion/Transformer Priors: Motion diffusion-guided optimization (DiffOpt (Heo et al., 2024)) leverages a pretrained U-Net denoiser to regularize global human mesh sequences, enforcing biomechanically plausible, temporally smooth motion and reducing foot-sliding and drift.
- Scene-aware Denoising: SynCHMR (Zhao et al., 2024) enhances world-frame mesh outputs by injecting dynamic scene point clouds into a multi-layer transformer denoiser, producing outputs with lower body–scene penetrations and improved contact/spatial plausibility.
- Temporal Velocity Regression: WHAC (Yin et al., 2024) regresses metric root velocities to implicitly constrain global trajectories.
These innovations address issues of drift, contact realism, and camera–human disentanglement that typify real video sequences in the wild.
7. Performance Outcomes and Empirical Significance
Collectively, CameraHMR models set state-of-the-art performance for monocular 3D human mesh recovery under challenging camera conditions. Notable empirical properties include:
- Substantial reduction in body/trajectory error compared to weak-perspective or optimization-heavy global recovery methods (e.g., in (Yang et al., 2024), WA-MPJPE and W-MPJPE improved by over 50% on EMDB2).
- Orders-of-magnitude faster inference than optimization-based approaches when using calibration or world-recovery modules (e.g., HAC (Yang et al., 2024), SynCHMR (Zhao et al., 2024)).
- Robustness against field-of-view drift, off-center captures, and dynamic-camera egocentric videos due to explicit camera conditioning.
- Enhanced generalizability in world coordinates, as evidenced by transfer to synthetic and real benchmarks with high variation in camera parameters.
Ongoing extensions incorporate richer scene priors, physics-based constraints, IMU/event camera input, and full multi-view scalability, further expanding the scope and applicability of CameraHMR models in vision and graphics research.