PMPose: Probabilistic Mask-Conditioned Pose Estimator
- PMPose is a probabilistic, mask-conditioned 2D pose estimator that models keypoint location, presence, occlusion, and quality through explicit posterior estimation.
- It fuses instance segmentation masks with Vision Transformer features using a gating mechanism to enhance performance and border stability in crowded scenes.
- Empirical evaluations show PMPose outperforms previous methods on challenging datasets like OCHuman while matching state-of-the-art results on COCO.
PMPose is a probabilistic, mask-conditioned top-down 2D pose estimator that integrates instance segmentation masks and a fully probabilistic formulation for keypoint modeling. Developed as the central 2D module for BBoxMaskPose v2 (BMPv2), PMPose improves pose estimation in crowded scenes without sacrificing performance in standard settings. By unifying the mask-conditioning mechanism of MaskPose with the posterior-based joint localization of ProbPose, PMPose models not only locations but also joint presence, occlusion, and expected localization quality within a single architecture (Purkrabek et al., 21 Jan 2026).
1. Mathematical Formulation and Probabilistic Modeling
PMPose replaces standard heatmap regression with an explicit posterior estimation for each keypoint, incorporating the following elements:
- A $(H \cdot W + 1)$-way softmax per keypoint over the $H \times W$ spatial positions plus an out-of-image (oob) bin, enabling the network to reason about missing or occluded joints.
- For an image crop $I$, instance mask $M$, and bounding box $b$, the feature tensor is computed as $F = \phi(I) \odot g(M)$, where $\phi$ represents the Vision Transformer backbone up to the last block, $g$ is a convolutional embedding of the mask, and $\odot$ denotes pointwise feature gating.
- Output heads from $F$ include:
- Softmax heatmap head: $p_k(x)$ for each keypoint $k$ and position $x$, including an oob class;
- Per-keypoint presence probability $\hat{p}_k$;
- Per-keypoint visibility score $\hat{v}_k$;
- Per-keypoint expected OKS (object keypoint similarity) $\widehat{\mathrm{OKS}}_k$.
The training objective is maximum likelihood, with the total loss summing a cross-entropy term for the softmax heatmaps (including the oob bin), BCE terms for presence and visibility, and an MSE term for expected OKS. The explicit inclusion of out-of-image and presence/visibility variables confers border stability and provides calibrated confidence estimates (Purkrabek et al., 21 Jan 2026).
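The loss composition above can be sketched in PyTorch. This is an illustrative reconstruction, not the paper's code: the function name, tensor layout, and the OKS loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def pmpose_losses(logits, presence_logit, visibility_logit, oks_pred,
                  target_idx, target_presence, target_visibility, target_oks):
    """Hypothetical sketch of PMPose's per-keypoint training losses.

    logits:     (B, K, H*W + 1) spatial logits plus one out-of-image (oob) bin
    target_idx: (B, K) index of the ground-truth cell, or H*W for oob
    """
    B, K, C = logits.shape
    # (H*W + 1)-way cross-entropy over spatial cells plus the oob bin
    loc_loss = F.cross_entropy(logits.reshape(B * K, C), target_idx.reshape(B * K))
    # Binary cross-entropy for joint presence and visibility
    presence_loss = F.binary_cross_entropy_with_logits(presence_logit, target_presence)
    visibility_loss = F.binary_cross_entropy_with_logits(visibility_logit, target_visibility)
    # MSE regression of the expected OKS
    oks_loss = F.mse_loss(oks_pred, target_oks)
    # The 0.1 weight is illustrative; the paper's weights are not reproduced here
    return loc_loss + presence_loss + visibility_loss + 0.1 * oks_loss
```

Casting localization as classification over cells plus an oob class is what lets the network assign probability mass to "this joint is not in the crop" instead of hallucinating a peak.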
2. Mask-Conditioning and Feature Fusion Mechanism
Instance segmentation masks are integrated via a mask encoder and gating mechanism:
- The mask is embedded using a three-layer convolutional encoder to match the spatial and channel dimensions of the final ViT output.
- The feature fusion is realized by channel-wise scaling: $F = F_{\mathrm{ViT}} \odot \big(1 + \gamma \cdot E(M)\big)$, where $\gamma$ are learned per-channel scalars and $E(M)$ is the mask embedding.
- This approach enables the network to attend to mask boundaries and suppress background activations effectively, benefiting performance in crowded and overlapping scenarios without the need for more complex fusion blocks.
This design follows the principles established in MaskPose, providing a direct, low-complexity integration of mask information with ViT features (Purkrabek et al., 21 Jan 2026).
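A minimal sketch of this encoder-plus-gating design, assuming a ViT with patch size 16 and a binary mask at crop resolution; the layer widths, kernel sizes, and class name are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskGate(nn.Module):
    """Illustrative mask encoder + channel-wise gating (architecture assumed).

    A three-layer strided conv encoder embeds the binary instance mask down to
    the ViT token grid; learned per-channel scalars gate the ViT features.
    """
    def __init__(self, vit_dim=768):
        super().__init__()
        # Total stride 4 * 2 * 2 = 16 matches a ViT patch size of 16
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(256, vit_dim, kernel_size=2, stride=2),
        )
        # Learned per-channel gating scalars, zero-initialized so the module
        # starts as an identity on the ViT features
        self.gamma = nn.Parameter(torch.zeros(vit_dim))

    def forward(self, vit_feats, mask):
        # vit_feats: (B, C, H, W) tokens reshaped to a grid; mask: (B, 1, 16H, 16W)
        m = self.encoder(mask)
        return vit_feats * (1.0 + self.gamma.view(1, -1, 1, 1) * m)
```

Zero-initializing the gate means training starts from plain ViT features and learns how strongly to modulate them with the mask, which is one common way to add conditioning without destabilizing a pretrained backbone.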
3. Network Architecture and Training Protocol
The PMPose architecture supports ViT-S/B/L/H backbones, with the mask-conditioned features split into four parallel output heads for the various estimated quantities:
- Heatmap head: a convolutional layer producing per-keypoint logits (plus the oob bin) at each spatial location.
- Presence and visibility heads: global average pooling, followed by a two-layer MLP and sigmoid activation.
- Expected OKS head: shares architecture with the previous classifiers, regressing a real-valued OKS estimate.
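The four parallel heads can be sketched as follows. This is a hedged reconstruction: the MLP widths, the 1×1 heatmap convolution, and the way the oob logit is produced are assumptions consistent with the description above, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class PMPoseHeads(nn.Module):
    """Sketch of the four parallel output heads (layer sizes are assumptions)."""
    def __init__(self, dim=768, num_keypoints=17):
        super().__init__()
        # Heatmap head: 1x1 conv producing per-keypoint spatial logits
        self.heatmap = nn.Conv2d(dim, num_keypoints, kernel_size=1)
        # One oob logit per keypoint, so the softmax runs over H*W + 1 bins
        self.oob = nn.Linear(dim, num_keypoints)
        def mlp():
            # Global average pooling feeds a two-layer MLP with one output per keypoint
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, num_keypoints))
        self.presence, self.visibility, self.oks = mlp(), mlp(), mlp()

    def forward(self, feats):
        pooled = feats.mean(dim=(2, 3))              # global average pooling
        spatial = self.heatmap(feats).flatten(2)     # (B, K, H*W)
        logits = torch.cat([spatial, self.oob(pooled).unsqueeze(-1)], dim=-1)
        return {
            "logits": logits,                                  # (B, K, H*W + 1)
            "presence": torch.sigmoid(self.presence(pooled)),  # (B, K)
            "visibility": torch.sigmoid(self.visibility(pooled)),
            "oks": torch.sigmoid(self.oks(pooled)),            # expected OKS in [0, 1]
        }
```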
Training is performed in three phases:
- Pretraining on a composite dataset: COCO, AIC, and MPII.
- CropAugment: random shifts and rescalings that push keypoints outside the crop, teaching the out-of-image head, with border transparency applied.
- Fine-tuning with mask perturbations (random dilation/erosion) for robustness against detection or segmentation errors.
Optimization uses AdamW with a cosine learning-rate schedule (batch size 256, 8 GPUs, 90 epochs).
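The third-phase mask perturbations can be implemented with morphological dilation and erosion; max-pooling is a standard way to express both on binary masks. The function name and kernel range below are assumptions, a sketch of the idea rather than the paper's augmentation code.

```python
import torch
import torch.nn.functional as F

def perturb_mask(mask, max_kernel=7):
    """Randomly dilate or erode a binary mask of shape (B, 1, H, W).

    Illustrative stand-in for PMPose's fine-tuning perturbations, which
    simulate imperfect detector/segmenter masks; the kernel range is assumed.
    """
    # Pick a random odd kernel size in {3, 5, ..., max_kernel}
    k = 2 * torch.randint(1, max_kernel // 2 + 1, (1,)).item() + 1
    if torch.rand(1).item() < 0.5:
        # Dilation: max-pooling grows the foreground region
        return F.max_pool2d(mask, k, stride=1, padding=k // 2)
    # Erosion: max-pooling the inverted mask shrinks the foreground
    return 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=k // 2)
```

Training against such perturbed masks is what makes the gating robust when the upstream segmenter over- or under-segments the person.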
4. Empirical Performance and Ablation
PMPose demonstrates enhanced performance on benchmarks, notably in crowded scenarios:
- On OCHuman (crowded), PMPose-B attains 48.2 AP, outperforming MaskPose-B (46.6) and ViTPose*-B (44.1).
- On COCO (standard), PMPose-B reaches 76.9 AP, matching state-of-the-art results.
- The model generalizes robustly to crowding without losing accuracy on standard datasets.
Ablation studies indicate that removing the probabilistic heads (falling back to MSE heatmap regression) reduces OCHuman AP by ~1.5 points, while eliminating mask-conditioning costs an additional ~0.8 points. This confirms that both probabilistic posterior modeling and mask conditioning are necessary for optimal performance in dense scenes (Purkrabek et al., 21 Jan 2026).
5. Role in BMPv2 and 3D Pose Estimation Workflow
Within BMPv2, PMPose's outputs are leveraged as follows:
- Per-joint visibility ($\hat{v}_k$) ranks keypoints for SAM-pose2seg prompting, with visibility emerging as the optimal prompt selector.
- Expected OKS ($\widehat{\mathrm{OKS}}_k$) is aggregated to compute pose-level confidence, supporting AP sorting and 2D non-maximum suppression, thereby replacing heuristic box scores.
- Presence probabilities ($\hat{p}_k$) determine joint suppression prior to segmentation prompting.
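The three downstream uses above can be sketched together. The function, its signature, and the 0.5 presence threshold are illustrative assumptions about how BMPv2 might consume PMPose's per-keypoint outputs, not its actual API.

```python
import torch

def pose_confidence_and_prompts(presence, visibility, oks, num_prompts=3, thr=0.5):
    """Illustrative downstream use of PMPose outputs (interface assumed).

    presence, visibility, oks: (K,) per-keypoint scores in [0, 1].
    """
    # Suppress keypoints the presence head considers absent
    keep = presence > thr
    # Pose-level confidence: mean expected OKS over retained keypoints,
    # replacing heuristic box scores for AP sorting and 2D NMS
    confidence = oks[keep].mean() if keep.any() else torch.tensor(0.0)
    # Rank retained keypoints by visibility to pick segmentation prompts
    vis = torch.where(keep, visibility, torch.zeros_like(visibility))
    prompts = vis.argsort(descending=True)[:num_prompts]
    return confidence, prompts
```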
BMPv2 uses final instance masks and bounding boxes, produced by PMPose, to prompt the 3D model (e.g., SAM-3D-Body). The 3D backbone encodes the image globally; instance masks, box-pooled features, and pooled PMPose heatmaps are concatenated and fed to a multi-layer perceptron to regress SMPL parameters. Training of the 3D module applies both reprojection loss (on 2D joints from 3D output) and a shape prior regularizer.
Empirical evaluation shows that leveraging PMPose outputs in BMPv2 improves 3D pose reprojection AP from 39.9 to 46.4 on OCHuman with RTMDet-l, establishing that improvements in 2D pose quality, especially under occlusion, translate directly into enhanced 3D mesh recovery (Purkrabek et al., 21 Jan 2026).
6. Summary and Significance
PMPose unifies mask-conditioned feature processing and probabilistic keypoint estimation, advancing the state of the art in pose estimation under crowded and occluded conditions. The architecture's capacity to explicitly model missing joints, produce calibrated confidence measures, integrate segmentation masks, and propagate richer 2D signals into 3D estimation workflows distinguishes its approach from deterministic regression-based predecessors.
By achieving 48.2 AP on OCHuman and matching leading scores on COCO, PMPose underpins the BMPv2 pipeline's performance gains: BMPv2 becomes the first method to exceed 50 AP on OCHuman, and multi-person 3D pose estimation becomes markedly more robust under severe occlusion (Purkrabek et al., 21 Jan 2026).