PMPose: Probabilistic Mask-Conditioned Pose Estimator
- PMPose is a probabilistic, mask-conditioned 2D pose estimator that models keypoint location, presence, occlusion, and quality through explicit posterior estimation.
- It fuses instance segmentation masks with Vision Transformer features using a gating mechanism to enhance performance and border stability in crowded scenes.
- Empirical evaluations show PMPose outperforms previous methods on challenging datasets like OCHuman while matching state-of-the-art results on COCO.
PMPose is a probabilistic, mask-conditioned top-down 2D pose estimator that integrates instance segmentation masks and a fully probabilistic formulation for keypoint modeling. Developed as the central 2D module for BBoxMaskPose v2 (BMPv2), PMPose improves pose estimation in crowded scenes without sacrificing performance in standard settings. By unifying the mask-conditioning mechanism of MaskPose with the posterior-based joint localization of ProbPose, PMPose models not only locations but also joint presence, occlusion, and expected localization quality within a single architecture (Purkrabek et al., 21 Jan 2026).
1. Mathematical Formulation and Probabilistic Modeling
PMPose replaces standard heatmap regression with an explicit posterior estimation for each keypoint, incorporating the following elements:
- A $(H \cdot W + 1)$-way softmax per keypoint over the $H \times W$ spatial positions plus an out-of-image (oob) bin, enabling the network to reason about missing or occluded joints.
- For an image crop $I$, instance mask $M$, and bounding box $b$, the feature tensor is computed as $F = \phi(I) \odot g(M)$, where $\phi$ represents the Vision Transformer backbone up to the last block, $g$ is a convolutional embedding of the mask, and $\odot$ denotes pointwise feature gating.
- Output heads from $F$ include:
- Softmax heatmap head: $p_k(x)$ for each keypoint $k$ and position $x$, including an oob class;
- Per-keypoint presence probability $\hat{p}_k$;
- Per-keypoint visibility score $\hat{v}_k$;
- Per-keypoint expected OKS (object keypoint similarity) $\widehat{\mathrm{OKS}}_k$.
The training objective is maximum likelihood, with the total loss summing a cross-entropy term for the softmax heatmaps (including the oob bin), BCE terms for presence and visibility, and an MSE term for expected OKS. The explicit inclusion of out-of-image and presence/visibility variables confers border stability and provides calibrated confidence estimates (Purkrabek et al., 21 Jan 2026).
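The loss composition above can be sketched in PyTorch. This is an illustrative reconstruction, not the paper's code: the function name, tensor layout, and the OKS loss weight are assumptions.

```python
import torch
import torch.nn.functional as F

def pmpose_losses(logits, presence_logit, visibility_logit, oks_pred,
                  target_idx, target_presence, target_visibility, target_oks):
    """Hypothetical sketch of PMPose's per-keypoint training losses.

    logits:     (B, K, H*W + 1) spatial logits plus one out-of-image (oob) bin
    target_idx: (B, K) index of the ground-truth cell, or H*W for oob
    """
    B, K, C = logits.shape
    # (H*W + 1)-way cross-entropy over spatial cells plus the oob bin
    loc_loss = F.cross_entropy(logits.reshape(B * K, C), target_idx.reshape(B * K))
    # Binary cross-entropy for joint presence and visibility
    presence_loss = F.binary_cross_entropy_with_logits(presence_logit, target_presence)
    visibility_loss = F.binary_cross_entropy_with_logits(visibility_logit, target_visibility)
    # MSE regression of the expected OKS
    oks_loss = F.mse_loss(oks_pred, target_oks)
    # The 0.1 weight is illustrative; the paper's weights are not reproduced here
    return loc_loss + presence_loss + visibility_loss + 0.1 * oks_loss
```

Casting localization as classification over cells plus an oob class is what lets the network assign probability mass to "this joint is not in the crop" instead of hallucinating a peak.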
2. Mask-Conditioning and Feature Fusion Mechanism
Instance segmentation masks are integrated via a mask encoder and gating mechanism:
- The mask is embedded using a three-layer convolutional encoder to match the spatial and channel dimensions of the final ViT output.
- The feature fusion is realized by channel-wise scaling: $F = F_{\mathrm{ViT}} \odot \big(1 + \gamma \cdot E(M)\big)$, where $\gamma$ are learned per-channel scalars and $E(M)$ is the mask embedding.
- This approach enables the network to attend to mask boundaries and suppress background activations effectively, benefiting performance in crowded and overlapping scenarios without the need for more complex fusion blocks.
This design follows the principles established in MaskPose, providing a direct, low-complexity integration of mask information with ViT features (Purkrabek et al., 21 Jan 2026).
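A minimal sketch of this encoder-plus-gating design, assuming a ViT with patch size 16 and a binary mask at crop resolution; the layer widths, kernel sizes, and class name are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskGate(nn.Module):
    """Illustrative mask encoder + channel-wise gating (architecture assumed).

    A three-layer strided conv encoder embeds the binary instance mask down to
    the ViT token grid; learned per-channel scalars gate the ViT features.
    """
    def __init__(self, vit_dim=768):
        super().__init__()
        # Total stride 4 * 2 * 2 = 16 matches a ViT patch size of 16
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=4), nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=2, stride=2), nn.GELU(),
            nn.Conv2d(256, vit_dim, kernel_size=2, stride=2),
        )
        # Learned per-channel gating scalars, zero-initialized so the module
        # starts as an identity on the ViT features
        self.gamma = nn.Parameter(torch.zeros(vit_dim))

    def forward(self, vit_feats, mask):
        # vit_feats: (B, C, H, W) tokens reshaped to a grid; mask: (B, 1, 16H, 16W)
        m = self.encoder(mask)
        return vit_feats * (1.0 + self.gamma.view(1, -1, 1, 1) * m)
```

Zero-initializing the gate means training starts from plain ViT features and learns how strongly to modulate them with the mask, which is one common way to add conditioning without destabilizing a pretrained backbone.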
3. Network Architecture and Training Protocol
The PMPose architecture supports ViT-S/B/L/H backbones, with the mask-conditioned features split into four parallel output heads for the various estimated quantities:
- Heatmap head: a convolutional layer producing per-keypoint logits (plus the oob bin) at each spatial location.
- Presence and visibility heads: global average pooling, followed by a two-layer MLP and sigmoid activation.
- Expected OKS head: shares architecture with the previous classifiers, regressing a real-valued OKS estimate.
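The four parallel heads can be sketched as follows. This is a hedged reconstruction: the MLP widths, the 1×1 heatmap convolution, and the way the oob logit is produced are assumptions consistent with the description above, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class PMPoseHeads(nn.Module):
    """Sketch of the four parallel output heads (layer sizes are assumptions)."""
    def __init__(self, dim=768, num_keypoints=17):
        super().__init__()
        # Heatmap head: 1x1 conv producing per-keypoint spatial logits
        self.heatmap = nn.Conv2d(dim, num_keypoints, kernel_size=1)
        # One oob logit per keypoint, so the softmax runs over H*W + 1 bins
        self.oob = nn.Linear(dim, num_keypoints)
        def mlp():
            # Global average pooling feeds a two-layer MLP with one output per keypoint
            return nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, num_keypoints))
        self.presence, self.visibility, self.oks = mlp(), mlp(), mlp()

    def forward(self, feats):
        pooled = feats.mean(dim=(2, 3))              # global average pooling
        spatial = self.heatmap(feats).flatten(2)     # (B, K, H*W)
        logits = torch.cat([spatial, self.oob(pooled).unsqueeze(-1)], dim=-1)
        return {
            "logits": logits,                                  # (B, K, H*W + 1)
            "presence": torch.sigmoid(self.presence(pooled)),  # (B, K)
            "visibility": torch.sigmoid(self.visibility(pooled)),
            "oks": torch.sigmoid(self.oks(pooled)),            # expected OKS in [0, 1]
        }
```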
Training is performed in three phases:
- Pretraining on a composite dataset: COCO, AIC, and MPII.
- CropAugment: random shifts and rescalings that push keypoints outside the crop, teaching the out-of-image head, with border transparency applied.
- Fine-tuning with mask perturbations (random dilation/erosion) for robustness against detection or segmentation errors.
Optimization uses AdamW with a cosine learning-rate schedule (batch size 256, 8 GPUs, 90 epochs).
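The third-phase mask perturbations can be implemented with morphological dilation and erosion; max-pooling is a standard way to express both on binary masks. The function name and kernel range below are assumptions, a sketch of the idea rather than the paper's augmentation code.

```python
import torch
import torch.nn.functional as F

def perturb_mask(mask, max_kernel=7):
    """Randomly dilate or erode a binary mask of shape (B, 1, H, W).

    Illustrative stand-in for PMPose's fine-tuning perturbations, which
    simulate imperfect detector/segmenter masks; the kernel range is assumed.
    """
    # Pick a random odd kernel size in {3, 5, ..., max_kernel}
    k = 2 * torch.randint(1, max_kernel // 2 + 1, (1,)).item() + 1
    if torch.rand(1).item() < 0.5:
        # Dilation: max-pooling grows the foreground region
        return F.max_pool2d(mask, k, stride=1, padding=k // 2)
    # Erosion: max-pooling the inverted mask shrinks the foreground
    return 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=k // 2)
```

Training against such perturbed masks is what makes the gating robust when the upstream segmenter over- or under-segments the person.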
4. Empirical Performance and Ablation
PMPose demonstrates enhanced performance on benchmarks, notably in crowded scenarios:
- On OCHuman (crowded), PMPose-B attains 48.2 AP, outperforming MaskPose-B (46.6) and ViTPose*-B (44.1).
- On COCO (standard), PMPose-B reaches 76.9 AP, matching state-of-the-art results.
- The model generalizes robustly to crowding without losing accuracy on standard datasets.
Ablation studies indicate that removing the probabilistic heads (falling back to MSE heatmap regression) reduces OCHuman AP by ~1.5 points, while eliminating mask-conditioning costs an additional ~0.8 points. This confirms that both probabilistic posterior modeling and mask conditioning are necessary for optimal performance in dense scenes (Purkrabek et al., 21 Jan 2026).
5. Role in BMPv2 and 3D Pose Estimation Workflow
Within BMPv2, PMPose's outputs are leveraged as follows:
- Per-joint visibility ($\hat{v}_k$) ranks keypoints for SAM-pose2seg prompting, with visibility emerging as the optimal prompt selector.
- Expected OKS ($\widehat{\mathrm{OKS}}_k$) is aggregated to compute pose-level confidence, supporting AP sorting and 2D non-maximum suppression, thereby replacing heuristic box scores.
- Presence probabilities ($\hat{p}_k$) determine joint suppression prior to segmentation prompting.
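The three downstream uses above can be sketched together. The function, its signature, and the 0.5 presence threshold are illustrative assumptions about how BMPv2 might consume PMPose's per-keypoint outputs, not its actual API.

```python
import torch

def pose_confidence_and_prompts(presence, visibility, oks, num_prompts=3, thr=0.5):
    """Illustrative downstream use of PMPose outputs (interface assumed).

    presence, visibility, oks: (K,) per-keypoint scores in [0, 1].
    """
    # Suppress keypoints the presence head considers absent
    keep = presence > thr
    # Pose-level confidence: mean expected OKS over retained keypoints,
    # replacing heuristic box scores for AP sorting and 2D NMS
    confidence = oks[keep].mean() if keep.any() else torch.tensor(0.0)
    # Rank retained keypoints by visibility to pick segmentation prompts
    vis = torch.where(keep, visibility, torch.zeros_like(visibility))
    prompts = vis.argsort(descending=True)[:num_prompts]
    return confidence, prompts
```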
BMPv2 uses final instance masks and bounding boxes, produced by PMPose, to prompt the 3D model (e.g., SAM-3D-Body). The 3D backbone encodes the image globally; instance masks, box-pooled features, and pooled PMPose heatmaps are concatenated and fed to a multi-layer perceptron to regress SMPL parameters. Training of the 3D module applies both reprojection loss (on 2D joints from 3D output) and a shape prior regularizer.
Empirical evaluation shows that leveraging PMPose outputs in BMPv2 improves 3D pose reprojection AP from 39.9 to 46.4 on OCHuman with RTMDet-l, establishing that improvements in 2D pose quality, especially under occlusion, translate directly into enhanced 3D mesh recovery (Purkrabek et al., 21 Jan 2026).
6. Summary and Significance
PMPose unifies mask-conditioned feature processing and probabilistic keypoint estimation, advancing the state of the art in pose estimation under crowded and occluded conditions. The architecture's capacity to explicitly model missing joints, produce calibrated confidence measures, integrate segmentation masks, and propagate richer 2D signals into 3D estimation workflows distinguishes its approach from deterministic regression-based predecessors.
By achieving 48.2 AP on OCHuman and matching leading scores on COCO, PMPose underpins the BMPv2 pipeline's performance gains: BMPv2 becomes the first method to exceed 50 AP on OCHuman, and multi-person 3D pose estimation becomes markedly more robust under severe occlusion (Purkrabek et al., 21 Jan 2026).