BBoxMaskPose v2 (BMPv2)
- The paper introduces BMPv2, a multi-stage framework that integrates probabilistic, mask-conditioned 2D pose estimation with iterative mask refinement and a mutual conditioning loop.
- It achieves state-of-the-art performance on COCO and OCHuman benchmarks, surpassing 50 AP on OCHuman for the first time and boosting downstream 3D pose quality.
- The approach demonstrates that precise 2D keypoint inference and occlusion modeling are key to improving multi-person 3D pose estimation in crowded scenes.
BBoxMaskPose v2 (BMPv2) is a multi-stage pose estimation framework that advances multi-person 2D and 3D human pose estimation, particularly in scenes with significant person overlap. BMPv2 integrates a probabilistic, mask-conditioned 2D pose estimation head (PMPose), an enhanced mask refinement module based on the Segment Anything Model (SAM), and a tightly coupled mutual conditioning loop. BMPv2 sets new state-of-the-art (SOTA) performance on both COCO and OCHuman benchmarks, and for the first time surpasses 50 AP (average precision) on OCHuman. The method also establishes that improvements in 2D pose quality produce commensurate gains in subsequent 3D pose estimation, as verified on the newly introduced OCHuman-Pose dataset (Purkrabek et al., 21 Jan 2026).
1. Model Architecture and Mutual Conditioning
The BMPv2 pipeline operates as a modular sequence consisting of three core components: (A) object detection, (B) 2D pose estimation with PMPose, and (C) mask refinement using SAM-pose2seg.
PMPose Head:
PMPose is a top-down, probabilistic, mask-conditioned 2D pose estimator. Inputs are an image crop centered on a detected box and a binary instance mask $M$. The backbone uses ViT-S/B/L/H variants (following the ViTPose paradigm), producing feature maps $F$. Mask conditioning is realized through element-wise addition or multiplication of a learned embedding of $M$ with $F$. For each of the $K$ keypoints, the PMPose head outputs:
- Heatmap $H_k$ (a discrete probability distribution over pixel locations)
- Presence probability $p_k$
- Visibility probability $v_k$
- Predicted expected OKS $\hat{o}_k$
The full instance-level probabilistic output is $\{(H_k, p_k, v_k, \hat{o}_k)\}_{k=1}^{K}$, with each $H_k$ normalized to sum to one over pixel locations.
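To make the per-keypoint outputs concrete, the following numpy sketch decodes one keypoint from its probabilistic outputs. The argmax decoding and the product fusion of the three scalars into one confidence are illustrative assumptions, not the paper's exact decoding rule.

```python
import numpy as np

def decode_keypoint(heatmap, presence, visibility, exp_oks):
    # heatmap: (H, W) discrete probability distribution over pixel locations.
    # The product fusion below is an assumed confidence rule for illustration.
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    confidence = presence * visibility * exp_oks
    return (int(x), int(y)), float(confidence)

# Toy heatmap peaked at row 2, column 3.
hm = np.full((4, 5), 1e-3)
hm[2, 3] = 1.0
hm /= hm.sum()
loc, conf = decode_keypoint(hm, presence=0.9, visibility=0.8, exp_oks=0.7)
# loc == (3, 2)
```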
Mutual Conditioning Loop:
BMPv2 sequentially applies:
- Detector: Ingests image and previous background mask, outputs a new box.
- PMPose: Ingests image, current box, and previous mask, outputs keypoints and probabilities.
- SAM-pose2seg: Refines the mask, using pose-guided prompts derived from PMPose outputs.
BMPv2+ adds an additional pass of pose and mask refinement for further accuracy gains. Conditioning each module on others’ predictions iteratively "un-collapses" overlapping instances, resolving ambiguities in crowded settings.
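The sequential conditioning described above can be sketched as a small control-flow loop. The three callables are stand-ins for the detector, PMPose, and SAM-pose2seg, not the released BMPv2 API.

```python
def bmp_loop(image, detector, pmpose, sam_pose2seg, refine_passes=1):
    # Illustrative mutual-conditioning loop; each module ingests the
    # previous modules' outputs, as described in the text.
    mask = None                             # no prior mask on the first pass
    box = detector(image, mask)             # detector sees the background mask
    pose = None
    for _ in range(refine_passes):          # BMPv2+ runs an extra pass
        pose = pmpose(image, box, mask)     # keypoints + probabilities
        mask = sam_pose2seg(image, pose)    # pose-prompted mask refinement
    return box, pose, mask
```

With `refine_passes=2` this mirrors the BMPv2+ variant's extra pose-and-mask refinement pass.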
SAM-pose2seg Mask Refinement:
SAM-pose2seg is a specialized fine-tuning of SAM v2.1, with the image encoder frozen and the decoder trained only on the “human” class. Mask prompts are generated from PMPose outputs: the first point prompt is placed at the joint with the highest visibility probability, and further prompts target joints that fall inside current mask error regions.
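The first-prompt selection rule is simple to sketch: pick the joint with the highest visibility probability. Array shapes here are illustrative.

```python
import numpy as np

def select_point_prompt(keypoints, visibility):
    # keypoints: (K, 2) array of (x, y); visibility: (K,) probabilities.
    # Returns the point prompt and the index of the chosen joint.
    best = int(np.argmax(visibility))
    return tuple(keypoints[best]), best

kpts = np.array([[12, 40], [55, 60], [90, 33]])
vis = np.array([0.2, 0.9, 0.4])
prompt, idx = select_point_prompt(kpts, vis)
# prompt == (55, 60), idx == 1
```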
The mask segmentation loss is computed per mask, comparing each refined mask against its ground-truth instance mask.
2. Training Objectives and Loss Formulation
PMPose Loss:
Given ground-truth heatmaps $H^*_k$, presence $p^*_k$, visibility $v^*_k$, and OKS $o^*_k$, the PMPose learning objective supervises all four outputs jointly: $\mathcal{L}_{\text{PMPose}} = \sum_{k=1}^{K}\big[\mathrm{CE}(H_k, H^*_k) + \lambda_p\,\mathrm{BCE}(p_k, p^*_k) + \lambda_v\,\mathrm{BCE}(v_k, v^*_k) + \lambda_o\,|\hat{o}_k - o^*_k|\big]$.
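A combined objective over the four supervised outputs might look like the following numpy sketch for a single keypoint; the exact term forms and loss weights are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def pmpose_loss(H, H_gt, p, p_gt, v, v_gt, o, o_gt,
                lam_p=1.0, lam_v=1.0, lam_o=1.0):
    # Heatmap cross-entropy + BCE on presence/visibility + L1 on OKS.
    # Weights lam_* are illustrative hyperparameters.
    eps = 1e-8
    l_hm = -np.sum(H_gt * np.log(H + eps))
    def bce(q, t):
        return -(t * np.log(q + eps) + (1 - t) * np.log(1 - q + eps))
    return (l_hm + lam_p * bce(p, p_gt)
            + lam_v * bce(v, v_gt) + lam_o * abs(o - o_gt))
```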
Mask conditioning is injected via $F' = F + \phi(M)$ (additive) or $F' = F \odot \psi(M)$ (multiplicative), where $\phi$ and $\psi$ are learnable projections.
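The additive variant of mask conditioning can be sketched as follows; the linear projection matrix `W` and all shapes are stand-ins for the learned embedding, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_on_mask(F, M, W, additive=True):
    # Project the flattened binary mask with a (stand-in) learned matrix W,
    # reshape to the feature-map shape, then add or multiply element-wise.
    m_emb = (W @ M.ravel()).reshape(F.shape)
    return F + m_emb if additive else F * m_emb

F = rng.standard_normal((4, 4))                  # toy feature map
M = (rng.random((8, 8)) > 0.5).astype(float)     # toy binary instance mask
W = rng.standard_normal((16, 64)) * 0.01         # stand-in learned projection
out = condition_on_mask(F, M, W)                 # same shape as F
```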
3D Regression Losses:
BMPv2’s outputs can be used directly as prompts for 3D mesh recovery models (e.g., SAM-3D-Body), which predict 3D joints $J$ and shape codes $\beta$.
3. Mask-Conditioned Prompting for 3D Pose Recovery
BMPv2 establishes a direct pipeline for using 2D mask- and pose-conditioned descriptors to drive 3D pose and shape estimation. The 3D module concatenates flattened full-image features $F$, boxes $b$, masks $M$, and all heatmaps $H_{1:K}$ into a single prompt vector $z = [\mathrm{vec}(F);\, b;\, \mathrm{vec}(M);\, \mathrm{vec}(H_1);\, \dots;\, \mathrm{vec}(H_K)]$.
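The concatenation of prompt components can be sketched directly; the ordering and shapes below are illustrative assumptions, not the exact SAM-3D-Body interface.

```python
import numpy as np

def build_3d_prompt(features, box, mask, heatmaps):
    # Flatten each component and concatenate into one prompt vector.
    parts = [np.ravel(features), np.asarray(box, dtype=float).ravel(),
             np.ravel(mask)] + [np.ravel(h) for h in heatmaps]
    return np.concatenate(parts)

z = build_3d_prompt(np.zeros((2, 3)), [0, 0, 10, 20],
                    np.ones((4, 4)), [np.zeros((2, 2))] * 3)
# z.shape == (38,)  # 6 features + 4 box coords + 16 mask pixels + 3*4 heatmap bins
```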
Inclusion of mask prompts facilitates precise occlusion reasoning and boundary localization absent from box-only inputs. Reported experimental results show that on OCHuman-Pose, reprojection AP increases from approximately 75 (box prompt alone) to 85 when both mask and pose prompts are used. This suggests that mask conditioning is a principal driver of improvements in challenging crowd scenarios.
4. Empirical Performance on COCO and OCHuman
BMPv2 demonstrates substantial advances over prior art on standard pose estimation benchmarks. Key results, including comparisons to relevant baselines, are summarized below.
| Model | OCHuman-test AP | OCHuman-Pose-test AP | COCO-val AP |
|---|---|---|---|
| ViTPose-B | 37.5 | 67.8 | 76.4 |
| MaskPose-B | 46.6 | 78.0 | 76.8 |
| PMPose-B | 48.2 | 80.0 | 76.9 |
| BMPv2 | 51.5 | 83.4 | 78.8 |
| BMPv2+ | 55.8 | 86.8 | 78.1 |
BMPv2 is the first method to exceed 50 AP on OCHuman-test. BMPv2+ gains a further 4.3 AP on OCHuman-test in exchange for additional compute, underscoring the effectiveness of extended mutual conditioning.
5. Analysis with OCHuman-Pose and Error Attribution
The OCHuman-Pose dataset provides a more comprehensive multi-person testbed by adding annotations covering over 50% more subjects. It closes the typical gap between COCO-style AP with detected boxes and ground-truth box-based AP, ensuring that benchmarking more accurately reflects method differences.
Analysis reveals that using RTMDet-L + ViTPose* as the 2D baseline yields 3D reprojection AP near 76; when masks are refined via BMPv2 and supplied to SAM-3D-Body, 3D reprojection AP increases to around 85. Thus, in crowded settings, the BMP loop can mitigate detection errors from merged or missed boxes, shifting the 3D performance bottleneck to the quality of 2D keypoint association.
6. Significance and Implications
BMPv2 establishes that integrating probabilistic, mask-conditioned 2D pose heads with iterative mask refinement and tightly coupled module conditioning is highly effective for multi-person human pose estimation in crowded scenes. Improvements in 2D keypoint accuracy, rather than bounding box detection, are now the limiting factor for multi-person 3D pose estimation in complex images. A plausible implication is that future progress in the field will depend more on advancing 2D keypoint inference, association, and occlusion modeling than on purely detection-based approaches.
The released code, models, and the OCHuman-Pose dataset facilitate further research and deployment in both 2D and 3D multi-person understanding tasks (Purkrabek et al., 21 Jan 2026).