
OCHuman-Pose: Benchmark for Crowded Pose Estimation

Updated 28 January 2026
  • OCHuman-Pose is a comprehensive benchmark dataset for evaluating 2D and 3D human pose estimation in highly crowded and occluded scenes.
  • It augments the original OCHuman with over 50% more annotated persons and employs COCO-format annotations with continuous IoU distributions for realistic evaluation.
  • Empirical benchmarks using models like BMPv2 demonstrate significant AP improvements, underscoring the importance of precise 2D pose detection for reliable 3D human recovery.

OCHuman-Pose is a publicly available benchmark dataset designed for evaluation and development of 2D and 3D multi-person human pose estimation methods in highly crowded scenes. It serves as a direct superset and drop-in replacement for the standard OCHuman benchmark, massively expanding the set of annotated persons while retaining all underlying images and splits. OCHuman-Pose is specifically constructed to address the deficiencies in prior benchmarks with respect to occlusion, crowd density, and annotation completeness, providing an in-the-wild, COCO-style framework for evaluating pose estimation performance in real-world scenarios with extreme crowding and occlusion (Purkrabek et al., 21 Jan 2026).

1. Dataset Composition and Annotation Protocol

OCHuman-Pose is built on the exact same image set and "val"/"test" split structure as the original OCHuman release, but it augments these with annotations for all previously omitted person instances in every image. This augmentation increases the number of annotated persons by over 50%, while the number of images and split sizes remain unchanged.

Annotations strictly follow the COCO format. Each person instance is labeled with 17 keypoints (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) arranged as per the standard COCO skeleton graph. Keypoints have independent visibility flags $v_i \in \{0, 1, 2\}$, with 0 indicating not labeled, 1 labeled but occluded, and 2 labeled and visible. Bounding boxes tightly enclose each individual. Segmentation masks persist only for the original OCHuman-annotated people, not for newly added instances; consequently, the extra person instances in OCHuman-Pose are available only for keypoint localization and bounding-box evaluation.

| Annotation Type        | All Persons | Only Original OCHuman Subset |
|------------------------|-------------|------------------------------|
| Keypoints (17 + $v_i$) | ✓           | ✓                            |
| Bounding Boxes         | ✓           | ✓                            |
| Segmentation Masks     | –           | ✓                            |

✓ = available, – = not available
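The annotation layout above can be illustrated with a short sketch of decoding a COCO-format `keypoints` field into per-joint triplets. The annotation dict shown here is a hypothetical minimal example, not an actual OCHuman-Pose record:

```python
# Joint names in standard COCO keypoint order.
COCO_KPT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def parse_keypoints(ann):
    """Split a COCO `keypoints` list (x1, y1, v1, ..., x17, y17, v17)
    into per-joint (x, y, v) triplets keyed by joint name."""
    k = ann["keypoints"]
    return {
        name: (k[3 * i], k[3 * i + 1], k[3 * i + 2])
        for i, name in enumerate(COCO_KPT_NAMES)
    }

# Hypothetical annotation: only the nose is labeled and visible (v=2).
ann = {"keypoints": [100, 50, 2] + [0, 0, 0] * 16}
joints = parse_keypoints(ann)
labeled = {n for n, (_, _, v) in joints.items() if v > 0}
```

The visibility flag distinguishes unlabeled joints ($v=0$) from occluded-but-labeled ones ($v=1$), which matters because only $v > 0$ keypoints contribute to OKS scoring.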

2. Dataset Statistics and Scene Complexity

OCHuman-Pose is explicitly constructed to capture extreme crowding and occlusion. Unlike the original OCHuman, which retained only instances with no overlap (IoU = 0) or very high overlap (IoU > 0.5), OCHuman-Pose restores the full range of overlap, including the previously missing midrange ($0 < \mathrm{IoU} \leq 0.5$). The distribution of bounding-box intersections is therefore continuous and statistically representative of real crowded scenes, with both val and test splits exhibiting a mean IoU of 0.527 (vs. 0.545 in the original).
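The pairwise overlap statistic underlying these numbers is the standard bounding-box IoU. A minimal sketch, assuming boxes in COCO (x, y, w, h) format:

```python
def bbox_iou(a, b):
    """Intersection-over-union of two boxes in COCO (x, y, w, h) format."""
    ax1, ay1, aw, ah = a
    bx1, by1, bw, bh = b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Overlap extent along each axis, clamped at zero.
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Computing this over all person pairs in an image yields the per-split IoU distribution; the original OCHuman kept only the pairs at the two extremes of that distribution, while OCHuman-Pose keeps all of them.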

Typical scene complexity entails two or more persons per image, with frequent cases of multiple persons mutually occluding over half their bounding-box area, and keypoints that are often severely occluded or off-image. Pose and camera-viewpoint variation are unconstrained: persons can be oriented in any direction, perform arbitrary postures (e.g., crouching, sitting, crossed limbs), and interact non-trivially—scenarios that confound canonical top-down detectors.

A plausible implication is that OCHuman-Pose enables analyses under a broader and more challenging range of human-object and human-human interactions than prior pose estimation benchmarks.

3. Evaluation Metrics and Protocols

OCHuman-Pose employs the COCO object keypoint similarity (OKS) metric for pose evaluation. Average precision (AP) and OKS are computed as:

$$\mathrm{AP} = \int_0^1 p(r)\, dr$$

$$\mathrm{OKS} = \frac{\sum_i \exp\!\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\cdot \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

where $d_i$ is the Euclidean distance between predicted and ground-truth keypoint $i$, $s$ is the scale factor (square root of the enclosing bounding-box area), $k_i$ is a keypoint-specific normalization constant, and $v_i$ is the keypoint visibility flag. AP is computed as the mean of precision at OKS thresholds ranging from 0.50 to 0.95 in steps of 0.05. A detection qualifies as a true positive if its OKS exceeds the threshold and it does not duplicate a higher-scoring prediction.
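The OKS formula above can be sketched directly. This is a simplified illustration, not the official evaluation code; it uses the per-keypoint $\sigma_i$ constants published with the COCO keypoints API, with $k_i = 2\sigma_i$ as in that convention:

```python
import math

# Per-keypoint falloff constants sigma_i from the COCO keypoints API;
# the k_i in the OKS formula are conventionally 2 * sigma_i.
COCO_SIGMAS = [
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089,
]

def oks(pred, gt, vis, area):
    """OKS between a predicted and a ground-truth pose.
    pred, gt: lists of 17 (x, y) pairs; vis: visibility flags v_i;
    area: ground-truth bounding-box area, i.e. s^2 in the formula."""
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, vis, COCO_SIGMAS):
        if v > 0:  # delta(v_i > 0): only labeled keypoints count
            k = 2 * sigma
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            num += math.exp(-d2 / (2 * area * k ** 2))
            den += 1
    return num / den if den else 0.0
```

A perfect prediction scores 1.0, and the score decays with keypoint error at a rate scaled by person size and by the per-joint tolerance $k_i$.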

The revised test protocol eliminates the prior "false-positive" bias of the original OCHuman, wherein predictions for unannotated (but obviously present) people were penalized. OCHuman-Pose ensures every person is labeled, so detection models are not erroneously penalized for identifying valid, previously unannotated individuals.

4. Mask Annotations and Evaluation Constraints

Segmentation masks for ground-truth evaluation exist only for the original OCHuman-annotated persons. The newly added persons in OCHuman-Pose have bounding boxes and keypoints only; thus, mask-based average precision cannot be computed for these instances. Keypoint and bounding-box evaluation are possible for all instances. This limitation distinguishes between original and newly annotated people and necessitates careful design of evaluation protocols, especially for mask AP computations and for assessing pose estimation models that leverage segmentation information.
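A practical consequence is that evaluation pipelines must partition instances by whether a mask exists. A minimal sketch, assuming COCO-style annotation dicts in which the `segmentation` field is present only for the original OCHuman subset (the example records are hypothetical):

```python
def split_by_mask(annotations):
    """Partition COCO-style person annotations into those carrying a
    segmentation mask (original OCHuman subset, mask-AP evaluable) and
    those with keypoints/boxes only (newly added instances)."""
    with_mask, kpts_only = [], []
    for ann in annotations:
        if ann.get("segmentation"):
            with_mask.append(ann)
        else:
            kpts_only.append(ann)
    return with_mask, kpts_only

# Hypothetical annotations: id 1 has a polygon mask, id 2 does not.
anns = [
    {"id": 1, "segmentation": [[0, 0, 1, 0, 1, 1]], "keypoints": []},
    {"id": 2, "keypoints": []},  # newly added person: no mask
]
masked, unmasked = split_by_mask(anns)
```

Mask AP would then be computed over `masked` only, while keypoint and bounding-box AP use both subsets.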

5. Methodological Impact and Empirical Benchmarks

The availability of denser and more accurate annotations in OCHuman-Pose enables more reliable and sensitive benchmarking of pose estimation methodology, particularly for crowded scenes under severe occlusion. The integration of BBoxMaskPose v2 (BMPv2) and its enhanced mask-refinement module demonstrates a gain of more than 3 AP points over BMPv1 on OCHuman-Pose, achieving 82.4 pose-AP (val) and 83.4 (test); the BMPv2+ variant further increases these to 85.8 and 86.8, respectively.

On the original OCHuman splits, BMPv2 reaches 51.3 (val) and 51.5 (test) pose-AP, surpassing all prior models and exceeding 50 AP for the first time under these conditions. The refined OCHuman-Pose annotation schema reveals that 2D pose accuracy has a direct and dramatic effect on 3D human recovery: every point of 2D AP translates to an approximately one-point gain in 3D reprojection-AP for methods such as SAM-3D-Body. For example, SAM-3D-Body achieves 85.2 reprojection-AP when prompted by BMPv2 masks, compared to 74.9 with BMPv1 and only ~69.5 with vanilla detector boxes (Purkrabek et al., 21 Jan 2026). This demonstrates that precise 2D pose estimation is a critical limiting factor for downstream multi-person 3D reconstruction in heavily crowded scenarios.

6. Public Release and Data Accessibility

OCHuman-Pose is available at https://MiraPurkrabek.github.io/BBox-Mask-Pose/, where JSON annotation files for both validation and test splits can be downloaded. The distribution includes example visualizations and conversion tools for the COCO format. The dataset, together with the BMPv2 code, pretrained models, and related evaluation scripts, provides an in-the-wild standard for advancing pose estimation research under conditions of extreme crowd density and occlusion.
