SAM 3D Body: Volumetric 3D Estimation

Updated 1 January 2026
  • SAM 3D Body (3DB) is a framework for volumetric 3D body estimation and segmentation that uses pose-aware designs and modular encoder-decoder architectures.
  • It leverages shared, frozen backbones with distinct decoder heads to produce temporally consistent, physically plausible joint and skeleton representations.
  • The approach incorporates latent space smoothing and contact-aware optimization to address challenges such as monocular depth ambiguity and foot sliding.

SAM 3D Body (3DB) refers to a volumetric extension of the Segment Anything Model (SAM) paradigm that targets structured, pose-aware human 3D body estimation and segmentation. It leverages foundation model principles, repurposing strong 2D or slice-wise encoders to power out-of-the-box, robot-ready, or clinically relevant 3D body representations. The term encompasses both engineering frameworks for real-time estimation from monocular images and research architectures for volumetric medical or anatomical segmentation.

1. Architectural Foundations and Model Parameterization

SAM 3D Body (3DB) is characterized by modular design leveraging shared, frozen image encoders and distinct decoder heads. The approach described in "World-Coordinate Human Motion Retargeting via SAM 3D Body" instantiates 3DB as a dual-branch model: a full-body branch and a hand branch, both sharing a backbone (e.g., ResNet, ViT) (Tu et al., 25 Dec 2025).

For each video frame $t$, the model infers:

  • Shape parameters $β_{\text{shape},t}\in\mathbb{R}^{N_β}$ encoding identity;
  • Skeleton scale $γ_{\text{scale},t}\in\mathbb{R}$, parameterizing uniform bone-length scaling over the kinematic tree;
  • Latent codes $z_t^{\text{model}}, z_t^{\text{expr}}\in\mathbb{R}^d$ capturing pose and pose-dependent correctives;
  • Global root translation $T_{\text{root},t}\in\mathbb{R}^3$ and orientation $R_{\text{root},t}\in\mathrm{SO}(3)$.

Architectural separation ensures the same encoder supports distinct body and hand representations, each outputting Momentum HumanRig (MHR) parameters.

After per-frame inference, temporal consistency is enforced by averaging shape and scale parameters across tracked frames:

$$β_{\text{shape}}^{\text{final}} = \frac{1}{T}\sum_{t=1}^T β_{\text{shape},t}, \qquad γ_{\text{scale}}^{\text{final}} = \frac{1}{T}\sum_{t=1}^T γ_{\text{scale},t}$$

These "locked" values guarantee invariant bone lengths throughout a motion track.
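The averaging step can be sketched as follows (toy arrays stand in for real per-frame estimates; the actual $N_β$ is model-specific):

```python
import numpy as np

# Toy per-frame estimates (3 frames, 2 shape coefficients) standing in for
# the per-frame SAM 3D Body outputs.
beta_shape = np.array([[0.1, 0.3],
                       [0.2, 0.1],
                       [0.3, 0.2]])
gamma_scale = np.array([0.98, 1.00, 1.02])

# "Lock" identity over the track by averaging across all tracked frames,
# guaranteeing constant bone lengths for downstream retargeting.
beta_final = beta_shape.mean(axis=0)   # per-coefficient mean -> [0.2, 0.2]
gamma_final = gamma_scale.mean()       # scalar mean -> 1.0
```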

2. Intermediate Representation and Integration with Downstream Pipelines

3DB operates as a per-frame estimator, emitting parameters compatible with the Momentum Human Rig (MHR) representation. MHR decodes these parameters to a kinematic skeleton via linear blend skinning and pose-dependent corrective models, yielding joint positions $\{p_t^{(j)}\}$ and joint rotations $\{R_t^{(j)}\}$ (Tu et al., 25 Dec 2025). For fixed $(β_{\text{shape}}^{\text{final}}, γ_{\text{scale}}^{\text{final}})$, the resulting skeleton exhibits robot-friendly, strictly constant bone lengths. This design circumvents the mesh instabilities typical in naïve pose estimation, supporting downstream robotic retargeting and physics-based optimization.

All subsequent smoothing, contact reasoning, and physics-informed optimization steps leverage this joint-level skeleton, not the original dense mesh. Temporal stability and differentiability are preserved by construction.

3. Temporal Smoothing and Latent Space Optimization

Given the naturally noisy and frame-uncorrelated estimates typical of per-frame vision pipelines, temporal coherence is established through a sliding-window optimization in the low-dimensional MHR latent subspace. The objective consists of:

  • Latent fidelity:

$$\mathcal{L}_{\text{latent}} = \frac{1}{T} \sum_{t=1}^T \left(\|z_t^{\text{model}} - \hat{z}_t^{\text{model}}\|_2^2 + \|z_t^{\text{expr}} - \hat{z}_t^{\text{expr}}\|_2^2\right)$$

  • Kinematic smoothness:

$$v_t^{(j)} = p_{t+1}^{(j)} - p_{t}^{(j)}, \qquad a_t^{(j)} = p_{t+2}^{(j)} - 2p_{t+1}^{(j)} + p_t^{(j)}$$

$$\omega_t^{(j)} = \mathrm{Log}\!\left(R_{t+1}^{(j)} R_t^{(j)\top}\right), \qquad \alpha_t^{(j)} = \omega_{t+1}^{(j)} - \omega_t^{(j)}$$

Each is penalized by a Charbonnier function $\rho(x; \beta, \varepsilon) = \sqrt{\beta x^2 + \varepsilon}$.

  • Boundary consistency: Overlapping windows are stitched by enforcing alignment via $\mathcal{L}_{\text{bound}}$.

Total per-window loss:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{latent}}\,\mathcal{L}_{\text{latent}} + \mathcal{L}_{\text{smooth}} + \lambda_{\text{bound}}\,\mathcal{L}_{\text{bound}}$$

Optimization is performed per window, and decoded outputs yield a temporally stable, physically plausible skeleton sequence.
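The finite-difference smoothness terms and Charbonnier penalty above can be sketched as follows (positional terms only; the rotational terms follow the same pattern, and the $\beta$, $\varepsilon$ values here are illustrative):

```python
import numpy as np

def charbonnier(x, beta=1.0, eps=1e-6):
    # rho(x; beta, eps) = sqrt(beta * x^2 + eps): a smooth, robust penalty.
    return np.sqrt(beta * x**2 + eps)

def smoothness_loss(p):
    # p: (T, J, 3) joint positions decoded from the latent codes.
    v = p[1:] - p[:-1]                 # velocity (first differences)
    a = p[2:] - 2 * p[1:-1] + p[:-2]   # acceleration (second differences)
    return charbonnier(v).mean() + charbonnier(a).mean()

# A static pose has zero velocity and acceleration, so the loss reduces to
# its floor of 2 * sqrt(eps) = 0.002.
p_static = np.ones((5, 2, 3))
loss = smoothness_loss(p_static)
```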

4. Contact-Aware Root Trajectory and World-Coordinate Plausibility

Monocular vision induces global trajectory drift and physically implausible foot sliding. A soft foot–ground contact probability is introduced:

$$w_{\text{base}}^f = \exp\!\left(-d_f^2/2\sigma_h^2\right), \qquad \alpha^f = \mathrm{softmax}(k_{\text{contact}}\, w_{\text{base}}^f), \qquad p_c^f = w_{\text{base}}^f \cdot \alpha^f$$

where $d_f$ is the foot's height above the ground.
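A minimal numerical sketch of this soft contact weighting (the $\sigma_h$ and $k_{\text{contact}}$ values are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def contact_probability(d, sigma_h=0.05, k_contact=10.0):
    # d: (F,) heights of F foot points above the ground plane.
    w_base = np.exp(-d**2 / (2 * sigma_h**2))   # Gaussian height weight
    e = np.exp(k_contact * w_base)
    alpha = e / e.sum()                          # softmax over feet
    return w_base * alpha                        # p_c per foot

# Left foot on the ground, right foot 30 cm in the air: the contact
# probability concentrates almost entirely on the grounded foot.
d = np.array([0.0, 0.30])
p_c = contact_probability(d)
```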

The root translation $T_{\text{global}}(t)$ is optimized under energy terms for:

  • Foot sliding ($\mathcal{L}_{\text{slide}}$): penalize foot velocity when contact should be present,
  • Penetration ($\mathcal{L}_{\text{pen}}$): penalize sub-ground trajectories,
  • Contact ($\mathcal{L}_{\text{contact}}$): encourage $d_f \to 0$ at contact,
  • Auxiliary camera prior ($\mathcal{L}_{\text{aux}}$): keep the root in view when not in contact.

The aggregate objective:

$$E_{\text{ground}} = \lambda_{\text{phy}}\left(\mathcal{L}_{\text{slide}} + \mathcal{L}_{\text{pen}} + \mathcal{L}_{\text{contact}}\right) + \mathcal{L}_{\text{smooth}} + \lambda_{\text{aux}}\, \mathcal{L}_{\text{aux}}$$

This objective is minimized with the Adam optimizer, fixing $T_{\text{global}}(1) = 0$.
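As a toy illustration of this kind of contact-driven trajectory optimization, the following sketch optimizes a single scalar root-height offset with a hand-derived gradient; the actual method runs Adam over the full trajectory and all energy terms:

```python
import numpy as np

# Per-frame foot-height estimates for a foot that should be in contact.
foot_height = np.array([0.05, 0.04, 0.06])

# Minimize the contact term mean(d^2), d = foot_height + h, by gradient
# descent on the root-height offset h.
h = 0.0
lr = 0.1
for _ in range(200):
    d = foot_height + h
    grad = 2 * d.mean()     # d/dh of mean(d^2)
    h -= lr * grad
# h converges to -mean(foot_height) = -0.05, pushing the foot onto the ground.
```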

5. Retargeting for Embodied Humanoid Execution

The optimized MHR skeleton is ultimately mapped to a real robot (e.g., Unitree G1) via a two-stage kinematics-aware inverse kinematics pipeline (Tu et al., 25 Dec 2025):

  • Joint alignment: For each of 14 designated joints, compute

$$R_{\text{final}} = R_{z\text{-up}} \cdot F \cdot R_{\text{MHR}} \cdot R_{\text{offset}}$$

to enforce correspondence and align joint axes.

  • Height scaling: All joints are scaled so that the subject's height matches the robot's hip-to-head dimension.
  • Inverse kinematics: Two stages:

    1. Base + end-effectors: minimize $\sum_{\text{eff}} \|p_{\text{eff}}(q_1) - p_{\text{eff}}^*\|^2 + \lambda_1 \|q_1 - q_{\text{ref}}\|^2$.
    2. Intermediate joints: fix base and end-effectors, then solve $\min_{q_2} \sum_j \|p_j(q_2) - p_j^*\|^2$ subject to joint limits.

Both stages use Jacobian-based or Levenberg–Marquardt solvers, yielding robot-ready, dynamically feasible kinematic trajectories.
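A minimal damped least-squares (Levenberg–Marquardt-style) IK sketch on a planar 2-link chain illustrates the kind of solver both stages rely on; the chain and parameters are illustrative, not the Unitree G1 model:

```python
import numpy as np

def fk(q, l1=1.0, l2=1.0):
    # Forward kinematics of a planar 2-link chain: joint angles -> end-effector.
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q, l1=1.0, l2=1.0):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def ik_step(q, target, damping=1e-2):
    # Damped least-squares update: (J^T J + mu I) dq = J^T e.
    J = jacobian(q)
    e = target - fk(q)
    dq = np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ e)
    return q + dq

q = np.array([0.5, 0.5])
target = np.array([1.0, 1.0])   # reachable: |target| < l1 + l2
for _ in range(100):
    q = ik_step(q, target)
# fk(q) now matches the target end-effector position.
```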

6. Volumetric and Medical Extensions of SAM 3D Body

The foundational SAM 3D paradigm is not limited to articulated motion. Several works have pushed related architectures to whole-body medical segmentation:

  • SAM3D processes all slices with a frozen ViT encoder, aggregates per-slice features, and performs true 3D segmentation with a lightweight volumetric decoder. Mask prediction in the medical context optimizes a combination of Dice and cross-entropy losses, with performance competitive with UNETR++ and nnFormer while using orders of magnitude fewer parameters and compute (Bui et al., 2023).
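The combined Dice + cross-entropy objective can be sketched as follows (a per-voxel binary version for illustration; the actual SAM3D loss is multi-class and batched):

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    # probs:  (N,) predicted foreground probabilities for N voxels.
    # target: (N,) binary ground-truth mask.
    inter = (probs * target).sum()
    dice = 1.0 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    ce = -np.mean(target * np.log(probs + eps)
                  + (1 - target) * np.log(1 - probs + eps))
    return dice + ce

# A mostly correct prediction yields a small combined loss.
probs = np.array([0.9, 0.8, 0.2, 0.1])
target = np.array([1.0, 1.0, 0.0, 0.0])
loss = dice_ce_loss(probs, target)
```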

  • AutoProSAM and RefSAM3D employ efficient 3D adapters, hierarchical cross-modal prompting (e.g., via CLIP), and soft prompt generators or adapters, enabling both automatic and text-conditional full-body segmentation (Li et al., 2023, Gao et al., 2024).
  • CT-SAM3D proposes a promptable 3D transformer (ResT backbone), progressive spatial prompt encoding (PSAP), and cross-patch prompt (CPP) modules, achieving robust interactive segmentation of 107 anatomies, with real-time GPU operation and drastically reduced annotation burden (Guo et al., 2024).
  • Zero-shot SAM3D pipelines operate by slicing 3D volumes along multiple axes, prompting 2D SAMs with projections of user-drawn 3D polylines, and reconstituting the volume via 3D fusion and morphology. They offer competitive performance in medical settings without any model fine-tuning (Chan et al., 2024).
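The final fusion step of such zero-shot pipelines can be sketched as follows, assuming per-axis masks have already been produced and using a simple majority vote in place of the fusion-and-morphology stage described in the paper:

```python
import numpy as np

def fuse_axis_masks(masks):
    # masks: list of binary volumes, one per slicing axis; majority vote
    # keeps voxels marked foreground along most axes.
    stacked = np.stack(masks)
    return (stacked.mean(axis=0) >= 0.5).astype(np.uint8)

# Toy 2x2x2 volume "segmented" along three axes (per-axis results assumed
# precomputed by a 2D SAM over each slice stack).
m_axial    = np.ones((2, 2, 2), dtype=np.uint8)
m_coronal  = np.ones((2, 2, 2), dtype=np.uint8)
m_sagittal = np.zeros((2, 2, 2), dtype=np.uint8)
fused = fuse_axis_masks([m_axial, m_coronal, m_sagittal])  # 2-of-3 vote -> all 1
```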

7. Applications, Limitations, and Future Directions

SAM 3D Body approaches enable monocular human motion capture and retargeting to humanoid robots, whole-body medical and anatomical segmentation, and interactive or zero-shot volumetric annotation.

Limitations include:

  • Required memory and scaling complexity for very large volumes (necessitating windowing, selective encoding, or efficient adapter deployment) (Bui et al., 2023, Guo et al., 2024);
  • Heterogeneity in input contrast, slice thickness, and modality, which can degrade performance on out-of-domain cases;
  • Incomplete support for multi-modal (e.g., fused PET-CT) or sub-organ granularity, though recent designs are extending toward unified text and point prompts as future work (Guo et al., 2024, Gao et al., 2024);
  • For motion applications, monocular depth ambiguity and foot sliding still require careful optimization and physically inspired constraints (Tu et al., 25 Dec 2025).

Emerging directions focus on class-conditional and multi-modal prompt encoders, hierarchical and context-aware representation fusion, automatic prompt distillation, and seamless extension to "segment everything 3D" under medical or robotics priors (Li et al., 2023, Gao et al., 2024).


In sum, SAM 3D Body designates a class of foundation-model-informed, volumetric body estimation frameworks, leveraging frozen 2D/3D backbones, prompt encoding innovations, and kinematics- or context-aware decoders. Applications span robot motion retargeting, clinical segmentation, and general-purpose full-body analysis, with consistently demonstrated efficiency, robustness, and extensibility across current literature (Tu et al., 25 Dec 2025, Bui et al., 2023, Guo et al., 2024, Li et al., 2023, Chan et al., 2024, Gao et al., 2024).
