
V-HPOT: Virtual Hand Pose Optimisation

Updated 17 January 2026
  • V-HPOT is an innovative framework for egocentric 3D hand pose estimation using virtual camera depth normalization and self-supervised test-time optimization.
  • It reduces domain shift by transforming real camera parameters into a fixed virtual space, achieving up to 70.3% reduction in mean per-joint position error.
  • Its self-supervised re-optimization strategy adapts the model during inference with minimal computational overhead, ensuring high generalizability without target labels.

V-HPOT (“Virtual Camera-based Hand Pose Optimisation at Test-time”) is a framework for egocentric 3D hand pose estimation that achieves high cross-domain generalization without requiring labeled target-domain data. By introducing camera-agnostic depth normalization and a self-supervised test-time optimization strategy that leverages consistency in a virtual camera space, V-HPOT addresses domain shift caused by variation in camera intrinsics and scene conditions. The framework demonstrates pronounced improvements in mean per-joint position error (MPJPE) relative to prior methods, establishing new baselines for generalizability and data efficiency in egocentric hand pose estimation (Mucha et al., 10 Jan 2026).

1. Virtual Camera Space and Depth Normalization

1.1 Motivation

Single-image depth estimation for 3D hand pose is fundamentally dependent on the camera’s focal length ($f$) and image height ($H$), leading to overfitting when training and testing domains differ. V-HPOT mitigates this by mapping all predictions into a virtual camera parameterization with fixed focal length ($f_v$) and image height ($H_v$), decoupling depth estimation from the real camera’s parameters.

1.2 Formal Definition

Given an image-space (metric) depth $z_{\mathrm{img}}$ (in mm), the transformation to virtual-camera depth $z_v$ is:

$$z_v = z_{\mathrm{img}} \times \frac{f_v}{f} \times \frac{H}{H_v}$$

Alternatively, with $s = f/H$ and $s_v = f_v/H_v$:

$$z_v = z_{\mathrm{img}} \times \frac{s_v}{s}$$

This operation normalizes depths, providing invariance to the physical camera’s parameters.
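The normalization and its inverse can be sketched as a pair of small helper functions. The virtual camera values $f_v = 512$ and $H_v = 720$ come from the configuration given later; the real camera's $f$ and $H$ in the example are hypothetical.

```python
import numpy as np

def to_virtual_depth(z_img, f, H, f_v=512.0, H_v=720.0):
    """Map metric image-space depth (mm) into the fixed virtual camera space.

    f, H     : focal length and image height of the real camera.
    f_v, H_v : fixed virtual camera parameters.
    """
    return z_img * (f_v / f) * (H / H_v)

def to_metric_depth(z_v, f, H, f_v=512.0, H_v=720.0):
    """Invert the normalization at test time using the test camera intrinsics."""
    return z_v * (f / f_v) * (H_v / H)

# Example: a joint at 450 mm seen by a (hypothetical) camera with f=610, H=480.
z_v = to_virtual_depth(450.0, f=610.0, H=480.0)
assert np.isclose(to_metric_depth(z_v, f=610.0, H=480.0), 450.0)  # round-trip
```

Because the forward and inverse maps are exact reciprocals, any camera's depths land in the same virtual space during training and can be recovered in metric units at test time.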

1.3 Impact

Training and inference in this virtual space enables depth predictions that are camera-agnostic. At test time, one can recover metric depths by inverting the transformation using the test camera’s intrinsics. This normalization is core to V-HPOT’s domain transfer capabilities.

2. Self-Supervised Test-Time Optimization in Virtual Space

2.1 Principle

In deployment, annotated target-domain data is often unavailable. V-HPOT introduces a self-supervised fine-tuning procedure for the network backbone based on “3D consistency loss,” operating entirely within the virtual camera space.

2.2 Depth Augmentation

For an initial 3D pose estimate $\mathbf{P}^{\mathrm{init}} \in \mathbb{R}^{J \times 3}$, $n$ random scale factors $S_i \sim \mathcal{U}(1.0, 1.25)$ are sampled. Augmented poses are generated by scaling the $z_v$ coordinate:

$$\mathbf{P}^{\to S_i} = [x,\; y,\; S_i \cdot z_v]$$

2.3 Consistency Loss

The network head re-predicts each augmented pose, yielding $\widehat{\mathbf{P}^{\to S_i}}$. The 3D consistency loss is:

$$\mathcal{L}_{\mathrm{consistency}} = \sum_{i=1}^n \left\| S_i\,\mathbf{P}^{\mathrm{init}} - \widehat{\mathbf{P}^{\to S_i}} \right\|_1$$

Expanded over joints:

$$\mathcal{L}_{\mathrm{consistency}} = \sum_{i=1}^n \sum_{j=1}^J \left| S_i\,\mathbf{P}_j^{\mathrm{init}} - \widehat{\mathbf{P}_j^{\to S_i}} \right|$$

This formulation enables the network to correct for scale mismatches during inference, with no reliance on 3D ground truth.
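The augmentation and loss above can be sketched in a few lines. Here `predict_fn` is a hypothetical stand-in for the network heads' re-prediction; the `oracle` shows that a perfectly scale-consistent predictor drives the loss to zero, while an inconsistent one is penalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def consistency_loss(P_init, predict_fn, n=2):
    """Sketch of the 3D consistency loss (Secs. 2.2-2.3).

    P_init     : (J, 3) initial pose in virtual camera space.
    predict_fn : stand-in for the network's re-prediction of an augmented pose.
    """
    loss = 0.0
    for _ in range(n):
        S_i = rng.uniform(1.0, 1.25)                # random depth scale factor
        P_aug = P_init.copy()
        P_aug[:, 2] *= S_i                          # scale only the z_v column
        P_hat = predict_fn(P_aug)                   # network re-prediction
        loss += np.abs(S_i * P_init - P_hat).sum()  # L1 against scaled target
    return loss

# Toy pose: 21 joints, x/y in virtual image coordinates, positive virtual depths.
P = np.column_stack([rng.uniform(-50, 50, (21, 2)), rng.uniform(100, 500, 21)])

def oracle(p):
    # Hypothetical perfectly consistent predictor: recovers S_i from the
    # scaled depths and rescales x and y accordingly.
    S = p[:, 2] / P[:, 2]
    return np.column_stack([P[:, 0] * S, P[:, 1] * S, p[:, 2]])

assert np.isclose(consistency_loss(P, oracle), 0.0)
assert consistency_loss(P, lambda p: p) > 0.0  # inconsistent predictor penalized
```

Note that the target $S_i\,\mathbf{P}^{\mathrm{init}}$ scales all three coordinates, while the augmented input scales only depth, so minimizing the loss forces the network to keep x/y predictions consistent with the implied depth change.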

3. Network Structure and Source-Domain Training

3.1 Model Architecture

  • Backbone: EfficientNetV2-S (ImageNet-pretrained), producing a spatial feature map $F_M \in \mathbb{R}^{C \times 14 \times 14}$.
  • Upsamplers: Two (for left/right hands); each with four transposed convolutions (kernel 4×4, stride 2, padding 1), BatchNorm-ReLU activations, and a final 1×1 convolution producing heatmaps $H_{L,R} \in \mathbb{R}^{21 \times 112 \times 112}$.
  • 2D Keypoint Head: Argmax localization provides 2D keypoint coordinates $P^{2D} \in \mathbb{R}^{21 \times 2}$.
  • Depth Head: A shallow MLP predicts per-joint virtual depths $\hat z_v \in \mathbb{R}^{21}$.
  • Handedness Head: An MLP outputs left/right presence probabilities $h \in \mathbb{R}^2$.
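The argmax keypoint decoding can be illustrated with a small sketch matching the shapes above (21 joints, 112×112 heatmaps); the function name is illustrative, not from the paper.

```python
import numpy as np

def decode_keypoints(heatmaps):
    """Argmax localization: (21, 112, 112) heatmaps -> (21, 2) pixel coords (x, y)."""
    J, Hh, Wh = heatmaps.shape
    flat = heatmaps.reshape(J, -1).argmax(axis=1)   # per-joint flat argmax
    ys, xs = np.unravel_index(flat, (Hh, Wh))       # back to 2D indices
    return np.stack([xs, ys], axis=1)

# Synthetic check: joint 0 has its heatmap peak at (x=60, y=40).
hm = np.zeros((21, 112, 112))
hm[0, 40, 60] = 1.0
P2D = decode_keypoints(hm)
assert tuple(P2D[0]) == (60, 40)
```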

3.2 Loss Functions

Supervised source-domain training uses a joint objective:

$$\mathcal{L}_{\mathrm{train}} = \lambda_{xy}\,\mathcal{L}_{xy} + \lambda_{z_v}\,\mathcal{L}_{z_v} + \lambda_{d}\,\mathcal{L}_{d} + \lambda_{h}\,\mathcal{L}_{h}$$

Where:

  • $\mathcal{L}_{xy}$: 2D heatmap IoU loss.
  • $\mathcal{L}_{z_v}$: virtual depth L1 loss.
  • $\mathcal{L}_{d}$: pseudo-depth L1 loss (uses DPT-Hybrid estimator outputs).
  • $\mathcal{L}_{h}$: handedness cross-entropy.

Domain generalization is further promoted by virtual-camera depth, 2D/3D scaling and appearance augmentations, and an auxiliary pseudo-depth task.
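The joint objective is a straightforward weighted sum; the sketch below makes the combination explicit. The $\lambda$ values shown are hypothetical placeholders, since the paper's weights are not given here.

```python
def train_loss(L_xy, L_zv, L_d, L_h, lam=(1.0, 1.0, 1.0, 1.0)):
    """Weighted joint training objective; lam holds placeholder lambda weights."""
    lam_xy, lam_zv, lam_d, lam_h = lam
    return lam_xy * L_xy + lam_zv * L_zv + lam_d * L_d + lam_h * L_h

# Toy per-term losses, equal weights:
total = train_loss(0.2, 0.1, 0.05, 0.01)
assert abs(total - 0.36) < 1e-9
```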

4. Test-Time Adaptation Algorithm

4.1 Procedure

At inference in a new domain:

  1. Obtain the initial 3D pose estimate $\mathbf{P}^{\mathrm{init}}$.
  2. For $n$ randomly sampled $S_i \sim \mathcal{U}(1.0, 1.25)$, create depth-augmented poses.
  3. Run the augmented poses through the heads and compute $\mathcal{L}_{\mathrm{consistency}}$.
  4. Update only the backbone via backpropagation (heads frozen), using SGD.
  5. Repeat for the first $F_N$ frames (e.g., 5% of the test set); adaptation then ceases and normal inference resumes.

4.2 Reference Pseudocode

optimizer = SGD(M.backbone.parameters(), lr=0.3, momentum=0.2)
for t in range(1, T + 1):
    P_init = M(I_t)                       # initial 3D pose in virtual space
    L_cons = 0
    for i in range(n):                    # n = 2 in the reference configuration
        S_i = Uniform(1.0, 1.25)
        P_aug = [x, y, S_i * z_init]      # scale only the virtual depth
        P_hat = M.heads(P_aug)            # heads re-predict the augmented pose
        L_cons += L1(S_i * P_init, P_hat)
    if t <= F_N:                          # adapt only on the first F_N frames
        optimizer.zero_grad()
        L_cons.backward()                 # heads frozen; only backbone updates
        optimizer.step()
    # output M(I_t) with the adapted backbone

Limiting adaptation to a small fraction of frames reduces computational overhead (approximately 20–30 ms per adaptation step) and prevents drift on outliers.

5. Empirical Results and Comparative Analysis

5.1 Cross-Domain Benchmarks

Trained on the HOT3D dataset, V-HPOT demonstrates:

| Dataset | Baseline MPJPE | V-HPOT MPJPE | Reduction |
| --- | --- | --- | --- |
| H2O | 179.6 mm | 53.3 mm | –70.3 % |
| AssemblyHands | 297.7 mm | 174.5 mm | –41.4 % |

5.2 Ablations

| Method | H2O MPJPE | AssemblyHands MPJPE |
| --- | --- | --- |
| No VC, no TTO | 179.6 mm | 297.7 mm |
| VC only | 146.1 mm | 302.3 mm |
| TTO only | 209.6 mm | 261.8 mm |
| VC + TTO (V-HPOT) | 53.3 mm | 174.5 mm |

Notably, depth normalization and test-time optimization are synergistic; their combination yields the greatest improvements. The most effective consistency loss variant is $\mathcal{L}_{xyz}^{n=2}$ (combined 2D and depth terms, with two augmentations).

5.3 Observational Outcomes

  • Post-adaptation, predicted wrist and finger depths align closely with ground truth.
  • Robustness degrades for extremely close hand poses or under severe lens distortion.

6. Operational Constraints and Prospective Advances

6.1 Limitations

  • Inputs with very small $z_v$ (hands close to the camera) are out-of-distribution and error-prone.
  • Monochrome or highly distorted target domains (e.g., AssemblyHands) may necessitate domain-specific augmentations.
  • TTO adds 20–30 ms per adaptation step; restricting adaptation to 5% of test frames keeps this practical.

6.2 Configurable Parameters

  • Virtual camera: $H_v = 720$, $f_v = 512$.
  • Source training: SGD with learning rate 0.1 (halved every 5 epochs; 50 epochs total).
  • TTO: SGD with learning rate 0.3, momentum 0.2, $n = 2$ augmentations, adaptation window of 5% of the test data.
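The parameters above can be collected into a single configuration object; this is an illustrative sketch with field names of my choosing, not an interface from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VHPOTConfig:
    """Configuration values from Sec. 6.2; field names are illustrative."""
    H_v: int = 720                 # virtual image height
    f_v: int = 512                 # virtual focal length
    train_lr: float = 0.1          # source-domain SGD lr, halved every 5 epochs
    train_epochs: int = 50
    tto_lr: float = 0.3            # test-time optimization SGD lr
    tto_momentum: float = 0.2
    n_aug: int = 2                 # depth augmentations per frame
    adapt_fraction: float = 0.05   # fraction of test frames used for adaptation

cfg = VHPOTConfig()
assert cfg.H_v == 720 and cfg.tto_lr == 0.3
```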

6.3 Future Directions

  • Incorporating distortion-aware virtual camera mapping.
  • Meta-learning strategies to further expedite TTO.
  • Exploiting temporal consistency across sliding windows for smoother pose estimation.

By unifying depth normalization with self-supervised adaptation in virtual camera space, V-HPOT delivers considerable advances in cross-domain hand pose estimation efficiency and generalizability, achieving state-of-the-art results without labeled data from the target domain (Mucha et al., 10 Jan 2026).
