V-HPOT: Virtual Hand Pose Optimisation
- V-HPOT is a framework for egocentric 3D hand pose estimation that combines virtual-camera depth normalization with self-supervised test-time optimization.
- It reduces domain shift by transforming real camera parameters into a fixed virtual space, achieving up to 70.3% reduction in mean per-joint position error.
- Its self-supervised re-optimization strategy adapts the model during inference with minimal computational overhead, ensuring high generalizability without target labels.
V-HPOT (“Virtual Camera-based Hand Pose Optimisation at Test-time”) is a framework for egocentric 3D hand pose estimation that achieves high cross-domain generalization without requiring labeled target-domain data. By introducing camera-agnostic depth normalization and a self-supervised test-time optimization strategy leveraging consistency in a virtual camera space, V-HPOT addresses domain shift caused by variation in camera intrinsics and scene conditions. The framework demonstrates pronounced improvements in mean per-joint position error (MPJPE) relative to previous art, establishing new baselines for generalizability and data efficiency in the egocentric hand pose estimation domain (Mucha et al., 10 Jan 2026).
1. Virtual Camera Space and Depth Normalization
1.1 Motivation
Single-image depth estimation for 3D hand pose depends fundamentally on the camera's focal length $f$ and image height $h$, leading to overfitting when training and testing domains differ. V-HPOT mitigates this by mapping all predictions into a virtual camera parameterization with fixed focal length $f_v$ and image height $h_v$, decoupling depth estimation from the real camera's parameters.
1.2 Formal Definition
Given an image-space (metric) depth $z$ (in mm), the transformation to virtual-camera depth $z_v$ is:

$$z_v = z \cdot \frac{f_v}{f} \cdot \frac{h}{h_v}$$

Alternatively, with $r_f = f_v / f$ and $r_h = h / h_v$:

$$z_v = z \cdot r_f \cdot r_h$$
This operation normalizes depths, providing invariance to the physical camera’s parameters.
1.3 Impact
Training and inference in this virtual space enables depth predictions that are camera-agnostic. At test time, one can recover metric depths by inverting the transformation using the test camera’s intrinsics. This normalization is core to V-HPOT’s domain transfer capabilities.
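As a minimal sketch of this round trip, assuming the normalization is the multiplicative rescaling by the focal-length and image-height ratios described above (the camera values below are illustrative placeholders, not parameters from the paper):

```python
def to_virtual_depth(z_mm, f, h, f_v, h_v):
    """Map a metric depth (mm) seen by a real camera (focal f, image height h)
    into the fixed virtual camera (f_v, h_v). Assumed multiplicative form."""
    return z_mm * (f_v / f) * (h / h_v)

def to_metric_depth(z_virtual, f, h, f_v, h_v):
    """Invert the normalization at test time using the test camera's intrinsics."""
    return z_virtual * (f / f_v) * (h_v / h)

# Round trip: a 500 mm hand depth survives the virtual-space detour exactly.
cam = dict(f=615.0, h=480, f_v=500.0, h_v=256)
z = 500.0
z_v = to_virtual_depth(z, **cam)
z_back = to_metric_depth(z_v, **cam)
```

Because both directions are pure rescalings, the network only ever sees depths expressed against one fixed virtual camera, regardless of the capture hardware.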
2. Self-Supervised Test-Time Optimization in Virtual Space
2.1 Principle
In deployment, annotated target-domain data is often unavailable. V-HPOT introduces a self-supervised fine-tuning procedure for the network backbone based on “3D consistency loss,” operating entirely within the virtual camera space.
2.2 Depth Augmentation
For an initial 3D pose estimate $P^{\text{init}} = (x, y, z^{\text{init}})$ per joint, random scale factors $S_i \sim \mathcal{U}(1.0, 1.25)$ are sampled. Augmented poses are generated by scaling the depth coordinate:

$$P_i^{\text{aug}} = (x, \; y, \; S_i \cdot z^{\text{init}})$$
2.3 Consistency Loss
The network head re-predicts each augmented pose, yielding $\hat{P}_i$. The 3D consistency loss is:

$$\mathcal{L}_{\text{cons}} = \sum_{i=1}^{n} \left\| S_i \cdot P^{\text{init}} - \hat{P}_i \right\|_1$$

Expanded over joints $j = 1, \dots, J$:

$$\mathcal{L}_{\text{cons}} = \sum_{i=1}^{n} \sum_{j=1}^{J} \left| S_i \cdot P_j^{\text{init}} - \hat{P}_{i,j} \right|$$
This formulation enables the network to correct for scale mismatches during inference, with no reliance on 3D ground truth.
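A minimal sketch of this loss, using plain Python lists in place of tensors (the joint coordinates below are made-up values for illustration):

```python
def l1(a, b):
    """Summed absolute difference over joints and coordinates."""
    return sum(abs(ca - cb)
               for ja, jb in zip(a, b)
               for ca, cb in zip(ja, jb))

def consistency_loss(p_init, p_hats, scales):
    """Sum of L1 distances between each scaled initial pose S_i * P_init
    and the network's re-prediction P_hat_i."""
    loss = 0.0
    for s, p_hat in zip(scales, p_hats):
        target = [[s * c for c in joint] for joint in p_init]
        loss += l1(target, p_hat)
    return loss

p_init = [[10.0, 20.0, 500.0], [12.0, 22.0, 510.0]]  # two joints (x, y, z)
scales = [1.0, 1.25]
# A perfectly scale-consistent network re-predicts exactly S_i * P_init,
# so the loss vanishes; any scale mismatch produces a positive penalty.
p_hats = [[[s * c for c in j] for j in p_init] for s in scales]
loss = consistency_loss(p_init, p_hats, scales)
```

The signal is entirely self-supervised: the "label" for each augmentation is the scaled version of the model's own initial prediction.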
3. Network Structure and Source-Domain Training
3.1 Model Architecture
- Backbone: EfficientNetV2-S (ImageNet-pretrained), producing a spatial feature map.
- Upsamplers: two (for left/right hands), each with four transposed convolutions (kernel 4×4, stride 2, padding 1) with BatchNorm-ReLU, followed by a final 1×1 convolution producing per-joint heatmaps.
- 2D Keypoint Head: argmax localization over the heatmaps yields 2D keypoint coordinates.
- Depth Head: a shallow MLP predicts per-joint virtual depths.
- Handedness Head: an MLP outputs left/right presence probabilities.
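The kernel 4×4 / stride 2 / padding 1 configuration is the standard choice that exactly doubles spatial resolution per stage, so four stacked transposed convolutions give a 16× upsampling of the backbone feature map. A quick check of the arithmetic (the starting feature-map side of 8 is a hypothetical value, not from the paper):

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    # Standard transposed-convolution output-size formula:
    # out = (in - 1) * stride - 2 * padding + kernel
    return (size - 1) * stride - 2 * padding + kernel

side = 8  # hypothetical backbone feature-map side length
for _ in range(4):
    side = tconv_out(side)  # each stage doubles the side: 8 -> 16 -> 32 -> 64 -> 128
```

With these hyperparameters each stage maps an `n × n` map to `2n × 2n`, which is why this (4, 2, 1) combination recurs in heatmap decoders.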
3.2 Loss Functions
Supervised source-domain training uses a joint objective combining four terms:

$$\mathcal{L} = \mathcal{L}_{\text{2D}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{pseudo}} + \mathcal{L}_{\text{hand}}$$

where:
- $\mathcal{L}_{\text{2D}}$: 2D heatmap IoU loss.
- $\mathcal{L}_{\text{depth}}$: virtual-depth L1 loss.
- $\mathcal{L}_{\text{pseudo}}$: pseudo-depth L1 loss (supervised by DPT-Hybrid estimator outputs).
- $\mathcal{L}_{\text{hand}}$: handedness cross-entropy.
Domain generalization is further promoted by virtual-camera depth, 2D/3D scaling and appearance augmentations, and an auxiliary pseudo-depth task.
4. Test-Time Adaptation Algorithm
4.1 Procedure
At inference in a new domain:
- Obtain an initial 3D pose estimate $P^{\text{init}}$.
- For randomly sampled scale factors $S_i \sim \mathcal{U}(1.0, 1.25)$, create depth-augmented poses.
- Run the augmented poses through the heads and compute $\mathcal{L}_{\text{cons}}$.
- Update only the backbone via backpropagation (heads frozen), using SGD.
- Repeat for the first $F_N$ frames of the test set; adaptation then ceases and normal inference resumes.
4.2 Reference Pseudocode
```
optimizer = SGD(M.backbone, lr=0.3, momentum=0.2)
for t in 1…T:
    P_init = M(I_t)
    L_cons = 0
    for i in 1…n:
        S_i = Uniform(1.0, 1.25)
        P_aug = [x, y, S_i * z_init]
        P_hat = M.heads(M.backbone_features(P_aug_features))
        L_cons += L1(S_i * P_init, P_hat)
    if t ≤ F_N:
        optimizer.zero_grad()
        L_cons.backward()
        optimizer.step()
    # finally output M(I_t) with adapted backbone
```
Limiting adaptation to a small fraction of frames reduces computational overhead (approximately 20–30 ms per adaptation step) and prevents drift on outliers.
5. Empirical Results and Comparative Analysis
5.1 Cross-Domain Benchmarks
Trained on the HOT3D dataset, V-HPOT demonstrates:
| Dataset | Baseline MPJPE | V-HPOT MPJPE | Reduction |
|---|---|---|---|
| H2O | 179.6 mm | 53.3 mm | –70.3 % |
| AssemblyHands | 297.7 mm | 174.5 mm | –41.4 % |
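The reduction percentages in the table follow directly from the MPJPE columns; a quick check of that arithmetic:

```python
def mpjpe_reduction(baseline_mm, ours_mm):
    """Percentage reduction in MPJPE relative to the baseline."""
    return 100.0 * (baseline_mm - ours_mm) / baseline_mm

# H2O:            179.6 mm -> 53.3 mm
# AssemblyHands:  297.7 mm -> 174.5 mm
h2o = mpjpe_reduction(179.6, 53.3)
asm = mpjpe_reduction(297.7, 174.5)
```

Rounding to one decimal recovers the 70.3 % and 41.4 % figures reported above.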
5.2 Ablations
| Method | H2O MPJPE | AssemblyHands MPJPE |
|---|---|---|
| No VC, no TTO | 179.6 mm | 297.7 mm |
| VC only | 146.1 mm | 302.3 mm |
| TTO only | 209.6 mm | 261.8 mm |
| VC + TTO (V-HPOT) | 53.3 mm | 174.5 mm |
Notably, depth normalization and test-time optimization are synergistic; their combination yields the greatest improvement. The most effective consistency-loss variant combines 2D and depth terms and uses two augmentations ($n = 2$).
5.3 Observational Outcomes
- Post-adaptation, predicted wrist and finger depths align closely with ground truth.
- Robustness degrades for extremely close hand poses or under severe lens distortion.
6. Operational Constraints and Prospective Advances
6.1 Limitations
- Inputs with very small depth $z$ (hands close to the camera) are out-of-distribution and error-prone.
- Monochrome or highly distorted target domains (e.g., AssemblyHands) may necessitate domain-specific augmentations.
- TTO adds 20–30 ms per adaptation step; restricting adaptation to the first $F_N$ test frames keeps the overhead practical.
6.2 Configurable Parameters
- Virtual camera: fixed focal length $f_v$ and image height $h_v$.
- Source training: SGD with learning rate 0.1, halved every 5 epochs (50 epochs total).
- TTO: SGD with learning rate 0.3, momentum 0.2, $n$ depth augmentations, adaptation window of the first $F_N$ frames of the test data.
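The source-training schedule above (learning rate 0.1, halved every 5 epochs over 50 epochs) is a plain step schedule; a minimal sketch:

```python
def source_lr(epoch, base=0.1, halve_every=5):
    """Step schedule: halve the learning rate every `halve_every` epochs.
    Matches the stated source-training setup (0.1, halved each 5 epochs)."""
    return base * 0.5 ** (epoch // halve_every)

# Epochs 0-4 train at 0.1, epochs 5-9 at 0.05, ..., epochs 45-49 at 0.1 / 2**9.
lrs = [source_lr(e) for e in range(50)]
```

By the final 5-epoch block the rate has been halved nine times, ending roughly four orders of magnitude below the initial value.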
6.3 Future Directions
- Incorporating distortion-aware virtual camera mapping.
- Meta-learning strategies to further expedite TTO.
- Exploiting temporal consistency across sliding windows for smoother pose estimation.
By unifying depth normalization with self-supervised adaptation in virtual camera space, V-HPOT delivers considerable advances in cross-domain hand pose estimation efficiency and generalizability, achieving state-of-the-art results without labeled data from the target domain (Mucha et al., 10 Jan 2026).