V-HPOT: Virtual Hand Pose Optimisation
- V-HPOT is a framework for egocentric 3D hand pose estimation that combines virtual-camera depth normalization with self-supervised test-time optimization.
- It reduces domain shift by transforming real camera parameters into a fixed virtual space, achieving up to 70.3% reduction in mean per-joint position error.
- Its self-supervised re-optimization strategy adapts the model during inference with minimal computational overhead, ensuring high generalizability without target labels.
V-HPOT (“Virtual Camera-based Hand Pose Optimisation at Test-time”) is a framework for egocentric 3D hand pose estimation that achieves high cross-domain generalization without requiring labeled target-domain data. By introducing camera-agnostic depth normalization and a self-supervised test-time optimization strategy leveraging consistency in a virtual camera space, V-HPOT addresses domain shift caused by variation in camera intrinsics and scene conditions. The framework demonstrates pronounced improvements in mean per-joint position error (MPJPE) relative to previous art, establishing new baselines for generalizability and data efficiency in the egocentric hand pose estimation domain (Mucha et al., 10 Jan 2026).
1. Virtual Camera Space and Depth Normalization
1.1 Motivation
Single-image depth estimation for 3D hand pose depends fundamentally on the camera's focal length $f$ and image height $h$, leading to overfitting when training and testing domains differ. V-HPOT mitigates this by mapping all predictions into a virtual camera parameterization with fixed focal length $f_v$ and image height $h_v$, decoupling depth estimation from the real camera's parameters.
1.2 Formal Definition
Given an image-space (metric) depth $z$ (in mm), the transformation to virtual-camera depth $z_v$ is:

$$z_v = z \cdot \frac{f_v}{f} \cdot \frac{h}{h_v}$$

Alternatively, with $r_f = f_v / f$ and $r_h = h / h_v$:

$$z_v = z \cdot r_f \cdot r_h$$
This operation normalizes depths, providing invariance to the physical camera’s parameters.
1.3 Impact
Training and inference in this virtual space enables depth predictions that are camera-agnostic. At test time, one can recover metric depths by inverting the transformation using the test camera’s intrinsics. This normalization is core to V-HPOT’s domain transfer capabilities.
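As a minimal sketch of this round trip, assuming the normalization is the multiplicative rescaling by the focal-length and image-height ratios described above (the camera values below are illustrative placeholders, not parameters from the paper):

```python
def to_virtual_depth(z_mm, f, h, f_v, h_v):
    """Map a metric depth (mm) seen by a real camera (focal f, image height h)
    into the fixed virtual camera (f_v, h_v). Assumed multiplicative form."""
    return z_mm * (f_v / f) * (h / h_v)

def to_metric_depth(z_virtual, f, h, f_v, h_v):
    """Invert the normalization at test time using the test camera's intrinsics."""
    return z_virtual * (f / f_v) * (h_v / h)

# Round trip: a 500 mm hand depth survives the virtual-space detour exactly.
cam = dict(f=615.0, h=480, f_v=500.0, h_v=256)
z = 500.0
z_v = to_virtual_depth(z, **cam)
z_back = to_metric_depth(z_v, **cam)
```

Because both directions are pure rescalings, the network only ever sees depths expressed against one fixed virtual camera, regardless of the capture hardware.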
2. Self-Supervised Test-Time Optimization in Virtual Space
2.1 Principle
In deployment, annotated target-domain data is often unavailable. V-HPOT introduces a self-supervised fine-tuning procedure for the network backbone based on “3D consistency loss,” operating entirely within the virtual camera space.
2.2 Depth Augmentation
For an initial 3D pose estimate $P^{\text{init}} = (x, y, z^{\text{init}})$ per joint, random scale factors $S_i \sim \mathcal{U}(1.0, 1.25)$ are sampled. Augmented poses are generated by scaling the depth coordinate:

$$P_i^{\text{aug}} = (x, \; y, \; S_i \cdot z^{\text{init}})$$
2.3 Consistency Loss
The network head re-predicts each augmented pose, yielding $\hat{P}_i$. The 3D consistency loss is:

$$\mathcal{L}_{\text{cons}} = \sum_{i=1}^{n} \left\| S_i \cdot P^{\text{init}} - \hat{P}_i \right\|_1$$

Expanded over joints $j = 1, \dots, J$:

$$\mathcal{L}_{\text{cons}} = \sum_{i=1}^{n} \sum_{j=1}^{J} \left| S_i \cdot P_j^{\text{init}} - \hat{P}_{i,j} \right|$$
This formulation enables the network to correct for scale mismatches during inference, with no reliance on 3D ground truth.
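A minimal sketch of this loss, using plain Python lists in place of tensors (the joint coordinates below are made-up values for illustration):

```python
def l1(a, b):
    """Summed absolute difference over joints and coordinates."""
    return sum(abs(ca - cb)
               for ja, jb in zip(a, b)
               for ca, cb in zip(ja, jb))

def consistency_loss(p_init, p_hats, scales):
    """Sum of L1 distances between each scaled initial pose S_i * P_init
    and the network's re-prediction P_hat_i."""
    loss = 0.0
    for s, p_hat in zip(scales, p_hats):
        target = [[s * c for c in joint] for joint in p_init]
        loss += l1(target, p_hat)
    return loss

p_init = [[10.0, 20.0, 500.0], [12.0, 22.0, 510.0]]  # two joints (x, y, z)
scales = [1.0, 1.25]
# A perfectly scale-consistent network re-predicts exactly S_i * P_init,
# so the loss vanishes; any scale mismatch produces a positive penalty.
p_hats = [[[s * c for c in j] for j in p_init] for s in scales]
loss = consistency_loss(p_init, p_hats, scales)
```

The signal is entirely self-supervised: the "label" for each augmentation is the scaled version of the model's own initial prediction.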
3. Network Structure and Source-Domain Training
3.1 Model Architecture
- Backbone: EfficientNetV2-S (ImageNet-pretrained), producing a spatial feature map.
- Upsamplers: two (for left/right hands), each with four transposed convolutions (kernel 4×4, stride 2, padding 1) with BatchNorm-ReLU, followed by a final 1×1 convolution producing per-joint heatmaps.
- 2D Keypoint Head: argmax localization over the heatmaps yields 2D keypoint coordinates.
- Depth Head: a shallow MLP predicts per-joint virtual depths.
- Handedness Head: an MLP outputs left/right presence probabilities.
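The kernel 4×4 / stride 2 / padding 1 configuration is the standard choice that exactly doubles spatial resolution per stage, so four stacked transposed convolutions give a 16× upsampling of the backbone feature map. A quick check of the arithmetic (the starting feature-map side of 8 is a hypothetical value, not from the paper):

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    # Standard transposed-convolution output-size formula:
    # out = (in - 1) * stride - 2 * padding + kernel
    return (size - 1) * stride - 2 * padding + kernel

side = 8  # hypothetical backbone feature-map side length
for _ in range(4):
    side = tconv_out(side)  # each stage doubles the side: 8 -> 16 -> 32 -> 64 -> 128
```

With these hyperparameters each stage maps an `n × n` map to `2n × 2n`, which is why this (4, 2, 1) combination recurs in heatmap decoders.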
3.2 Loss Functions
Supervised source-domain training uses a joint objective combining four terms:

$$\mathcal{L} = \mathcal{L}_{\text{2D}} + \mathcal{L}_{\text{depth}} + \mathcal{L}_{\text{pseudo}} + \mathcal{L}_{\text{hand}}$$

where:
- $\mathcal{L}_{\text{2D}}$: 2D heatmap IoU loss.
- $\mathcal{L}_{\text{depth}}$: virtual-depth L1 loss.
- $\mathcal{L}_{\text{pseudo}}$: pseudo-depth L1 loss (supervised by DPT-Hybrid estimator outputs).
- $\mathcal{L}_{\text{hand}}$: handedness cross-entropy.
Domain generalization is further promoted by virtual-camera depth, 2D/3D scaling and appearance augmentations, and an auxiliary pseudo-depth task.
4. Test-Time Adaptation Algorithm
4.1 Procedure
At inference in a new domain:
- Obtain an initial 3D pose estimate $P^{\text{init}}$.
- For randomly sampled scale factors $S_i \sim \mathcal{U}(1.0, 1.25)$, create depth-augmented poses.
- Run the augmented poses through the heads and compute $\mathcal{L}_{\text{cons}}$.
- Update only the backbone via backpropagation (heads frozen), using SGD.
- Repeat for the first $F_N$ frames of the test set; adaptation then ceases and normal inference resumes.
4.2 Reference Pseudocode
```
optimizer = SGD(M.backbone, lr=0.3, momentum=0.2)
for t in 1…T:
    P_init = M(I_t)
    L_cons = 0
    for i in 1…n:
        S_i = Uniform(1.0, 1.25)
        P_aug = [x, y, S_i * z_init]
        P_hat = M.heads(M.backbone_features(P_aug_features))
        L_cons += L1(S_i * P_init, P_hat)
    if t ≤ F_N:
        optimizer.zero_grad()
        L_cons.backward()
        optimizer.step()
    # finally output M(I_t) with adapted backbone
```
Limiting adaptation to a small fraction of frames reduces computational overhead (approximately 20–30 ms per adaptation step) and prevents drift on outliers.
5. Empirical Results and Comparative Analysis
5.1 Cross-Domain Benchmarks
Trained on the HOT3D dataset, V-HPOT demonstrates:
| Dataset | Baseline MPJPE | V-HPOT MPJPE | Reduction |
|---|---|---|---|
| H2O | 179.6 mm | 53.3 mm | –70.3 % |
| AssemblyHands | 297.7 mm | 174.5 mm | –41.4 % |
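The reduction percentages in the table follow directly from the MPJPE columns; a quick check of that arithmetic:

```python
def mpjpe_reduction(baseline_mm, ours_mm):
    """Percentage reduction in MPJPE relative to the baseline."""
    return 100.0 * (baseline_mm - ours_mm) / baseline_mm

# H2O:            179.6 mm -> 53.3 mm
# AssemblyHands:  297.7 mm -> 174.5 mm
h2o = mpjpe_reduction(179.6, 53.3)
asm = mpjpe_reduction(297.7, 174.5)
```

Rounding to one decimal recovers the 70.3 % and 41.4 % figures reported above.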
5.2 Ablations
| Method | H2O MPJPE | AssemblyHands MPJPE |
|---|---|---|
| No VC, no TTO | 179.6 mm | 297.7 mm |
| VC only | 146.1 mm | 302.3 mm |
| TTO only | 209.6 mm | 261.8 mm |
| VC + TTO (V-HPOT) | 53.3 mm | 174.5 mm |
Notably, depth normalization and test-time optimization are synergistic; their combination yields the greatest improvement. The most effective consistency-loss variant combines 2D and depth terms and uses two augmentations ($n = 2$).
5.3 Observational Outcomes
- Post-adaptation, predicted wrist and finger depths align closely with ground truth.
- Robustness degrades for extremely close hand poses or under severe lens distortion.
6. Operational Constraints and Prospective Advances
6.1 Limitations
- Inputs with very small depth $z$ (hands close to the camera) are out-of-distribution and error-prone.
- Monochrome or highly distorted target domains (e.g., AssemblyHands) may necessitate domain-specific augmentations.
- TTO adds 20–30 ms per adaptation step; restricting adaptation to the first $F_N$ test frames keeps the overhead practical.
6.2 Configurable Parameters
- Virtual camera: fixed focal length $f_v$ and image height $h_v$.
- Source training: SGD with learning rate 0.1, halved every 5 epochs (50 epochs total).
- TTO: SGD with learning rate 0.3, momentum 0.2, $n$ depth augmentations, adaptation window of the first $F_N$ frames of the test data.
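The source-training schedule above (learning rate 0.1, halved every 5 epochs over 50 epochs) is a plain step schedule; a minimal sketch:

```python
def source_lr(epoch, base=0.1, halve_every=5):
    """Step schedule: halve the learning rate every `halve_every` epochs.
    Matches the stated source-training setup (0.1, halved each 5 epochs)."""
    return base * 0.5 ** (epoch // halve_every)

# Epochs 0-4 train at 0.1, epochs 5-9 at 0.05, ..., epochs 45-49 at 0.1 / 2**9.
lrs = [source_lr(e) for e in range(50)]
```

By the final 5-epoch block the rate has been halved nine times, ending roughly four orders of magnitude below the initial value.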
6.3 Future Directions
- Incorporating distortion-aware virtual camera mapping.
- Meta-learning strategies to further expedite TTO.
- Exploiting temporal consistency across sliding windows for smoother pose estimation.
By unifying depth normalization with self-supervised adaptation in virtual camera space, V-HPOT delivers considerable advances in cross-domain hand pose estimation efficiency and generalizability, achieving state-of-the-art results without labeled data from the target domain (Mucha et al., 10 Jan 2026).