HandMCM: Correspondence Mamba for Hand Pose
- The paper introduces HandMCM, a novel state-space model that fuses multi-modal inputs for accurate 3D hand keypoint localization.
- It employs techniques like bidirectional gated state-space modeling and local token injection to enhance precision under occlusion.
- Empirical evaluations on NYU, DexYCB, and HO3D datasets demonstrate significant improvements over existing methods in keypoint error reduction.
HandMCM is a multi-modal, point cloud–based correspondence state-space model for 3D hand pose estimation, designed for robust and accurate 3D keypoint localization under occlusion and in challenging scene conditions. It adapts the Mamba state-space model (SSM) architecture, introduces dynamic correspondence modeling, and fuses multi-modal features (RGB, depth, point cloud) to handle the complexity of hand articulation and the limited keypoint visibility typical of human-computer interaction tasks (Cheng et al., 2 Feb 2026).
1. Architectural Overview and Model Formulation
HandMCM processes three input modalities: a single-hand RGB image $R$, a depth image $D$, and a point cloud $P$ (sampled from the depth map). It constructs a fused feature representation using a multi-modal super point encoder:
- 3D downsampling: PointNet++-style set convolution reduces $P$ to super points $P'$, extracting geometric features $F_p$.
- 2D encoding: ResNet-based autoencoders generate $F_d$ from $D$ and $F_{rgb}$ from $R$. Each is projected onto $P'$ (via nearest-neighbor/barycentric interpolation), yielding $F_{d \to p}$ and $F_{rgb \to p}$.
- Feature fusion: All modalities are concatenated into $F = [F_p, F_{d \to p}, F_{rgb \to p}]$.
Keypoint token extraction is achieved by aggregating $F$ into a global context vector $G$, followed by a bias-induced layer (BIL) that instantiates $J$ learnable keypoint tokens $X_0$ (each with a per-keypoint bias). An initial 3D keypoint estimate $J_0$ is obtained through a linear regression head.
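The token initialization described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: max-pooling as the global aggregation, a shared projection plus one bias row per keypoint as the form of the BIL, and all sizes and weight names (`W_tok`, `B_tok`, `W_r`) are hypothetical.

```python
import numpy as np

def init_keypoint_tokens(F, W_tok, B_tok, W_r):
    """F: (N, C) fused super point features. Returns G, X0, J0."""
    # Global context vector: max-pool over super points (assumed aggregation).
    G = F.max(axis=0)                      # (C,)
    # Assumed BIL form: one shared projection of G plus a bias per keypoint,
    # so every token starts from the same context but a distinct offset.
    X0 = G @ W_tok + B_tok                 # (J, C_tok) via broadcasting
    # Linear regression head giving the initial 3D keypoint estimate.
    J0 = X0 @ W_r                          # (J, 3)
    return G, X0, J0

rng = np.random.default_rng(0)
N, C, C_tok, J = 128, 64, 32, 21           # illustrative sizes only
F = rng.standard_normal((N, C))
W_tok = 0.1 * rng.standard_normal((C, C_tok))
B_tok = 0.1 * rng.standard_normal((J, C_tok))
W_r = 0.1 * rng.standard_normal((C_tok, 3))

G, X0, J0 = init_keypoint_tokens(F, W_tok, B_tok, W_r)
print(G.shape, X0.shape, J0.shape)
```

In this reading, two tokens differ only through their bias rows, which is what lets a single global vector seed a distinct token for each keypoint.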
The core processing occurs in stacked “correspondence Mamba” blocks, each refining the tokens via SSM and correspondence modeling. Each block applies a bidirectional gated SSM (BiGS), outputting a dynamically learned correspondence map that models spatial relationships and interactions across keypoints.
2. Dynamic Correspondence Modeling via Bidirectional State-Space Models
HandMCM reframes the traditional keypoint association problem by treating the ordered keypoint tokens as a 1D sequence (as opposed to a fixed graphical topology). Within each block $k$:
- Forward/Backward streams: A normalized input $X$ is projected via learned matrices. GELU activations and two separate projections yield $X_f$ (forward) and $X_b$ (backward), which are passed through dedicated SSM layers to produce $U_f$ and $U_b$.
- Correspondence map: The outer product of $U_f$ and the re-reversed $U_b$ is linearly projected to form the correspondence map $M_{corr}$, which controls spatial mixing among tokens.
- Token update: Value projections $V$ are mixed via the correspondence map to yield $\bar{X}$ (updated keypoint tokens), which are projected to 3D coordinates $J_k$.
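The three steps above can be sketched as follows. This is a toy illustration under loud assumptions: a fixed diagonal linear recurrence (`diag_ssm`) stands in for Mamba's selective SSM, the "outer product" is read as a per-token-pair elementwise product reduced to a scalar by a hypothetical vector `w_c`, and all parameter names and sizes are invented for the example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def diag_ssm(x, a, b, c):
    """Toy diagonal SSM scanned over tokens: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def correspondence_block(X_prev, p):
    X = layer_norm(X_prev)
    V = gelu(X @ p["W_v"])                        # value projection
    X_f = gelu(X @ p["W_f"])                      # forward stream
    X_b = gelu(X[::-1] @ p["W_b"])                # backward stream (reversed order)
    U_f = diag_ssm(X_f, *p["ssm_f"]) @ p["W_uf"]
    U_b = (diag_ssm(X_b, *p["ssm_b"]) @ p["W_ub"])[::-1]   # back to original order
    pair = U_f[:, None, :] * U_b[None, :, :]      # (J, J, C) pairwise interactions
    M_corr = pair @ p["w_c"]                      # (J, J) correspondence map
    X_bar = M_corr @ V                            # mix value tokens across keypoints
    return X_bar, M_corr

rng = np.random.default_rng(0)
J, C = 21, 16                                     # illustrative sizes only

def mat():
    return rng.standard_normal((C, C)) / np.sqrt(C)

def ssm():
    return (rng.uniform(0.5, 0.9, C), rng.standard_normal(C), rng.standard_normal(C))

params = {"W_v": mat(), "W_f": mat(), "W_b": mat(), "W_uf": mat(), "W_ub": mat(),
          "ssm_f": ssm(), "ssm_b": ssm(), "w_c": rng.standard_normal(C) / C}
X_prev = rng.standard_normal((J, C))
X_bar, M_corr = correspondence_block(X_prev, params)
print(X_bar.shape, M_corr.shape)
```

Because `M_corr` is computed from the tokens themselves, the mixing weights between keypoints change with the input, in contrast to a fixed skeleton adjacency matrix.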
Local geometrical information is incorporated through a “local injection and filtering” mechanism. For each predicted keypoint, a local context is constructed from its nearest super points, concatenated with the keypoint’s global token and local geometric feature, and processed by a lightweight set-conv network. Local tokens are injected multiplicatively and combined through a learned gating mechanism, ensuring that each keypoint estimate is anchored to local evidence.
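A rough NumPy sketch of this local injection and filtering step is given here. It is one possible reading rather than the paper's exact layer: the neighbourhood size `k`, the shared linear layer with ReLU and max-pooling that stands in for the set-conv network, and the weight names `W_loc` and `W_r` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def knn_indices(queries, points, k):
    """Indices of the k nearest points for every query, shape (Q, k)."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

def local_inject_and_filter(joints, X_bar, P_prime, F, W_loc, W_r, k=8):
    idx = knn_indices(joints, P_prime, k)                  # (J, k)
    offsets = P_prime[idx] - joints[:, None, :]            # local geometry, (J, k, 3)
    tokens = np.repeat(X_bar[:, None, :], k, axis=1)       # global token per neighbour
    ctx = np.concatenate([offsets, F[idx], tokens], axis=-1)
    # Stand-in for the lightweight set-conv: shared linear + ReLU + max-pool.
    X_loc = np.maximum(ctx @ W_loc, 0.0).max(axis=1)       # (J, C)
    X_inj = layer_norm(X_bar * X_loc)                      # multiplicative injection
    gate = sigmoid(X_loc)                                  # per-channel filter
    fused = gate * X_inj + (1.0 - gate) * X_loc
    return fused @ W_r                                     # (J, 3) keypoint estimate

rng = np.random.default_rng(0)
N, J, C, C_f = 128, 21, 16, 24                             # illustrative sizes only
P_prime = rng.uniform(-1.0, 1.0, (N, 3))
F = rng.standard_normal((N, C_f))
joints = rng.uniform(-1.0, 1.0, (J, 3))
joints[0] = P_prime[5]                                     # keypoint sitting on a super point
X_bar = rng.standard_normal((J, C))
W_loc = rng.standard_normal((3 + C_f + C, C)) / np.sqrt(3 + C_f + C)
W_r = 0.1 * rng.standard_normal((C, 3))

J_k = local_inject_and_filter(joints, X_bar, P_prime, F, W_loc, W_r)
print(J_k.shape)
```

The gate lets each channel decide how much to trust the globally mixed token versus the purely local evidence, which is the sense in which the estimate is anchored to nearby geometry.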
3. Multi-Modal Feature Integration and Projection
Comprehensive fusion of 2D and 3D cues is pivotal for occlusion robustness:
- Point cloud super point features ($F_p$) are obtained by downsampling $P$.
- Depth ($F_d$) and RGB ($F_{rgb}$) features are derived from independent ResNet autoencoders.
- 2D features are aligned with 3D space by projecting them onto the super points $P'$.
- The concatenated final feature per super point combines spatial, depth, and appearance cues, yielding a highly expressive representation for the SSM-driven keypoint regression.
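The alignment and concatenation steps listed above can be illustrated with a short NumPy sketch. It assumes a pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), feature maps at image resolution, and nearest-neighbor lookup only (the barycentric variant is omitted); none of these details are confirmed by the source.

```python
import numpy as np

def project_pinhole(P, fx, fy, cx, cy):
    """Project 3D points (N, 3) in camera coordinates to pixel coordinates (N, 2)."""
    u = fx * P[:, 0] / P[:, 2] + cx
    v = fy * P[:, 1] / P[:, 2] + cy
    return np.stack([u, v], axis=1)

def sample_nearest(feat_map, uv):
    """Nearest-neighbor lookup of a (H, W, C) feature map at pixel coords (N, 2)."""
    H, W, _ = feat_map.shape
    cols = np.clip(np.rint(uv[:, 0]).astype(int), 0, W - 1)
    rows = np.clip(np.rint(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[rows, cols]

def fuse(F_p, F_d_map, F_rgb_map, P_prime, intrinsics):
    uv = project_pinhole(P_prime, *intrinsics)
    F_d2p = sample_nearest(F_d_map, uv)        # depth features on super points
    F_rgb2p = sample_nearest(F_rgb_map, uv)    # RGB features on super points
    return np.concatenate([F_p, F_d2p, F_rgb2p], axis=1)

rng = np.random.default_rng(0)
N, H, W = 128, 32, 32                          # illustrative sizes only
intrinsics = (50.0, 50.0, 16.0, 16.0)          # hypothetical fx, fy, cx, cy
P_prime = np.column_stack([rng.uniform(-0.1, 0.1, N),
                           rng.uniform(-0.1, 0.1, N),
                           rng.uniform(0.3, 0.6, N)])
F_p = rng.standard_normal((N, 16))
F_d_map = rng.standard_normal((H, W, 8))
F_rgb_map = rng.standard_normal((H, W, 8))

F = fuse(F_p, F_d_map, F_rgb_map, P_prime, intrinsics)
print(F.shape)
```

Each row of `F` then carries geometric, depth, and appearance channels for the same super point, which is the per-point representation the later stages consume.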
4. Optimization, Loss Functions, and Training Protocol
HandMCM employs a block-wise loss that supervises the keypoint estimates produced by all blocks, built on a penalty term that is defined piecewise.
The model is optimized using AdamW with a batch size of 32. Extensive data augmentation is applied, including random 3D rotation (±180°), scaling ([0.9, 1.1]), and translation (±10 mm). Epoch schedules are tailored to each benchmark (NYU: 40 epochs, HO3D: 24, DexYCB: 20, with staged decay).
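The augmentation ranges quoted above can be sketched as one rigid-plus-scale transform applied jointly to the point cloud and the ground-truth keypoints. The sampling scheme is an assumption: a uniformly random axis with an angle drawn from ±180° (Rodrigues' formula), rotation and scaling about the point cloud centroid, and coordinates in millimetres.

```python
import numpy as np

def random_rotation(rng, max_deg=180.0):
    """Rotation matrix from a random unit axis and an angle in [-max_deg, max_deg]."""
    axis = rng.standard_normal(3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    # Rodrigues' rotation formula
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def augment(points, joints, rng):
    """Apply the same random rotation, scale, and translation to both inputs."""
    R = random_rotation(rng)
    s = rng.uniform(0.9, 1.1)
    t = rng.uniform(-10.0, 10.0, size=3)       # millimetres
    center = points.mean(axis=0)               # assumed pivot: point cloud centroid

    def transform(x):
        return s * (x - center) @ R.T + center + t

    return transform(points), transform(joints), (R, s, t)

rng = np.random.default_rng(0)
points = rng.uniform(-100.0, 100.0, (1024, 3))  # synthetic hand points in mm
joints = rng.uniform(-100.0, 100.0, (21, 3))
aug_points, aug_joints, (R, s, t) = augment(points, joints, rng)
print(aug_points.shape, aug_joints.shape, round(float(s), 3))
```

Transforming the keypoints with the same `R`, `s`, and `t` keeps the labels consistent with the augmented input, which is the property any such augmentation needs to preserve.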
5. Empirical Evaluation and Ablation Analysis
HandMCM was evaluated on NYU (depth-only) and on DexYCB and HO3D (both RGBD), with the following results:
| Dataset | Modalities | J (Keypoints) | Metric | HandMCM | Previous SOTA |
|---|---|---|---|---|---|
| NYU | Depth | 14 | Mean keypoint error (mm) | 7.06 | 7.12 (HandDAGT) |
| DexYCB | RGBD | 21 | Avg. MKE over S0–S3 (mm) | 6.67 | 7.54 (K-Fusion) |
| HO3D | RGBD | 21 | Mean keypoint error (cm) | 1.71 | 1.79 (K-Fusion) |
HandMCM achieves state-of-the-art performance, outperforming previous graph-based and standard SSM methods, especially under severe occlusion. Qualitative and quantitative ablations indicate that combining the correspondence SSM with local token injection/filtering yields the largest gains (e.g., mean keypoint error drops from 8.47 mm to 7.06 mm on NYU). Accuracy improves as Mamba blocks are added, with a depth of three blocks giving the best results.
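To put the reported numbers on a common scale, the relative error reductions can be computed directly from the table and the ablation figures. The helper name `relative_reduction` is illustrative; the input values are those quoted above.

```python
def relative_reduction(new, old):
    """Percentage reduction of `new` relative to `old`."""
    return 100.0 * (old - new) / old

# (HandMCM, previous SOTA) as reported in the table
results = {
    "NYU (mm)": (7.06, 7.12),
    "DexYCB (mm)": (6.67, 7.54),
    "HO3D (cm)": (1.71, 1.79),
}
gains = {name: relative_reduction(new, old) for name, (new, old) in results.items()}
ablation_gain = relative_reduction(7.06, 8.47)   # NYU ablation: 8.47 mm -> 7.06 mm

for name, g in gains.items():
    print(f"{name}: {g:.2f}% lower error")
print(f"NYU ablation: {ablation_gain:.2f}% lower error")
```

This gives reductions of roughly 0.8% on NYU, 11.5% on DexYCB, and 4.5% on HO3D relative to the previous best method, and about 16.6% for the NYU ablation, so the margin over prior work is largest on DexYCB and smallest on NYU.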
6. Algorithmic Workflow
A high-level pseudocode sketch illustrates the sequential processing pipeline:
```
# Multi-modal super point encoder
P_prime, F_p = SetConv(P)                      # downsample P to super points
F_d = ResNetAutoencoder(D)
F_rgb = ResNetAutoencoder(R)
F_{d->p}, F_{rgb->p} = ProjectOntoSuperpoints(F_d, F_rgb, P_prime)
F = Concatenate([F_p, F_{d->p}, F_{rgb->p}])

# Keypoint token initialization
G = SetConvGlobal(F)
X_0 = BIL(G, J)                                # initial keypoint tokens
J_0 = Linear(X_0)

# Stacked correspondence Mamba blocks
for k in range(1, K+1):
    X = LayerNorm(X_{k-1})
    V, X_f, X_b = GELU(Proj(X)), GELU(W_f X), GELU(W_b Reverse(X))
    U_f, U_b = W_{u_f} SSM(X_f), W_{u_b} SSM(X_b)
    M_corr = W_c(OuterProduct(U_f, Reverse(U_b)))
    X_bar = M_corr * V
    # Local injection and filtering, per keypoint
    for j in range(J):
        X_loc = LocalSetConv(j, P_prime, F)
        X = LayerNorm(X * X_loc)
        G_j = Sigmoid(X_loc)
        J_k[j] = (G_j * X_bar[j] + (1-G_j) * X_loc) @ W_r

ComputeLossAllBlocks()
BackpropagateAndUpdate()
```
7. Scientific Contributions, Limitations, and Prospects
HandMCM introduces the first correspondence-based adaptation of the Mamba SSM for 3D hand pose estimation, enabling robust dynamic modeling of keypoint relationships that flexibly adapts to severe occlusion. It advances the field through:
- Dynamic modeling of hand keypoints via a bidirectional, scan-path SSM rather than a rigid graph.
- Local information injection and filtering mechanisms, anchoring inference in per-keypoint geometry.
- Multi-modal super point encoding, synthesizing depth, RGB, and 3D geometric cues for enhanced occlusion handling.
Noted limitations include the model's applicability to only single-hand or hand–object interaction scenarios; bimanual or closely interacting hands remain challenging. Prospective future work targets cross-hand correspondence modeling and efficient SSM variants for real-time AR/VR deployment (Cheng et al., 2 Feb 2026).