
HandMCM: Correspondence Mamba for Hand Pose

Updated 9 February 2026
  • The paper introduces HandMCM, a novel state-space model that fuses multi-modal inputs for accurate 3D hand keypoint localization.
  • It employs techniques like bidirectional gated state-space modeling and local token injection to enhance precision under occlusion.
  • Empirical evaluations on NYU, DexYCB, and HO3D datasets demonstrate significant improvements over existing methods in keypoint error reduction.

HandMCM, in the context of 3D hand pose estimation, is a multi-modal point cloud–based correspondence state space model designed for robust and accurate 3D hand keypoint localization, particularly under occlusion and in challenging scene conditions. HandMCM innovatively adapts the Mamba state-space model (SSM) architecture, introduces dynamic correspondence modeling, and fuses multi-modal features (RGB, depth, point cloud) to address the intrinsic complexity of hand articulation and keypoint visibility inherent to human-computer interaction tasks (Cheng et al., 2 Feb 2026).

1. Architectural Overview and Model Formulation

HandMCM processes three input modalities: a single-hand RGB image $R \in \mathbb{R}^{H \times W \times 3}$, a depth image $D \in \mathbb{R}^{H \times W}$, and a point cloud $P \in \mathbb{R}^{N \times 3}$ (sampled from the depth map). It constructs a fused feature representation using a multi-modal super point encoder:

  • 3D downsampling: A PointNet++-style set convolution reduces $P$ to $N' = N/2$ super points $P' \in \mathbb{R}^{N' \times 3}$, extracting geometric features $F_p \in \mathbb{R}^{N' \times C_p}$.
  • 2D encoding: ResNet-based autoencoders generate $F_d \in \mathbb{R}^{H/2 \times W/2 \times C_d}$ from $D$ and $F_\mathrm{rgb} \in \mathbb{R}^{H/2 \times W/2 \times C_\mathrm{rgb}}$ from $R$. Each is projected onto $P'$ (via nearest-neighbor/barycentric interpolation), yielding $F_{d\rightarrow p}$ and $F_{\mathrm{rgb}\rightarrow p}$.
  • Feature fusion: All modalities are concatenated: $F = [F_p \,\|\, F_{d\rightarrow p} \,\|\, F_{\mathrm{rgb}\rightarrow p}]$.
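The encoder's three steps can be sketched with toy shapes. The strided downsampling, the random stand-in features, and the pixel lookup `pix` are illustrative assumptions replacing the paper's set convolution and camera projection; only the shapes and the concatenation follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper's actual dimensions are larger.
N, H, W = 8, 4, 4            # cloud points, feature-map height/width
C_p, C_d, C_rgb = 3, 2, 2    # per-modality channel widths

P = rng.random((N, 3))               # point cloud sampled from the depth map
F_d = rng.random((H, W, C_d))        # depth feature map (post-encoder)
F_rgb = rng.random((H, W, C_rgb))    # RGB feature map (post-encoder)

# 3D downsampling to N' = N/2 super points (plain striding stands in
# for the PointNet++-style set convolution).
P_prime = P[::2]                                # (N', 3)
F_p = rng.random((len(P_prime), C_p))           # stand-in geometric features

# 2D -> 3D projection: nearest feature-map cell under each super point
# (`pix` stands in for a camera projection onto the image grid).
pix = (P_prime[:, :2] * [H - 1, W - 1]).round().astype(int)
F_d2p = F_d[pix[:, 0], pix[:, 1]]               # (N', C_d)
F_rgb2p = F_rgb[pix[:, 0], pix[:, 1]]           # (N', C_rgb)

# Fusion: concatenate all modalities per super point.
F = np.concatenate([F_p, F_d2p, F_rgb2p], axis=1)   # (N', C_p + C_d + C_rgb)
```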

Keypoint token extraction is achieved by aggregating $F$ into a global context vector $G$, followed by a bias-induced layer (BIL) that instantiates $J$ learnable keypoint tokens $X_0$ (each with a per-keypoint bias). An initial 3D keypoint estimate $J_0$ is obtained through a linear regression head.
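A minimal sketch of this token extraction: one shared projection of the global vector plus a distinct learnable bias per keypoint yields $J$ separate tokens. The `tanh` nonlinearity and all sizes here are illustrative assumptions; only the shared-projection-plus-bias structure follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)
J, C = 21, 16                  # keypoints and token width (illustrative)

G = rng.random(C)              # global context vector aggregated from F

# Bias-induced layer: shared projection + one learnable bias per keypoint.
W_bil = rng.standard_normal((C, C)) / np.sqrt(C)
B = rng.standard_normal((J, C))          # per-keypoint bias rows
X_0 = np.tanh(G @ W_bil + B)             # (J, C) initial keypoint tokens

# Linear regression head for the initial 3D estimate.
W_r = rng.standard_normal((C, 3)) / np.sqrt(C)
J_0 = X_0 @ W_r                          # (J, 3) initial keypoints
```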

The core processing occurs in $K$ stacked “correspondence Mamba” blocks, each refining the tokens via SSM and correspondence modeling. Each block applies a bidirectional gated SSM (BiGS), outputting a dynamically learned correspondence map that models spatial relationships and interactions across keypoints.

2. Dynamic Correspondence Modeling via Bidirectional State-Space Models

HandMCM reframes the traditional keypoint association problem by treating the ordered keypoint tokens as a 1D sequence (as opposed to a fixed graphical topology). Within each block $k$:

  • Forward/Backward streams: A normalized input $\widetilde{X}$ is projected via learned matrices. GELU activations and two separate projections yield $X_f$ (forward) and $X_b$ (backward), which pass through dedicated SSM layers to produce $U_f$ and $U_b$.
  • Correspondence map: The outer product $U_f \otimes \mathrm{Reverse}(U_b)$ is linearly projected to form $M_\mathrm{corr} \in \mathbb{R}^{J \times J}$, which controls spatial mixing among tokens.
  • Token update: Value projections $V$ are mixed via the correspondence map to yield $X_k$ (updated keypoint tokens), which are projected to 3D coordinates $J_k = X_k W_r$.
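The three steps above can be sketched in NumPy. A fixed-decay linear recurrence stands in for the selective Mamba scan, `tanh` for GELU, and the row-softmax on the map is an assumption; only the shapes and the outer-product construction follow the text:

```python
import numpy as np

def ssm_scan(X, a=0.9):
    """Minimal diagonal recurrence h_t = a*h_{t-1} + x_t
    (a stand-in for the selective Mamba scan)."""
    H, h = np.zeros_like(X), np.zeros(X.shape[1])
    for t, x in enumerate(X):
        h = a * h + x
        H[t] = h
    return H

rng = np.random.default_rng(0)
J, C = 21, 8                              # keypoints, channel width
X = rng.standard_normal((J, C))           # normalized keypoint tokens

V = np.tanh(X @ rng.standard_normal((C, C)))          # value projection
X_f = np.tanh(X @ rng.standard_normal((C, C)))        # forward stream
X_b = np.tanh(X[::-1] @ rng.standard_normal((C, C)))  # backward (reversed)

U_f, U_b = ssm_scan(X_f), ssm_scan(X_b)

# Outer product of the scans, projected channel-wise to one scalar per
# token pair, yields the J x J correspondence map.
w_c = rng.standard_normal(C)
M_corr = (U_f[:, None, :] * U_b[::-1][None, :, :]) @ w_c   # (J, J)
M_corr = np.exp(M_corr - M_corr.max(axis=1, keepdims=True))
M_corr /= M_corr.sum(axis=1, keepdims=True)    # row-normalize (assumption)

X_bar = M_corr @ V                        # spatially mixed tokens, (J, C)
```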

Local geometric information is incorporated through a “local injection and filtering” mechanism. For each predicted keypoint, a local context is constructed from its $K$ nearest super points, concatenated with the keypoint’s global token and local geometric feature, and processed by a lightweight set-conv network. Local tokens are injected multiplicatively and combined through a learned gating mechanism, ensuring that each keypoint estimate is anchored to local evidence.
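A sketch of this local injection and filtering step, under stated assumptions: max-pooled k-NN features replace the lightweight set-conv network, and since the exact gate input is not specified in the text, the gate here reads the multiplicatively injected token:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
J, C, N_sp, K_nn = 21, 8, 64, 5   # keypoints, channels, super points, k-NN size

P_prime = rng.random((N_sp, 3))   # super-point coordinates
F = rng.random((N_sp, C))         # fused super-point features
J_est = rng.random((J, 3))        # current 3D keypoint estimates
X_bar = rng.random((J, C))        # globally mixed keypoint tokens

W_loc = rng.standard_normal((C, C)) / np.sqrt(C)
W_g = rng.standard_normal((C, C)) / np.sqrt(C)

X_out = np.empty_like(X_bar)
for j in range(J):
    # The K nearest super points to keypoint j supply the local context.
    d = np.linalg.norm(P_prime - J_est[j], axis=1)
    nbrs = np.argsort(d)[:K_nn]
    # Max pooling stands in for the lightweight set-conv network.
    X_loc = np.tanh(F[nbrs].max(axis=0) @ W_loc)
    # Multiplicative injection, then a learned gate blends global/local.
    g = sigmoid((X_bar[j] * X_loc) @ W_g)
    X_out[j] = g * X_bar[j] + (1 - g) * X_loc
```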

3. Multi-Modal Feature Integration and Projection

Comprehensive fusion of 2D and 3D cues is pivotal for occlusion robustness:

  • Point cloud super point features ($F_p$) are obtained by downsampling $P$.
  • Depth ($F_d$) and RGB ($F_\mathrm{rgb}$) features are derived from independent ResNet autoencoders.
  • 2D features are aligned with 3D space by projecting them onto the super points $P'$.
  • The concatenated final feature per super point combines spatial, depth, and appearance cues, yielding a highly expressive representation for the SSM-driven keypoint regression.

4. Optimization, Loss Functions, and Training Protocol

HandMCM employs a block-wise loss defined as

$$\mathcal{L} = \sum_{k=0}^{K} \sum_{j=1}^{J} \mathrm{smooth}_{L1}\left(J_{k,j} - J_j^*\right)$$

where $J_{k,j}$ is block $k$'s estimate of keypoint $j$, $J_j^*$ is the corresponding ground truth, and $\mathrm{smooth}_{L1}$ is defined piecewise:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2/0.01 & |x| < 0.01 \\ |x| - 0.005 & \text{otherwise} \end{cases}$$

The model is optimized using AdamW ($\beta_1 = 0.5$, $\beta_2 = 0.999$) with a learning rate of $1 \times 10^{-3}$ and batch size 32. Extensive data augmentation is applied, including random 3D rotation (±180°), scaling ([0.9, 1.1]), and translation (±10 mm). Epoch schedules are tailored per benchmark (NYU: 40 epochs, HO3D: 24, DexYCB: 20, with staged decay).
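The block-wise objective can be sketched directly. The quadratic branch is written in the standard smooth-L1 form $0.5x^2/\beta$ with $\beta = 0.01$, which meets the linear branch $|x| - 0.005$ continuously at $|x| = 0.01$; the exact branch form and coordinate units are assumptions inferred from the stated constants:

```python
import numpy as np

def smooth_l1(x, beta=0.01):
    """Smooth-L1 with threshold beta: quadratic inside, linear with a
    -beta/2 offset outside; the branches meet at |x| = beta."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax**2 / beta, ax - 0.5 * beta)

def blockwise_loss(preds, target):
    """Sum smooth-L1 over every block's keypoint estimate J_0 ... J_K."""
    return sum(smooth_l1(J_k - target).sum() for J_k in preds)

rng = np.random.default_rng(0)
K, J = 3, 21
target = rng.random((J, 3))                       # ground-truth keypoints
preds = [target + 0.005 * rng.standard_normal((J, 3)) for _ in range(K + 1)]
loss = blockwise_loss(preds, target)              # scalar training loss
```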

5. Empirical Evaluation and Ablation Analysis

HandMCM was evaluated on NYU (depth-only), DexYCB, and HO3D (both RGBD) datasets with the following setup:

| Dataset | Modalities | J (keypoints) | Metric | HandMCM | Previous SOTA |
|---------|------------|---------------|--------|---------|---------------|
| NYU | Depth | 14 | Mean keypoint error (mm) | 7.06 | 7.12 (HandDAGT) |
| DexYCB | RGBD | 21 | Avg. MKE over S0–S3 (mm) | 6.67 | 7.54 (K-Fusion) |
| HO3D | RGBD | 21 | Mean keypoint error (cm) | 1.71 | 1.79 (K-Fusion) |

HandMCM demonstrates state-of-the-art performance, outperforming previous graph-based and standard SSM methods, especially under severe occlusion. Qualitative and quantitative ablations indicate that combining the correspondence SSM with local token injection/filtering yields the largest gains (e.g., reducing NYU error from 8.47 mm to 7.06 mm). Accuracy peaks at a stack depth of three correspondence Mamba blocks.

6. Algorithmic Workflow

A high-level pseudocode sketch illustrates the sequential processing pipeline:

P_prime, F_p = SetConv(P)                      # super points + geometric features
F_d = ResNetAutoencoder(D)
F_rgb = ResNetAutoencoder(R)
F_d2p, F_rgb2p = ProjectOntoSuperpoints(F_d, F_rgb, P_prime)
F = Concatenate([F_p, F_d2p, F_rgb2p])         # fused super-point features

G = SetConvGlobal(F)                           # global context vector
X = BIL(G, J)                                  # J initial keypoint tokens (X_0)
J_est = Linear(X)                              # initial 3D estimate (J_0)

for k in range(1, K + 1):                      # correspondence Mamba blocks
    X_tilde = LayerNorm(X)
    V   = GELU(Proj(X_tilde))                  # value projection
    X_f = GELU(W_f @ X_tilde)                  # forward stream
    X_b = GELU(W_b @ Reverse(X_tilde))         # backward stream
    U_f, U_b = W_uf @ SSM_f(X_f), W_ub @ SSM_b(X_b)
    M_corr = W_c(OuterProduct(U_f, Reverse(U_b)))  # J x J correspondence map
    X_bar = M_corr @ V                         # spatial mixing among tokens
    for j in range(J):                         # local injection and filtering
        X_loc = LocalSetConv(J_est[j], P_prime, F)  # K nearest super points
        g = Sigmoid(Gate(X_bar[j] * X_loc))    # learned gate
        X[j] = g * X_bar[j] + (1 - g) * X_loc  # multiplicative injection + blend
    J_est = X @ W_r                            # refined 3D keypoints (J_k)

loss = SmoothL1LossOverAllBlocks(J_0, ..., J_K, GroundTruth)
BackpropagateAndUpdate(loss)

7. Scientific Contributions, Limitations, and Prospects

HandMCM introduces the first correspondence-based adaptation of the Mamba SSM for 3D hand pose estimation, enabling robust dynamic modeling of keypoint relationships that flexibly adapts to severe occlusion. It advances the field through:

  • Dynamic modeling of hand keypoints via a bidirectional, scan-path SSM rather than a rigid graph.
  • Local information injection and filtering mechanisms, anchoring inference in per-keypoint geometry.
  • Multi-modal super point encoding, synthesizing depth, RGB, and 3D geometric cues for enhanced occlusion handling.

Noted limitations include the model's applicability to only single-hand or hand–object interaction scenarios; bimanual or closely interacting hands remain challenging. Prospective future work targets cross-hand correspondence modeling and efficient SSM variants for real-time AR/VR deployment (Cheng et al., 2 Feb 2026).

