HandMCM: Correspondence Mamba for Hand Pose

Updated 9 February 2026

The paper introduces HandMCM, a novel state-space model that fuses multi-modal inputs for accurate 3D hand keypoint localization.
It employs techniques like bidirectional gated state-space modeling and local token injection to enhance precision under occlusion.
Empirical evaluations on NYU, DexYCB, and HO3D datasets demonstrate significant improvements over existing methods in keypoint error reduction.

HandMCM is a multi-modal, point cloud-based correspondence state-space model for 3D hand pose estimation, designed for robust and accurate 3D keypoint localization, particularly under occlusion and in challenging scene conditions. It adapts the Mamba state-space model (SSM) architecture, introduces dynamic correspondence modeling, and fuses multi-modal features (RGB, depth, point cloud) to handle the complex hand articulation and partial keypoint visibility that arise in human-computer interaction tasks (Cheng et al., 2 Feb 2026).

1. Architectural Overview and Model Formulation

HandMCM processes three input modalities: a single-hand RGB image $R \in \mathbb{R}^{H \times W \times 3}$, a depth image $D \in \mathbb{R}^{H \times W}$, and a point cloud $P \in \mathbb{R}^{N \times 3}$ (sampled from the depth map). It constructs a fused feature representation using a multi-modal super point encoder:

3D downsampling: PointNet++-style set convolution reduces $P$ to $N' = N/2$ super points $P' \in \mathbb{R}^{N' \times 3}$, extracting geometric features $F_p \in \mathbb{R}^{N' \times C_p}$.
2D encoding: ResNet-based autoencoders generate $F_d \in \mathbb{R}^{H/2 \times W/2 \times C_d}$ from $D$ and $F_\text{rgb} \in \mathbb{R}^{H/2 \times W/2 \times C_\text{rgb}}$ from $R$. Each is projected onto $P'$ (via nearest-neighbor/barycentric interpolation), yielding $F_{d\rightarrow p}$ and $F_{\text{rgb}\rightarrow p}$.
Feature fusion: The per-super-point features of all modalities are concatenated: $F = [F_p \| F_{d\rightarrow p} \| F_{\text{rgb}\rightarrow p}]$.
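The projection and fusion steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), uses nearest-neighbor lookup only (no barycentric interpolation), and fills all feature maps with random values in place of real encoder outputs.

```python
import numpy as np

def project_to_superpoints(feat_map, points, fx, fy, cx, cy, stride=2):
    """Nearest-neighbour lookup of a 2D feature map at projected 3D points.

    feat_map: (H/stride, W/stride, C); points: (N', 3) camera-space, z > 0.
    The pinhole intrinsics are assumptions made for this sketch.
    """
    h, w, _ = feat_map.shape
    # Pinhole projection to full-resolution pixel coordinates.
    u = fx * points[:, 0] / points[:, 2] + cx
    v = fy * points[:, 1] / points[:, 2] + cy
    # Map to feature-map resolution and clamp to the valid index range.
    col = np.clip(np.rint(u / stride).astype(int), 0, w - 1)
    row = np.clip(np.rint(v / stride).astype(int), 0, h - 1)
    return feat_map[row, col]  # (N', C)

rng = np.random.default_rng(0)
H, W, n_sp = 64, 64, 128            # illustrative sizes
C_p, C_d, C_rgb = 16, 8, 8

# Random stand-ins for super points and encoder outputs.
P_prime = np.column_stack([rng.uniform(-0.1, 0.1, n_sp),
                           rng.uniform(-0.1, 0.1, n_sp),
                           rng.uniform(0.4, 0.6, n_sp)])
F_p = rng.normal(size=(n_sp, C_p))
F_d = rng.normal(size=(H // 2, W // 2, C_d))
F_rgb = rng.normal(size=(H // 2, W // 2, C_rgb))

F_d2p = project_to_superpoints(F_d, P_prime, 100.0, 100.0, 32.0, 32.0)
F_rgb2p = project_to_superpoints(F_rgb, P_prime, 100.0, 100.0, 32.0, 32.0)

# Channel-wise concatenation: one fused feature vector per super point.
F = np.concatenate([F_p, F_d2p, F_rgb2p], axis=1)
print(F.shape)  # (128, 32)
```

Each super point ends up with a single feature vector whose channel count is $C_p + C_d + C_\text{rgb}$, which is what the downstream token extraction consumes.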

Keypoint token extraction is achieved by aggregating $F$ into a global context vector $G$, followed by a bias-induced layer (BIL) that instantiates $J$ learnable keypoint tokens $X_0$ (each with a per-keypoint bias). An initial 3D keypoint estimate $J_0$ is obtained through a linear regression head.
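One plausible reading of the token extraction step is sketched below. The exact form of the BIL is not given in this summary, so the sketch assumes max-pooling for the global aggregation and a shared linear map plus a per-keypoint bias for the BIL; the names `W_bil`, `B`, and `W_r` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_joints, C, n_sp = 21, 32, 128      # illustrative sizes

F = rng.normal(size=(n_sp, C))       # stand-in for fused super point features
G = F.max(axis=0)                    # assumed aggregation: max-pool to a global vector (C,)

# Assumed BIL form: one shared linear map of G, plus a learnable bias per keypoint.
W_bil = rng.normal(scale=0.1, size=(C, C))
B = rng.normal(scale=0.1, size=(n_joints, C))
X0 = G @ W_bil + B                   # broadcasting yields (n_joints, C) tokens

# Linear regression head producing the initial 3D estimate.
W_r = rng.normal(scale=0.1, size=(C, 3))
J0 = X0 @ W_r                        # (n_joints, 3)
```

Under this assumption every token shares the same global content and differs only through its bias, which is what lets one global vector seed $J$ distinct keypoint tokens.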

The core processing occurs in $K$ stacked “correspondence Mamba” blocks, each refining the tokens via SSM and correspondence modeling. Each block applies a bidirectional gated SSM (BiGS), outputting a dynamically learned correspondence map that models spatial relationships and interactions across keypoints.

2. Dynamic Correspondence Modeling via Bidirectional State-Space Models

HandMCM reframes the traditional keypoint association problem by treating the ordered keypoint tokens as a 1D sequence rather than as nodes of a fixed graph topology. Within each block $k$:

Forward/Backward streams: A normalized input $\widetilde{X}$ is projected via learned matrices. GELU activations and two separate projections yield $X_f$ (forward) and $X_b$ (backward), passed through dedicated SSM layers to produce $U_f$ and $U_b$.
Correspondence map: The outer product $U_f \otimes \text{Reverse}(U_b)$ is linearly projected to form $M_\text{corr} \in \mathbb{R}^{J \times J}$, controlling spatial mixing among tokens.
Token update: Value projections $V$ are mixed via the correspondence map to yield $X_k$ (updated keypoint tokens) and projected to 3D coordinates $J_k = X_k W_r$.
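The three steps above can be illustrated with a toy NumPy version. This is a sketch under stated assumptions, not the paper's block: the selective Mamba SSM is replaced by a fixed diagonal linear recurrence, the outer product is taken per channel over the token axis and then projected over channels with a hypothetical vector `w_c`, and the output projections on $U_f$ and $U_b$ are omitted.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalise each token over its channel axis (no learned affine terms).
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def diagonal_ssm(x, a, b, c):
    # Toy diagonal SSM scanned along the token axis (stand-in for Mamba):
    # h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t  (elementwise per channel).
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def correspondence_block(X, p):
    Xn = layer_norm(X)
    V = gelu(Xn @ p["W_v"])                       # value tokens
    X_f = gelu(Xn @ p["W_f"])                     # forward stream
    X_b = gelu(Xn[::-1] @ p["W_b"])               # backward stream on reversed tokens
    U_f = diagonal_ssm(X_f, *p["ssm_f"])
    U_b = diagonal_ssm(X_b, *p["ssm_b"])[::-1]    # reverse back to the original order
    # Assumed form: per-channel outer product over tokens, projected to (J, J).
    M_corr = np.einsum("ic,jc,c->ij", U_f, U_b, p["w_c"])
    return M_corr @ V, M_corr                     # mixed tokens and the map

rng = np.random.default_rng(0)
n_joints, C = 21, 16

def ssm_params():
    return rng.uniform(0.5, 0.9, C), rng.normal(size=C), rng.normal(size=C)

params = {
    "W_v": rng.normal(scale=0.3, size=(C, C)),
    "W_f": rng.normal(scale=0.3, size=(C, C)),
    "W_b": rng.normal(scale=0.3, size=(C, C)),
    "ssm_f": ssm_params(),
    "ssm_b": ssm_params(),
    "w_c": rng.normal(scale=0.1, size=C),
}
X = rng.normal(size=(n_joints, C))
X_new, M_corr = correspondence_block(X, params)
print(X_new.shape, M_corr.shape)  # (21, 16) (21, 21)
```

Because the forward scan summarizes tokens before position $i$ and the reversed backward scan summarizes tokens after position $j$, entry $(i, j)$ of the map depends on context from both directions, unlike a fixed adjacency matrix.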

Local geometric information is incorporated through a “local injection and filtering” mechanism. For each predicted keypoint, a local context is constructed from its $K$ nearest super points (here $K$ denotes the neighborhood size, not the number of blocks), concatenated with the keypoint’s global token and local geometric feature, and processed by a lightweight set-conv network. Local tokens are injected multiplicatively and combined through a learned gating mechanism, ensuring that each keypoint estimate is anchored to local evidence.
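A rough NumPy sketch of the neighborhood gathering and gating is given below. It is an assumption-laden illustration: the set-conv network is replaced by a single shared linear layer with max-pooling, the gate follows the form used in the workflow pseudocode of Section 6 (a sigmoid of the local token), the multiplicative injection step is left out, and `W_loc`, `k`, and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_inject(joints, X_bar, P_prime, F, W_loc, k=8):
    """Gate each keypoint token against a token built from its k nearest super points."""
    X_out = np.empty_like(X_bar)
    X_loc = np.empty_like(X_bar)
    for j in range(joints.shape[0]):
        # k nearest super points of the current keypoint estimate.
        d2 = ((P_prime - joints[j]) ** 2).sum(axis=1)
        idx = np.argsort(d2)[:k]
        rel = P_prime[idx] - joints[j]                       # (k, 3) relative offsets
        # Local context: offsets, super point features, and the keypoint's token.
        ctx = np.concatenate([rel, F[idx], np.tile(X_bar[j], (k, 1))], axis=1)
        # Shared linear layer + max-pool, a stand-in for the set-conv network.
        X_loc[j] = np.tanh(ctx @ W_loc).max(axis=0)
        gate = sigmoid(X_loc[j])                             # filtering gate in (0, 1)
        X_out[j] = gate * X_bar[j] + (1.0 - gate) * X_loc[j]
    return X_out, X_loc

rng = np.random.default_rng(0)
n_joints, n_sp, C_f, C = 21, 128, 32, 16
P_prime = rng.uniform(-0.1, 0.1, size=(n_sp, 3))
F = rng.normal(size=(n_sp, C_f))
joints = rng.uniform(-0.1, 0.1, size=(n_joints, 3))
X_bar = rng.normal(size=(n_joints, C))
W_loc = rng.normal(scale=0.1, size=(3 + C_f + C, C))

X_out, X_loc = local_inject(joints, X_bar, P_prime, F, W_loc)
```

Since the gate lies strictly between 0 and 1, each output channel is a convex combination of the globally mixed token and the local token, so a keypoint with weak global evidence can fall back on nearby geometry.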

3. Multi-Modal Feature Fusion

Comprehensive fusion of 2D and 3D cues is pivotal for occlusion robustness:

Point cloud super point features ($F_p$) are obtained by downsampling $P$.
Depth ($F_d$) and RGB ($F_\text{rgb}$) features are derived from independent ResNet autoencoders.
2D features are aligned with 3D space by projecting them onto the super points $P'$.
The concatenated final feature per super point combines spatial, depth, and appearance cues, yielding a highly expressive representation for the SSM-driven keypoint regression.

4. Optimization, Loss Functions, and Training Protocol

HandMCM employs a block-wise loss defined as

$\mathcal{L} = \sum_{k=0}^K \sum_{j=1}^J \mathrm{smooth}_{L1}(j_{k,j} - j_{j}^*)$

where $\mathrm{smooth}_{L1}$ is defined piecewise:

$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5|x| & |x| < 0.01 \\ |x| - 0.005 & \text{otherwise} \end{cases}$
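The block-wise loss above can be written in a few lines of NumPy. The sketch follows the piecewise definition exactly as stated here; it assumes the function is applied elementwise to each coordinate of the residual and summed, and the array shapes and the name `blockwise_loss` are illustrative.

```python
import numpy as np

def smooth_l1(x, thresh=0.01):
    # Piecewise form as written above: 0.5|x| below the threshold,
    # |x| - 0.005 otherwise (0.005 = 0.5 * thresh).
    ax = np.abs(x)
    return np.where(ax < thresh, 0.5 * ax, ax - 0.5 * thresh)

def blockwise_loss(preds, target):
    """preds: (K + 1, J, 3) estimates from blocks 0..K; target: (J, 3)."""
    # Assumption: elementwise over x, y, z, summed over blocks, joints, coordinates.
    return smooth_l1(preds - target[None]).sum()

rng = np.random.default_rng(0)
K, n_joints = 3, 21
target = rng.normal(size=(n_joints, 3))
preds = target[None] + rng.normal(scale=0.02, size=(K + 1, n_joints, 3))
loss = blockwise_loss(preds, target)
```

Summing over $k = 0, \dots, K$ supervises the initial estimate and every intermediate block, not only the final output.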

The model is optimized using AdamW ($\beta_1 = 0.5, \beta_2 = 0.999$) with a learning rate of $1\times 10^{-3}$ and batch size 32. Extensive data augmentation is applied, including random 3D rotation (±180°), scaling ([0.9, 1.1]), and translation (±10 mm). Epoch schedules are tailored for each benchmark (NYU: 40 epochs, HO3D: 24, DexYCB: 20, with staged decay).
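The augmentation ranges quoted above could be applied to a point cloud and its keypoint labels roughly as follows. The rotation axis is not specified in this summary, so the sketch assumes rotation about the camera z-axis; it also assumes coordinates in millimeters and applies the same transform to points and labels so they stay aligned.

```python
import numpy as np

def augment(points, joints, rng):
    """Apply one random rotation, scale, and translation to points and labels."""
    theta = rng.uniform(-np.pi, np.pi)           # rotation within ±180°
    c, s = np.cos(theta), np.sin(theta)
    # Assumption: rotation about the z (viewing) axis only.
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    scale = rng.uniform(0.9, 1.1)                # scaling in [0.9, 1.1]
    t = rng.uniform(-10.0, 10.0, size=3)         # translation within ±10 mm

    def transform(x):
        return scale * (x @ R.T) + t

    return transform(points), transform(joints)

rng = np.random.default_rng(0)
points = rng.normal(scale=50.0, size=(256, 3))   # stand-in hand point cloud (mm)
joints = rng.normal(scale=50.0, size=(21, 3))    # stand-in keypoint labels (mm)
points_aug, joints_aug = augment(points, joints, rng)
```

Because the transform is a similarity, all pairwise distances are multiplied by the same scale factor, so the relative geometry between the point cloud and its labels is preserved.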

5. Empirical Evaluation and Ablation Analysis

HandMCM was evaluated on the NYU (depth-only), DexYCB, and HO3D (both RGBD) datasets. The table below compares its error with the previous state of the art (SOTA) on each benchmark:

Dataset	Modalities	J (Keypoints)	Metric	HandMCM	Previous SOTA
NYU	Depth	14	Mean keypoint error (mm)	7.06	7.12 (HandDAGT)
DexYCB	RGBD	21	Avg. MKE over S0–S3 (mm)	6.67	7.54 (K-Fusion)
HO3D	RGBD	21	Mean keypoint error (cm)	1.71	1.79 (K-Fusion)

HandMCM achieves state-of-the-art performance, outperforming previous graph-based and standard SSM methods, especially under severe occlusion. Qualitative and quantitative ablations indicate that the combination of correspondence SSM and local token injection/filtering yields the most significant gains (e.g., a reduction from 8.47 mm to 7.06 mm on NYU). Stacking up to three Mamba blocks yields the best results.
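For reference, the mean keypoint error reported in the table is commonly computed as the average Euclidean distance between predicted and ground-truth keypoints. The sketch below shows that standard definition on synthetic data; the function name and array layout are illustrative, and benchmark-specific protocol details (such as alignment or per-split averaging) are not covered.

```python
import numpy as np

def mean_keypoint_error(pred, gt):
    """pred, gt: (num_frames, J, 3) in the same unit; returns the mean per-joint distance."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# Synthetic check: every keypoint is offset by a 3-4-5 triangle, so the error is 5.
gt = np.zeros((2, 14, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
mke = mean_keypoint_error(pred, gt)
print(mke)  # 5.0
```

The result is in whatever unit the inputs use, which is why the table lists millimeters for NYU and DexYCB but centimeters for HO3D.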

6. Algorithmic Workflow

A high-level pseudocode sketch illustrates the sequential processing pipeline:

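# Multi-modal super point encoder
P_prime, F_p = SetConv(P)                  # downsample P to N' super points
F_d = ResNetAutoencoder(D)                 # depth feature map
F_rgb = ResNetAutoencoder(R)               # RGB feature map
F_{d->p}, F_{rgb->p} = ProjectOntoSuperpoints(F_d, F_rgb, P_prime)
F = Concatenate([F_p, F_{d->p}, F_{rgb->p}])   # fused per-super-point features
G = SetConvGlobal(F)                       # global context vector
X_0 = BIL(G, J)                            # initial keypoint tokens
J_0 = Linear(X_0)                          # initial 3D keypoint estimate

# K stacked correspondence Mamba blocks
for k in range(1, K+1):
    X = LayerNorm(X_{k-1})
    V, X_f, X_b = GELU(Proj(X)), GELU(W_f X), GELU(W_b Reverse(X))
    U_f, U_b = W_{u_f} SSM(X_f), W_{u_b} SSM(X_b)
    M_corr = W_c(OuterProduct(U_f, Reverse(U_b)))    # J x J correspondence map
    X_bar = M_corr @ V                               # mix value tokens across keypoints
    for j in range(J):
        X_loc = LocalSetConv(j, P_prime, F)          # local token for keypoint j
        X = LayerNorm(X * X_loc)                     # multiplicative local injection
        G_j = Sigmoid(X_loc)                         # filtering gate
        X_k[j] = G_j * X_bar[j] + (1-G_j) * X_loc    # gated token update
        J_k[j] = X_k[j] @ W_r                        # refined 3D coordinates
ComputeLossAllBlocks()                     # smooth-L1 loss over blocks 0..K
BackpropagateAndUpdate()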

7. Scientific Contributions, Limitations, and Prospects

HandMCM introduces the first correspondence-based adaptation of the Mamba SSM for 3D hand pose estimation, enabling robust dynamic modeling of keypoint relationships that flexibly adapts to severe occlusion. It advances the field through:

Dynamic modeling of hand keypoints via a bidirectional, scan-path SSM rather than a rigid graph.
Local information injection and filtering mechanisms, anchoring inference in per-keypoint geometry.
Multi-modal super point encoding, synthesizing depth, RGB, and 3D geometric cues for enhanced occlusion handling.

Noted limitations include the model's applicability to only single-hand or hand–object interaction scenarios; bimanual or closely interacting hands remain challenging. Prospective future work targets cross-hand correspondence modeling and efficient SSM variants for real-time AR/VR deployment (Cheng et al., 2 Feb 2026).

References (1)

HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation (2026)
