SGTM: Skeleton Guided Temporal Modeling

Updated 24 November 2025
  • SGTM is a module that injects motion-aware temporal cues from privileged skeleton data into video-based person ReID pipelines, enhancing temporal modeling in ViT-based encoders.
  • It leverages a Learning Using Privileged Information approach by integrating Message Token Encoding, Auxiliary Temporal Distillation, and Temporal Aggregation to fuse skeleton and visual features.
  • Empirical results demonstrate measurable gains, such as a 1.7% mAP improvement on the MARS benchmark, confirming its effectiveness for spatiotemporal representation learning.

Skeleton Guided Temporal Modeling (SGTM) is a module designed for injecting motion-aware temporal cues, distilled from privileged skeleton data, into the visual representation learning process for video-based person re-identification (ReID). SGTM is a key component of the CSIP-ReID (Contrastive Skeleton-Image Pretraining for ReID) framework and leverages the Learning Using Privileged Information (LUPI) paradigm by utilizing skeleton sequences during training, but not inference, to bootstrap temporal modeling in transformer-based visual encoders. SGTM aims to address the deficiency of temporal modeling capacity in standard ViT-based visual pipelines, enabling the resulting models to robustly capture spatiotemporal discriminative features critical to person ReID benchmarks (Lin et al., 17 Nov 2025).

1. Role and Motivation of Skeleton Guided Temporal Modeling

SGTM is introduced in Stage 2 of the CSIP-ReID pipeline, following the Prototype Fusion Updater (PFU). SGTM’s primary purpose is to remedy the lack of explicit temporal modeling in vanilla ViT-based visual backbones, which focus mainly on static spatial information. By employing skeleton sequences (obtained and projected to compatible dimensions) as an auxiliary supervision signal under LUPI, SGTM transfers motion patterns into the visual feature stream such that, during inference, temporal-aware representations are accessible even when only RGB video is available. This approach enables the model to exploit the spatiotemporal structure of skeleton data—regarded as a privileged modality at training time—for endowing the visual encoder with temporal sensitivity.

2. Input and Output Specifications

SGTM operates on synchronized visual and skeleton token sequences output by the preceding stages:

  • Visual token sequence: $X^{vis} = \{ x^{vis}_{t,i} \} \in \mathbb{R}^{T \times (1+N_p) \times C}$, where $T$ is the number of frames, $N_p$ is the number of ViT patches per frame, and $C$ is the feature dimensionality.
  • Skeleton token sequence: $X^{ske} = \{ x^{ske}_{t,j} \} \in \mathbb{R}^{T \times (1+J) \times C}$, with $J$ the number of skeleton joints (e.g., 17).
  • Output: Refined frame-level logits $z \in \mathbb{R}^{B \times T \times K}$, where $B$ is batch size and $K$ is the number of person identities, used in a framewise cross-entropy training loss.
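As a concrete shape check, the tensors above can be sketched in PyTorch; the dimension values are illustrative (taken from Section 6, except the batch size and identity count, which are assumptions):

```python
import torch

T, Np, J, C = 8, 128, 17, 768   # frames, patches per frame, joints, feature dim
B, K = 4, 625                   # batch size and number of identities (illustrative)

# Visual tokens: per frame, 1 CLS token plus Np patch tokens
X_vis = torch.randn(T, 1 + Np, C)
# Skeleton tokens: per frame, 1 class token plus J joint tokens
X_ske = torch.randn(T, 1 + J, C)
# SGTM output: refined frame-level identity logits
z = torch.randn(B, T, K)
```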

3. SGTM Architectural Design

SGTM comprises three sequential submodules: Message Token Encoding (MTE), Auxiliary Temporal Distillation (ATD), and Temporal Aggregation (TA).

3.1 Message Token Encoding (MTE)

For each frame $t$, SGTM pools the set of visual tokens into an average “message” vector $v_t$ and the skeleton tokens into $s_t$:

$$v_t = \mathrm{Pool}(X^{vis}_t) \in \mathbb{R}^C,\quad s_t = \mathrm{Pool}(X^{ske}_t) \in \mathbb{R}^C$$

Projecting and stacking over $T$ frames yields sequences $V = (v_1, \dots, v_T) \in \mathbb{R}^{T \times C}$ and $S = (s_1, \dots, s_T) \in \mathbb{R}^{T \times C}$. Each is processed independently by a lightweight temporal transformer (temporal multi-head self-attention, MHSA):

$$\widetilde{M}^{vis} = \mathrm{MHSA}(W_v V),\quad \widetilde{M}^{ske} = \mathrm{MHSA}(W_s S)$$

Here $W_v, W_s \in \mathbb{R}^{C \times C}$ are learnable projections. This produces temporally encoded messages for both modalities.
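A minimal PyTorch sketch of MTE, assuming mean pooling and a single shared temporal MHSA layer (layer count, head count, and all module names are our own illustrative choices):

```python
import torch
import torch.nn as nn

class MessageTokenEncoding(nn.Module):
    """Sketch of MTE: per-frame mean pooling of tokens into message vectors,
    then temporal self-attention over the T messages of each modality."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)   # W_v
        self.proj_s = nn.Linear(dim, dim)   # W_s
        # Shared lightweight temporal transformer (f_temp in Section 4)
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)

    def encode(self, msgs, proj):
        # msgs: (B, T, C) message sequence; self-attention runs over T
        h = proj(msgs)
        out, _ = self.mhsa(h, h, h)
        return out

    def forward(self, x_vis, x_ske):
        # x_vis: (B, T, 1+Np, C), x_ske: (B, T, 1+J, C)
        v = x_vis.mean(dim=2)   # (B, T, C) pooled visual messages
        s = x_ske.mean(dim=2)   # (B, T, C) pooled skeleton messages
        return self.encode(v, self.proj_v), self.encode(s, self.proj_s)

# Tiny demo with reduced dimensions
mte = MessageTokenEncoding(dim=64, heads=4)
m_vis, m_ske = mte(torch.randn(2, 4, 10, 64), torch.randn(2, 4, 5, 64))
```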

3.2 Auxiliary Temporal Distillation (ATD)

ATD performs cross-attention over the temporally encoded messages to distill privileged skeleton dynamics into visual features. At each frame $t$:

$$\hat{m}^{vis}_t = \mathrm{CrossAttn}\bigl(q=\widetilde{m}^{vis}_t,\; k=\widetilde{m}^{ske}_t,\; v=\widetilde{m}^{ske}_t\bigr) = \mathrm{softmax}\Bigl(\frac{(\widetilde{m}^{vis}_t W_Q)(\widetilde{m}^{ske}_t W_K)^\top}{\sqrt{d}}\Bigr)(\widetilde{m}^{ske}_t W_V)$$

Here, $W_Q, W_K, W_V \in \mathbb{R}^{C \times C}$ are cross-attention projections, and $d = C$. Additional “type embeddings” $E \in \mathbb{R}^{4 \times C}$ differentiate the token types $\{x^{vis}, \widetilde{m}^{vis}, x^{ske}, \widetilde{m}^{ske}\}$.
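A minimal sketch of the ATD cross-attention, assuming the per-frame formula is applied in batched form so each visual message attends over all $T$ skeleton messages (this batched reading, plus the module and variable names, are our own assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxiliaryTemporalDistillation(nn.Module):
    """Sketch of ATD: visual messages query skeleton messages via
    single-head cross-attention with 1/sqrt(d) scaling, d = C."""
    def __init__(self, dim=768):
        super().__init__()
        self.W_Q = nn.Linear(dim, dim, bias=False)
        self.W_K = nn.Linear(dim, dim, bias=False)
        self.W_V = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, m_vis, m_ske):
        # m_vis, m_ske: (B, T, C) temporally encoded messages
        q, k, v = self.W_Q(m_vis), self.W_K(m_ske), self.W_V(m_ske)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        return attn @ v   # (B, T, C) distilled visual messages

# Tiny demo with reduced dimensions
atd = AuxiliaryTemporalDistillation(dim=64)
m_hat = atd(torch.randn(2, 4, 64), torch.randn(2, 4, 64))
```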

3.3 Temporal Aggregation (TA)

For each frame $t$, a unified token sequence is created by concatenating:

  • the per-frame visual patch tokens $x^{vis}_{t,1}, \dots, x^{vis}_{t,1+N_p}$
  • the distilled visual message $\hat{m}^{vis}_t$
  • the skeleton patch tokens $x^{ske}_{t,1}, \dots, x^{ske}_{t,1+J}$
  • the temporally encoded skeleton message $\widetilde{m}^{ske}_t$

Formally:

$$U^{(t)} = \bigl[\, x^{vis}_{t,1}, \dots, x^{vis}_{t,1+N_p} \,\|\, \hat{m}^{vis}_t \,\|\, x^{ske}_{t,1}, \dots, x^{ske}_{t,1+J} \,\|\, \widetilde{m}^{ske}_t \,\bigr] \in \mathbb{R}^{L \times C}$$

with $L = (1+N_p) + 1 + (1+J) + 1$. Stacked over $B$ batch items and $T$ frames, this yields $U \in \mathbb{R}^{(BT) \times L \times C}$. A standard transformer block is applied to each frame's token sequence, followed by attention pooling that reduces the $L$ tokens to a single vector, resulting in per-frame logits $z_{i,t} \in \mathbb{R}^K$. Notably, skeleton-derived tokens are dropped at inference, so only $\{x^{vis}, \hat{m}^{vis}\}$ are retained.
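The concatenate-transform-pool pipeline above can be sketched as follows; the learnable pooling query, layer sizes, and names are illustrative assumptions, not the published implementation:

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Sketch of TA: concatenate per-frame tokens into U, run a standard
    transformer block per frame, attention-pool to one vector, classify."""
    def __init__(self, dim=768, heads=8, num_ids=625):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                batch_first=True)
        self.pool_q = nn.Parameter(torch.randn(1, 1, dim))  # pooling query
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_ids)

    def forward(self, x_vis, m_vis_hat, x_ske, m_ske):
        # x_vis: (B, T, 1+Np, C), m_vis_hat: (B, T, C),
        # x_ske: (B, T, 1+J, C), m_ske: (B, T, C)
        B, T = x_vis.shape[:2]
        u = torch.cat([x_vis, m_vis_hat.unsqueeze(2),
                       x_ske, m_ske.unsqueeze(2)], dim=2)  # (B, T, L, C)
        u = u.flatten(0, 1)                  # (B*T, L, C): frames as batch
        u = self.block(u)
        q = self.pool_q.expand(u.size(0), -1, -1)
        pooled, _ = self.pool(q, u, u)       # (B*T, 1, C) attention pooling
        z = self.classifier(pooled.squeeze(1)).view(B, T, -1)
        return z                             # (B, T, K) frame-level logits

# Tiny demo with reduced dimensions
ta = TemporalAggregation(dim=64, heads=4, num_ids=10)
z = ta(torch.randn(2, 3, 6, 64), torch.randn(2, 3, 64),
       torch.randn(2, 3, 4, 64), torch.randn(2, 3, 64))
```

At inference, the skeleton branch inputs would simply be omitted from the concatenation, leaving only the visual tokens and the distilled message.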

4. Mathematical Formulation

The key operations in SGTM can be summarized as:

  • Temporal encoding for both modalities:

$$\widetilde{M}^{vis} = f_{temp}(V),\quad \widetilde{M}^{ske} = f_{temp}(S)$$

  • Cross-attention distillation at each timestep:

$$\hat{m}^{vis}_t = \mathrm{softmax}\Bigl(\frac{(\widetilde{m}^{vis}_t W_Q)(\widetilde{m}^{ske}_t W_K)^\top}{\sqrt{d}}\Bigr)(\widetilde{m}^{ske}_t W_V)$$

  • Concatenation, temporal transformer, and attention pooling to produce $z_{i,t}$.

SGTM’s only novel objective is the frame-level cross-entropy,

$$\mathcal{L}_{Frame} = -\sum_{i=1}^{B}\sum_{t=1}^{T}\sum_{k=1}^{K} q_{i,t,k} \log p_{i,t,k}$$

where $p_{i,t,k} = \mathrm{softmax}_k(z_{i,t})$ and $q_{i,t,k}$ is the smoothed one-hot frame label. This term contributes to the total Stage 2 loss with weight $\lambda_2$:

$$\mathcal{L}_{stage2} = \mathcal{L}_{CE} + \mathcal{L}_{Triplet} + \lambda_1 \mathcal{L}_{CSIP} + \lambda_2 \mathcal{L}_{Frame}$$
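The frame-level loss can be sketched directly from the formula; the smoothing value and function name are our own assumptions (the paper does not state the smoothing coefficient here):

```python
import torch
import torch.nn.functional as F

def frame_level_ce(z, labels, smoothing=0.1):
    """Sketch of L_Frame: cross-entropy against smoothed one-hot labels.
    z: (B, T, K) frame logits; labels: (B,) identity indices, broadcast
    to every frame of the tracklet. Summed over i, t, k as in the text."""
    B, T, K = z.shape
    logp = F.log_softmax(z, dim=-1)                       # log p_{i,t,k}
    # Smoothed one-hot targets q_{i,t,k}
    q = torch.full((B, T, K), smoothing / (K - 1))
    q.scatter_(-1, labels.view(B, 1, 1).expand(B, T, 1), 1.0 - smoothing)
    return -(q * logp).sum()

# Tiny demo with reduced dimensions
loss = frame_level_ce(torch.randn(2, 3, 5), torch.tensor([1, 3]))
```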

5. Empirical Results and Ablation Analysis

Ablations adding SGTM on top of prototype fusion (but without prototype updates) show the following performance increments on two major video ReID benchmarks (Lin et al., 17 Nov 2025):

| Dataset | mAP Gain (%) | Rank-1 Gain (%) |
|---------|--------------------|--------------------|
| MARS    | +1.7 (88.4 → 90.1) | +1.1 (92.3 → 93.4) |
| LS-VID  | +0.6 (82.8 → 83.4) | +0.5 (91.0 → 91.5) |

The benefit observed confirms SGTM’s role in efficiently injecting fine-grained temporal structure from privileged skeleton motion into the video stream, yielding measurable improvements on standard metrics.

6. Key Hyper-parameters for Reproduction and Extension

The primary hyper-parameters necessary for reproducing or extending SGTM are:

  • Number of frames per tracklet $T$: 8
  • Visual patch tokens per frame $N_p$: e.g., 128 (a $16 \times 8$ patch grid for ViT-B/16), giving $1 + N_p = 129$ tokens per frame including the CLS token
  • Skeleton joints $J$: 17
  • Feature dimension $C$: 768 (aligned with ViT-B/16)
  • Temporal MHSA heads $h$: 12 (ViT default; 8 for a lightweight variant)
  • Transformer MLP hidden dimension: $4 \times C$
  • $\lambda_1$ (CSIP prototype loss weight): 1.0
  • $\lambda_2$ (SGTM frame loss weight): 1.3
  • Stage 1 contrastive temperature $\tau$: 0.07

These parameters ensure close alignment with the published CSIP-ReID experiments and are sufficient for efficient PyTorch implementation.
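For convenience, the list above can be collected into a single configuration dictionary; the key names are our own convention, while the values come from the text:

```python
# Illustrative SGTM hyper-parameter configuration mirroring Section 6.
sgtm_config = {
    "num_frames": 8,            # T: frames per tracklet
    "num_patches": 128,         # N_p: 16x8 grid for ViT-B/16 (+1 CLS -> 129 tokens)
    "num_joints": 17,           # J: skeleton joints
    "feat_dim": 768,            # C: ViT-B/16 feature dimension
    "mhsa_heads": 12,           # h: ViT default (8 for a lightweight variant)
    "mlp_hidden": 4 * 768,      # transformer MLP hidden dimension, 4*C
    "lambda_csip": 1.0,         # λ1: CSIP prototype loss weight
    "lambda_frame": 1.3,        # λ2: SGTM frame loss weight
    "contrastive_temp": 0.07,   # τ: Stage 1 contrastive temperature
}
```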

7. Significance within Multimodal Representation Learning

SGTM substantiates the impact of privileged skeleton-guided temporal reasoning for person ReID, offering an annotation-free, motion-aware alternative to video-text pretraining. Its architectural integration allows ViT-based encoders to gain expressivity in the temporal domain specifically via skeleton privilege distillation. A plausible implication is that SGTM’s LUPI-based design could generalize to other domains where motion or temporally privileged signals are available exclusively during training, catalyzing future research at the intersection of multimodal and temporal visual understanding (Lin et al., 17 Nov 2025).
