SGTM: Skeleton Guided Temporal Modeling
- SGTM is a module that injects motion-aware temporal cues from privileged skeleton data into video-based person ReID pipelines, enhancing temporal modeling in ViT-based encoders.
- It leverages a Learning Using Privileged Information approach by integrating Message Token Encoding, Auxiliary Temporal Distillation, and Temporal Aggregation to fuse skeleton and visual features.
- Empirical results demonstrate measurable gains, such as a 1.7% mAP improvement on the MARS benchmark, confirming its effectiveness for spatiotemporal representation learning.
Skeleton Guided Temporal Modeling (SGTM) is a module designed to inject motion-aware temporal cues, distilled from privileged skeleton data, into the visual representation learning process for video-based person re-identification (ReID). SGTM is a key component of the CSIP-ReID (Contrastive Skeleton-Image Pretraining for ReID) framework and follows the Learning Using Privileged Information (LUPI) paradigm, using skeleton sequences during training, but not inference, to bootstrap temporal modeling in transformer-based visual encoders. SGTM addresses the limited temporal modeling capacity of standard ViT-based visual pipelines, enabling the resulting models to robustly capture the spatiotemporal discriminative features critical to person ReID benchmarks (Lin et al., 17 Nov 2025).
1. Role and Motivation of Skeleton Guided Temporal Modeling
SGTM is introduced in Stage 2 of the CSIP-ReID pipeline, following the Prototype Fusion Updater (PFU). SGTM’s primary purpose is to remedy the lack of explicit temporal modeling in vanilla ViT-based visual backbones, which focus mainly on static spatial information. By employing skeleton sequences (obtained and projected to compatible dimensions) as an auxiliary supervision signal under LUPI, SGTM transfers motion patterns into the visual feature stream such that, during inference, temporal-aware representations are available even when only RGB video is present. This approach exploits the spatiotemporal structure of skeleton data, regarded as a privileged modality at training time, to endow the visual encoder with temporal sensitivity.
2. Input and Output Specifications
SGTM operates on synchronized visual and skeleton token sequences output by the preceding stages:
- Visual token sequence: $F \in \mathbb{R}^{T \times N \times D}$, where $T$ is the number of frames, $N$ is the number of ViT patches per frame, and $D$ is the feature dimensionality.
- Skeleton token sequence: $S \in \mathbb{R}^{T \times J \times D}$, with $J$ the number of skeleton joints (e.g., 17).
- Output: refined frame-level logits $\hat{Y} \in \mathbb{R}^{B \times T \times C}$, where $B$ is the batch size and $C$ is the number of person identities, used in a framewise cross-entropy training loss.
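The I/O contract above can be pinned down as tensor shapes. This is a toy-sized sketch (variable names and sizes are ours, not the paper's):

```python
import torch

# Toy sizes: B batch, T frames, N patches per frame, J joints, D channels, C identities.
B, T, N, J, D, C = 2, 8, 5, 17, 64, 10

vis_tokens = torch.randn(B, T, N, D)    # visual token sequence from the ViT encoder
skel_tokens = torch.randn(B, T, J, D)   # skeleton tokens, projected to dimension D
frame_logits = torch.empty(B, T, C)     # SGTM output: per-frame identity logits
```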
3. SGTM Architectural Design
SGTM comprises three sequential submodules: Message Token Encoding (MTE), Auxiliary Temporal Distillation (ATD), and Temporal Aggregation (TA).
3.1 Message Token Encoding (MTE)
For each frame $t$, SGTM pools the $N$ visual tokens into an average “message” vector $m_t^v$ and the $J$ skeleton tokens into $m_t^s$:

$$m_t^v = \frac{1}{N}\sum_{i=1}^{N} f_{t,i}, \qquad m_t^s = \frac{1}{J}\sum_{j=1}^{J} s_{t,j}.$$

Projecting and stacking over $T$ frames yields sequences $M^v = [m_1^v, \dots, m_T^v]$ and $M^s = [m_1^s, \dots, m_T^s]$. Each is processed independently by a lightweight temporal transformer (temporal multi-head self-attention, MHSA):

$$\tilde{M}^v = \mathrm{MHSA}(M^v W_Q^v,\, M^v W_K^v,\, M^v W_V^v), \qquad \tilde{M}^s = \mathrm{MHSA}(M^s W_Q^s,\, M^s W_K^s,\, M^s W_V^s),$$

where the $W$ matrices are learnable projections. This produces temporally encoded messages $\tilde{M}^v$ and $\tilde{M}^s$ for both modalities.
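MTE can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: class and variable names are ours, and the "lightweight temporal transformer" is approximated by a single `nn.MultiheadAttention` layer.

```python
import torch
import torch.nn as nn

class MTE(nn.Module):
    """Sketch of Message Token Encoding: pool tokens per frame, then temporal MHSA."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):              # tokens: (B, T, N, D)
        msg = tokens.mean(dim=2)            # per-frame average message: (B, T, D)
        enc, _ = self.attn(msg, msg, msg)   # self-attention over the T frames
        return enc                          # temporally encoded messages: (B, T, D)

mte = MTE()
vis = torch.randn(2, 8, 129, 768)   # visual tokens
skel = torch.randn(2, 8, 17, 768)   # skeleton tokens
# In practice each modality would use its own MTE instance (independent projections).
m_v, m_s = mte(vis), mte(skel)
```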
3.2 Auxiliary Temporal Distillation (ATD)
ATD performs cross-attention over the temporally encoded messages to distill privileged skeleton dynamics into visual features. At each frame $t$, the visual message queries the skeleton messages:

$$\hat{m}_t^v = \tilde{m}_t^v + \mathrm{CrossAttn}\big(\tilde{m}_t^v W_Q',\; \tilde{M}^s W_K',\; \tilde{M}^s W_V'\big).$$

Here, $W_Q', W_K', W_V'$ are cross-attention projections, and $\hat{m}_t^v \in \mathbb{R}^{D}$ is the distilled visual message. Additional “type embeddings” differentiate the two token types (visual vs. skeleton).
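A minimal ATD sketch follows. The names, the residual connection, and the way type embeddings are added are our assumptions for illustration; only the query-from-visual, key/value-from-skeleton cross-attention pattern is taken from the text.

```python
import torch
import torch.nn as nn

class ATD(nn.Module):
    """Sketch of Auxiliary Temporal Distillation via cross-attention."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.type_emb = nn.Parameter(torch.zeros(2, dim))  # visual / skeleton type embeddings
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, m_v, m_s):            # both: (B, T, D)
        q = m_v + self.type_emb[0]          # visual messages act as queries
        kv = m_s + self.type_emb[1]         # skeleton messages act as keys/values
        out, _ = self.cross(q, kv, kv)      # distill skeleton dynamics into the visual stream
        return m_v + out                    # residual keeps the original visual content

atd = ATD()
d_v = atd(torch.randn(2, 8, 768), torch.randn(2, 8, 768))  # distilled visual messages
```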
3.3 Temporal Aggregation (TA)
For each frame $t$, a unified token sequence $Z_t$ is created by concatenating:
- the per-frame visual patch tokens $F_t \in \mathbb{R}^{N \times D}$
- the distilled visual message $\hat{m}_t^v$
- the skeleton patch tokens $S_t \in \mathbb{R}^{J \times D}$
- the temporally encoded skeleton message $\tilde{m}_t^s$
Formally:

$$Z_t = \big[F_t;\; \hat{m}_t^v;\; S_t;\; \tilde{m}_t^s\big] \in \mathbb{R}^{(N+J+2) \times D},$$

with $L = N + J + 2$ tokens per frame. Aggregated over batches and frames, this yields $Z \in \mathbb{R}^{B \times T \times L \times D}$. A standard transformer block is applied along the time dimension to $Z$, followed by attention pooling to reduce the token dimension to one, resulting in per-frame logits $\hat{Y} \in \mathbb{R}^{B \times T \times C}$. Notably, skeleton-derived tokens are dropped at inference, so only the visual tokens $F_t$ and $\hat{m}_t^v$ are retained.
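A TA sketch under stated assumptions: we concatenate the four token groups per frame, run one `nn.TransformerEncoderLayer` along the time axis (one token position at a time), then attention-pool each frame's tokens to a single vector before a linear identity head. The exact pooling and head design are our choices, not the paper's; toy dimensions keep the example small.

```python
import torch
import torch.nn as nn

class TA(nn.Module):
    """Sketch of Temporal Aggregation: concat, temporal transformer, attention pooling."""
    def __init__(self, dim=768, heads=12, num_ids=625):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                                batch_first=True)
        self.pool_q = nn.Parameter(torch.randn(1, 1, dim))   # learnable pooling query
        self.pool = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_ids)                  # identity classifier

    def forward(self, vis, m_v, skel, m_s):  # (B,T,N,D), (B,T,D), (B,T,J,D), (B,T,D)
        z = torch.cat([vis, m_v.unsqueeze(2), skel, m_s.unsqueeze(2)], dim=2)
        B, T, L, D = z.shape                 # L = N + J + 2 tokens per frame
        # transformer block along the time axis, applied per token position
        zt = z.permute(0, 2, 1, 3).reshape(B * L, T, D)
        zt = self.block(zt).reshape(B, L, T, D).permute(0, 2, 1, 3)
        # attention-pool the L tokens of each frame down to a single vector
        kv = zt.reshape(B * T, L, D)
        q = self.pool_q.expand(B * T, -1, -1)
        frame, _ = self.pool(q, kv, kv)
        return self.head(frame.squeeze(1).reshape(B, T, D))  # per-frame logits (B, T, C)

ta = TA(dim=64, heads=4, num_ids=10)                 # toy sizes for illustration
logits = ta(torch.randn(2, 8, 5, 64), torch.randn(2, 8, 64),
            torch.randn(2, 8, 3, 64), torch.randn(2, 8, 64))
```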
4. Mathematical Formulation
The key operations in SGTM can be summarized as:
- Temporal encoding for both modalities: $\tilde{M}^v = \mathrm{MHSA}(M^v)$, $\tilde{M}^s = \mathrm{MHSA}(M^s)$
- Cross-attention distillation at each timestep: $\hat{m}_t^v = \tilde{m}_t^v + \mathrm{CrossAttn}(\tilde{m}_t^v, \tilde{M}^s, \tilde{M}^s)$
- Concatenation, temporal transformer, and attention pooling to produce $\hat{Y}$.
SGTM’s only novel objective is the frame-level cross-entropy,

$$\mathcal{L}_{\text{frame}} = -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\sum_{c=1}^{C} q_{b,c}\,\log p_{b,t,c},$$

where $p_{b,t,c} = \mathrm{softmax}(\hat{Y}_{b,t})_c$ and $q_{b,c}$ is the smoothed one-hot frame label. This term contributes to the total Stage 2 loss with weight $\lambda_{\text{frame}}$:

$$\mathcal{L}_{\text{Stage2}} = \lambda_{\text{proto}}\,\mathcal{L}_{\text{proto}} + \lambda_{\text{frame}}\,\mathcal{L}_{\text{frame}}.$$
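The frame-level objective is a one-liner in PyTorch once every frame is treated as a classification sample. A minimal sketch, assuming a smoothing factor of 0.1 (the exact value is not stated in this summary) and that all frames of a tracklet share its identity label:

```python
import torch
import torch.nn.functional as F

def frame_ce(logits, ids, smooth=0.1):
    """Frame-level cross-entropy with label smoothing over (B, T, C) logits."""
    B, T, C = logits.shape
    flat = logits.reshape(B * T, C)        # treat every frame as one sample
    labels = ids.repeat_interleave(T)      # a tracklet's ID covers all its frames
    return F.cross_entropy(flat, labels, label_smoothing=smooth)

loss = frame_ce(torch.randn(2, 8, 10), torch.tensor([3, 7]))
```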
5. Empirical Results and Ablation Analysis
Ablations adding SGTM on top of prototype fusion (without prototype updates) show the following performance increments on two major video ReID benchmarks (Lin et al., 17 Nov 2025):
| Dataset | mAP Gain (%) | Rank-1 Gain (%) |
|---|---|---|
| MARS | +1.7 (88.4→90.1) | +1.1 (92.3→93.4) |
| LS-VID | +0.6 (82.8→83.4) | +0.5 (91.0→91.5) |
These gains confirm SGTM’s role in efficiently injecting fine-grained temporal structure from privileged skeleton motion into the video stream, yielding measurable improvements on standard metrics.
6. Key Hyper-parameters for Reproduction and Extension
The primary hyper-parameters necessary for reproducing or extending SGTM are:
- Number of frames per tracklet $T$: 8
- Visual patch tokens per frame $N$: 1 ([CLS] token) plus the ViT-B/16 patch count at the training resolution
- Skeleton joints $J$: 17
- Feature dimension $D$: 768 (aligned with ViT-B/16)
- Temporal MHSA heads: 12 (ViT default; 8 for a lightweight variant)
- Transformer MLP hidden dimension: $4D$ (standard ViT expansion ratio)
- $\lambda_{\text{proto}}$ (CSIP prototype loss weight): 1.0
- $\lambda_{\text{frame}}$ (SGTM frame loss weight): 1.3
- Stage 1 contrastive temperature $\tau$: 0.07
These parameters ensure close alignment with the published CSIP-ReID experiments and suffice for an efficient PyTorch implementation.
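Collected into a plain config dict (the key names are ours), the hyper-parameters above look like this:

```python
# Hyper-parameters from the list above; key names are illustrative, not official.
SGTM_CONFIG = {
    "frames_per_tracklet": 8,      # T
    "skeleton_joints": 17,         # J
    "feature_dim": 768,            # D, aligned with ViT-B/16
    "mhsa_heads": 12,              # 8 for a lighter variant
    "lambda_proto": 1.0,           # CSIP prototype loss weight
    "lambda_frame": 1.3,           # SGTM frame loss weight
    "stage1_temperature": 0.07,    # contrastive temperature
}
```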
7. Significance within Multimodal Representation Learning
SGTM substantiates the impact of privileged skeleton-guided temporal reasoning for person ReID, offering an annotation-free, motion-aware alternative to video-text pretraining. Its architectural integration allows ViT-based encoders to gain expressivity in the temporal domain specifically via skeleton privilege distillation. A plausible implication is that SGTM’s LUPI-based design could generalize to other domains where motion or temporally privileged signals are available exclusively during training, catalyzing future research at the intersection of multimodal and temporal visual understanding (Lin et al., 17 Nov 2025).