Skeleton-Cache for Zero-Shot Action Recognition
- Skeleton-Cache is a training-free test-time adaptation framework that builds a dynamic, non-parametric cache of structured skeleton descriptors for unseen action classes.
- It integrates global, spatial, and temporal descriptors using semantic priors from large language models, enabling robust adaptation without gradient updates.
- Empirical results show significant performance improvements across datasets, confirming the practical benefits of dynamic cache retrieval and LLM-guided fusion.
Skeleton-Cache is a training-free test-time adaptation framework designed to improve the generalization of skeleton-based zero-shot action recognition (SZAR) models to unseen action classes during inference. It reframes the inference process as a non-parametric, retrieval-based classification over a dynamically constructed cache of structured skeleton descriptors, combining both global and fine-grained local representations. Skeleton-Cache uniquely integrates semantic priors from LLMs to guide the fusion of descriptor-wise predictions, enabling dynamic adaptation to unseen action distributions without additional training or access to original training data (Zhu et al., 12 Dec 2025).
1. Motivation and Problem Setting
Skeleton-based zero-shot action recognition seeks to classify actions represented as skeleton sequences, where test-time classes are disjoint from those available during training. Conventional test-time adaptation methods either require gradient-based updates to the model—which introduces risk of overfitting and inference latency—or rely solely on a holistic skeleton embedding, neglecting spatial and temporal details critical for action discrimination. Skeleton-Cache addresses these challenges through two core innovations: (a) a lightweight non-parametric cache that stores structured descriptors dynamically collected from confident test exemplars, and (b) per-class, per-descriptor fusion weights derived from LLMs, which ensure semantically meaningful adaptation to new action categories.
2. Cache Architecture and Descriptor Representation
The Skeleton-Cache comprises $|\mathcal{C}_u|$ class-specific blocks, each capable of holding up to $L$ entries, where $\mathcal{C}_u$ is the set of unseen classes. Each cache entry is a triplet $(\mathbf{k}, y, c)$:
- $\mathbf{k}$: Cache key, a concatenation of $M = 8$ descriptors, each of dimension $d$, representing global, spatial (body-part), and temporal (motion-phase) features.
- $y$: One-hot class label.
- $c$: Prediction confidence, scored as negative entropy.
Descriptors are extracted from the frozen SZAR backbone. Given an input sequence $x$, the backbone produces a feature tensor $F \in \mathbb{R}^{T \times V \times d}$ ($T$ frames, $V$ joints, $d$ channels), from which:
- Global descriptor: $f^{g} = \mathrm{Pool}(F)$, pooled over all joints and frames.
- Local spatial descriptors: $f^{s}_{p} = \mathrm{Pool}(F[:, \mathcal{J}_p, :])$ for body-part groups $\mathcal{J}_p$, $p = 1, \dots, 4$ (Head, Torso, Arms, Legs).
- Local temporal descriptors: $f^{t}_{q} = \mathrm{Pool}(F[\mathcal{T}_q, :, :])$ for motion phases $\mathcal{T}_q$, $q = 1, 2, 3$ (Begin, Middle, End).
These are concatenated to form the cache key: $\mathbf{k} = [f^{g};\, f^{s}_{1}; \dots; f^{s}_{4};\, f^{t}_{1}; \dots; f^{t}_{3}]$.
The cache is constructed online during testing, with only high-confidence predictions used as exemplars. There are no gradient updates or access to training data; cache updates and lookups are computationally efficient.
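The descriptor extraction above can be sketched in NumPy. The joint groupings, phase splits, and tensor shapes below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical shapes: T frames, V joints, d channels from a frozen backbone.
T, V, d = 30, 25, 64
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, V, d))  # stand-in for real backbone features

# Illustrative body-part joint groups (the actual index sets are skeleton-specific).
part_groups = {
    "Head":  list(range(0, 5)),
    "Torso": list(range(5, 10)),
    "Arms":  list(range(10, 18)),
    "Legs":  list(range(18, 25)),
}
# Three equal temporal phases: Begin, Middle, End.
phase_slices = [slice(0, T // 3), slice(T // 3, 2 * T // 3), slice(2 * T // 3, T)]

global_desc = feats.mean(axis=(0, 1))                               # (d,)
spatial = [feats[:, idx, :].mean(axis=(0, 1)) for idx in part_groups.values()]
temporal = [feats[s].mean(axis=(0, 1)) for s in phase_slices]

# Cache key: concatenation of all 8 descriptors (1 global + 4 spatial + 3 temporal).
key = np.concatenate([global_desc, *spatial, *temporal])
```

Average pooling is used here for simplicity; any permutation-invariant pooling over the selected joints/frames would fit the same structure.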
3. Retrieval and Fusion Mechanism
Test-time adaptation consists of querying the cache and fusing predictions:
- For each incoming sample $x$, the query descriptors $q^{(1)}, \dots, q^{(M)}$ are computed as described above.
- Affinity scores between a query descriptor and the cached descriptors of the same type are computed using cosine similarity and an exponential temperature kernel: $A^{(m)}_{c,l} = \exp\bigl(-\beta\,(1 - \cos(q^{(m)}, \mathbf{k}^{(m)}_{c,l}))\bigr)$, where $\mathbf{k}^{(m)}_{c,l}$ is the $m$-th descriptor for class $c$, slot $l$, and $\beta$ is a temperature hyperparameter.
- The descriptor-wise logit vector for each type $m$ is computed as $z^{(m)} = A^{(m)} Y$, where $Y$ is the label matrix of cached entries.
- Aggregating over all descriptor types yields the stacked logits $Z = \{z^{(m)}\}_{m=1}^{M}$.
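The affinity and aggregation steps can be sketched as a minimal NumPy function; `descriptor_logits` is a hypothetical helper name, assuming the cosine-plus-exponential kernel described above:

```python
import numpy as np

def descriptor_logits(q, keys, labels, beta=5.0):
    """Affinity-weighted label aggregation for one descriptor type.

    q:      (d,)  query descriptor
    keys:   (N, d) cached descriptors of the same type
    labels: (N, C) one-hot labels of the cached entries
    Returns (C,) descriptor-wise logits.
    """
    qn = q / np.linalg.norm(q)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    cos = kn @ qn                           # (N,) cosine similarities
    affinity = np.exp(-beta * (1.0 - cos))  # exponential temperature kernel
    return affinity @ labels                # soft vote over cached labels
```

A larger `beta` sharpens retrieval toward the nearest cached exemplars, while a smaller one flattens it toward a cache-wide average.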
4. Semantic Priors via LLMs
Prior knowledge about the importance of global, spatial, and temporal descriptors for each action class is extracted using GPT-4, with a single prompt per class. The LLM is tasked to output:
- "spatial": four nonnegative weights (over [Head, Torso, Arms, Legs]) summing to 1,
- "temporal": three nonnegative weights (over [Begin, Middle, End]) summing to 1,
- "gamma": $\gamma_c \in [0, 1]$, the global/local trade-off.
The resulting per-class weight vector is $w_c = [\gamma_c,\ (1-\gamma_c)\,w^{s}_c,\ (1-\gamma_c)\,w^{t}_c] \in \mathbb{R}^{8}$, which is then $\ell_1$-normalized to $\bar{w}_c$. These semantic priors are statically precomputed per class for all unseen actions.
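Assembling the fusion weights from an LLM response might look like the sketch below; the JSON values are invented for illustration (e.g. an arm-dominant action), not taken from the paper:

```python
import numpy as np

# Example LLM output for one class; the numbers are illustrative assumptions.
llm_prior = {
    "spatial":  [0.05, 0.15, 0.60, 0.20],  # Head, Torso, Arms, Legs (sums to 1)
    "temporal": [0.20, 0.50, 0.30],        # Begin, Middle, End (sums to 1)
    "gamma":    0.40,                      # global vs. local trade-off in [0, 1]
}

g = llm_prior["gamma"]
# Order matches the cache key: [global, 4 spatial parts, 3 temporal phases].
w = np.array([g]
             + [(1 - g) * s for s in llm_prior["spatial"]]
             + [(1 - g) * t for t in llm_prior["temporal"]])
w_bar = w / w.sum()   # l1-normalize to obtain the per-class fusion weights
```

Since the spatial and temporal blocks each sum to 1, the unnormalized vector sums to $2 - \gamma_c$, so normalization only rescales the relative balance set by the LLM.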
5. End-to-End Inference and Adaptation Protocol
For a test sample, the full protocol can be summarized as:
- Extract the skeleton descriptors $q^{(1)}, \dots, q^{(M)}$.
- Compute the original SZAR logits $z_0$ and prediction entropy $H$.
- Update the cache block of the predicted class if its slot count is below $L$ or the new entry is higher confidence (lower $H$) than the worst stored entry.
- For each descriptor $m$, compute the affinity $A^{(m)}$ with all cache entries and the descriptor-wise logits $z^{(m)}$.
- Stack the descriptor logits as $Z = [z^{(1)}; \dots; z^{(M)}]$.
- Fuse using class-specific weights: $z_{\text{cache}}[c] = \sum_{m} \bar{w}_{c,m}\, z^{(m)}[c]$.
- Compute the final logit: $z = z_0 + \alpha\, z_{\text{cache}}$, where $\alpha$ is a fusion coefficient.
- Predict $\hat{y} = \arg\max_c z[c]$.
Lightly annotated pseudocode encapsulates this process, emphasizing the non-parametric, update-on-the-fly memory and the descriptor-weighted fusion.
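The protocol above can be sketched as follows; `update_cache` and `adapted_prediction` are hypothetical helper names, and all dimensions are toy values:

```python
import numpy as np

M, C, L = 8, 5, 3   # descriptors, unseen classes, cache slots per class (toy values)

def update_cache(cache, cls, key, H, slots=L):
    """Insert (key, H) into the predicted class's block when a slot is free,
    or evict the least confident entry (highest entropy H) if the new one beats it."""
    block = cache.setdefault(cls, [])
    if len(block) < slots:
        block.append((key, H))
    else:
        worst = max(range(slots), key=lambda i: block[i][1])
        if H < block[worst][1]:
            block[worst] = (key, H)

def adapted_prediction(z0, Z, W, alpha=1.0):
    """z0: (C,) backbone logits; Z: (M, C) descriptor-wise cache logits;
    W: (C, M) l1-normalized per-class fusion weights; alpha: fusion coefficient."""
    z_cache = np.einsum('cm,mc->c', W, Z)   # class-specific weighted fusion
    z = z0 + alpha * z_cache
    return int(np.argmax(z)), z
```

Note that the whole loop is gradient-free: adaptation happens only through cache writes and the weighted retrieval at prediction time.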
6. Empirical Results and Evaluations
Skeleton-Cache has been evaluated across three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II. Two protocols are reported: zero-shot learning (ZSL), which tests on unseen classes only, and generalized zero-shot learning (GZSL), which tests on both seen and unseen classes and reports the harmonic mean (H) of their accuracies.
Main Results (selected):
| Backbone | ZSL↑ | ↑ vs base | GZSL H↑ | ↑ vs base |
|---|---|---|---|---|
| PURLS (55/5) | 79.22% | — | — | — |
| PURLS + Skeleton-Cache | 85.46% | +6.24pp | — | — |
| SA-DVAE (55/5) | 82.37% | — | 66.27% | — |
| SA-DVAE + Skeleton-Cache | 89.41% | +7.04pp | 71.15% | +4.88pp |

| Method | NTU60 ZSL↑ | GZSL H↑ | NTU120 ZSL↑ | GZSL H↑ | PKU-MMD ZSL↑ | GZSL H↑ |
|---|---|---|---|---|---|---|
| SA-DVAE | 84.20% | 75.27% | 50.67% | 47.54% | 66.54% | 54.72% |
| SA-DVAE + Skeleton-Cache | 89.86% (+5.66pp) | 80.21% (+4.94pp) | 56.18% (+5.51pp) | 51.94% (+4.40pp) | 71.05% (+4.51pp) | 58.49% (+3.77pp) |
Ablation studies confirm gains arise primarily from structured descriptor decompositions and LLM-guided fusion:
- LLM-guided weights outperform random or uniform weighting by 1.8–2.8 percentage points.
- Using all (global + spatial + temporal) descriptors yields a further gain of several percentage points over global-only fusion.
- Optimal hyperparameters were identified for the per-class cache size $L$, the similarity temperature $\beta$, and the fusion coefficient $\alpha$.
7. Analysis, Limitations, and Open Directions
Skeleton-Cache demonstrates that structured, non-parametric caches provide a readily adaptable auxiliary memory for test-time generalization in skeleton-based action recognition. The integration of LLM semantic priors improves interpretability and performance by aligning the decision process with human commonsense about the relevance of body parts and motion phases to specific actions.
Limitations include dependence on a high-quality, frozen SZAR backbone: poor encodings under severe occlusion or domain shift can degrade cache utility. While the current cache memory and class-specific prior strategies are effective for moderate vocabularies, extreme scaling may raise memory-efficiency concerns. LLM-derived weights are presently static per class, whereas compound or rare actions might benefit from contextually dynamic weighting or multimodal sources.
In summary, Skeleton-Cache establishes a fully training-free, non-parametric test-time adaptation paradigm that (1) decomposes skeleton inference into descriptor-based retrieval, (2) fuses predictions using LLM-extracted semantic priors, and (3) achieves consistent SZAR improvements across datasets and models with modest overhead (Zhu et al., 12 Dec 2025).