Skeleton-Cache for Zero-Shot Action Recognition
- Skeleton-Cache is a training-free test-time adaptation framework that builds a dynamic, non-parametric cache of structured skeleton descriptors for unseen action classes.
- It integrates global, spatial, and temporal descriptors using semantic priors from large language models, enabling robust adaptation without gradient updates.
- Empirical results show significant performance improvements across datasets, confirming the practical benefits of dynamic cache retrieval and LLM-guided fusion.
Skeleton-Cache is a training-free test-time adaptation framework designed to improve the generalization of skeleton-based zero-shot action recognition (SZAR) models to unseen action classes during inference. It reframes the inference process as a non-parametric, retrieval-based classification over a dynamically constructed cache of structured skeleton descriptors, combining both global and fine-grained local representations. Skeleton-Cache uniquely integrates semantic priors from LLMs to guide the fusion of descriptor-wise predictions, enabling dynamic adaptation to unseen action distributions without additional training or access to original training data (Zhu et al., 12 Dec 2025).
1. Motivation and Problem Setting
Skeleton-based zero-shot action recognition seeks to classify actions represented as skeleton sequences, where test-time classes are disjoint from those available during training. Conventional test-time adaptation methods either require gradient-based updates to the model—which introduces risk of overfitting and inference latency—or rely solely on a holistic skeleton embedding, neglecting spatial and temporal details critical for action discrimination. Skeleton-Cache addresses these challenges through two core innovations: (a) a lightweight non-parametric cache that stores structured descriptors dynamically collected from confident test exemplars, and (b) per-class, per-descriptor fusion weights derived from LLMs, which ensure semantically meaningful adaptation to new action categories.
2. Cache Architecture and Descriptor Representation
The Skeleton-Cache comprises $|\mathcal{C}_u|$ class-specific blocks, each capable of holding up to $L$ entries, where $\mathcal{C}_u$ is the set of unseen classes. Each cache entry is a triplet $(\mathbf{k}, y, c)$:
- $\mathbf{k}$: Cache key, a concatenation of $M = 8$ descriptors, each of dimension $d$, representing global, spatial (body-part), and temporal (motion-phase) features.
- $y$: One-hot class label.
- $c$: Prediction confidence, scored as negative entropy.
Descriptors are extracted from the frozen SZAR backbone. Given an input sequence $x$, the backbone produces a feature tensor $F \in \mathbb{R}^{T \times V \times d}$ ($T$ frames, $V$ joints, $d$ channels), from which:
- Global descriptor: $f^{g} = \mathrm{Pool}(F)$, pooled over all joints and frames.
- Local spatial descriptors: $f^{s}_{p} = \mathrm{Pool}(F[:, \mathcal{J}_p, :])$ for body-part groups $\mathcal{J}_p$, $p = 1, \dots, 4$ (Head, Torso, Arms, Legs).
- Local temporal descriptors: $f^{t}_{q} = \mathrm{Pool}(F[\mathcal{T}_q, :, :])$ for motion phases $\mathcal{T}_q$, $q = 1, 2, 3$ (Begin, Middle, End).
These are concatenated to form the cache key: $\mathbf{k} = [f^{g};\, f^{s}_{1}; \dots; f^{s}_{4};\, f^{t}_{1}; \dots; f^{t}_{3}]$.
The cache is constructed online during testing, with only high-confidence predictions used as exemplars. There are no gradient updates or access to training data; cache updates and lookups are computationally efficient.
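The descriptor extraction above can be sketched in NumPy. The joint groupings, phase splits, and tensor shapes below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Hypothetical shapes: T frames, V joints, d channels from a frozen backbone.
T, V, d = 30, 25, 64
rng = np.random.default_rng(0)
feats = rng.standard_normal((T, V, d))  # stand-in for real backbone features

# Illustrative body-part joint groups (the actual index sets are skeleton-specific).
part_groups = {
    "Head":  list(range(0, 5)),
    "Torso": list(range(5, 10)),
    "Arms":  list(range(10, 18)),
    "Legs":  list(range(18, 25)),
}
# Three equal temporal phases: Begin, Middle, End.
phase_slices = [slice(0, T // 3), slice(T // 3, 2 * T // 3), slice(2 * T // 3, T)]

global_desc = feats.mean(axis=(0, 1))                               # (d,)
spatial = [feats[:, idx, :].mean(axis=(0, 1)) for idx in part_groups.values()]
temporal = [feats[s].mean(axis=(0, 1)) for s in phase_slices]

# Cache key: concatenation of all 8 descriptors (1 global + 4 spatial + 3 temporal).
key = np.concatenate([global_desc, *spatial, *temporal])
```

Average pooling is used here for simplicity; any permutation-invariant pooling over the selected joints/frames would fit the same structure.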
3. Retrieval and Fusion Mechanism
Test-time adaptation consists of querying the cache and fusing predictions:
- For each incoming sample $x$, the query descriptors $q^{(1)}, \dots, q^{(M)}$ are computed as described above.
- Affinity scores between a query descriptor and the cached descriptors of the same type are computed using cosine similarity and an exponential temperature kernel: $A^{(m)}_{c,l} = \exp\bigl(-\beta\,(1 - \cos(q^{(m)}, \mathbf{k}^{(m)}_{c,l}))\bigr)$, where $\mathbf{k}^{(m)}_{c,l}$ is the $m$-th descriptor for class $c$, slot $l$, and $\beta$ is a temperature hyperparameter.
- The descriptor-wise logit vector for each type $m$ is computed as $z^{(m)} = A^{(m)} Y$, where $Y$ is the label matrix of cached entries.
- Aggregating over all descriptor types yields the stacked logits $Z = \{z^{(m)}\}_{m=1}^{M}$.
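The affinity and aggregation steps can be sketched as a minimal NumPy function; `descriptor_logits` is a hypothetical helper name, assuming the cosine-plus-exponential kernel described above:

```python
import numpy as np

def descriptor_logits(q, keys, labels, beta=5.0):
    """Affinity-weighted label aggregation for one descriptor type.

    q:      (d,)  query descriptor
    keys:   (N, d) cached descriptors of the same type
    labels: (N, C) one-hot labels of the cached entries
    Returns (C,) descriptor-wise logits.
    """
    qn = q / np.linalg.norm(q)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    cos = kn @ qn                           # (N,) cosine similarities
    affinity = np.exp(-beta * (1.0 - cos))  # exponential temperature kernel
    return affinity @ labels                # soft vote over cached labels
```

A larger `beta` sharpens retrieval toward the nearest cached exemplars, while a smaller one flattens it toward a cache-wide average.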
4. Semantic Priors via LLMs
Prior knowledge about the importance of global, spatial, and temporal descriptors for each action class is extracted using GPT-4, with a single prompt per class. The LLM is tasked to output:
- "spatial": four nonnegative weights (over [Head, Torso, Arms, Legs]) summing to 1,
- "temporal": three nonnegative weights (over [Begin, Middle, End]) summing to 1,
- "gamma": $\gamma_c \in [0, 1]$, the global/local trade-off.
The resulting per-class weight vector is $w_c = [\gamma_c,\ (1-\gamma_c)\,w^{s}_c,\ (1-\gamma_c)\,w^{t}_c] \in \mathbb{R}^{8}$, which is then $\ell_1$-normalized to $\bar{w}_c$. These semantic priors are statically precomputed per class for all unseen actions.
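Assembling the fusion weights from an LLM response might look like the sketch below; the JSON values are invented for illustration (e.g. an arm-dominant action), not taken from the paper:

```python
import numpy as np

# Example LLM output for one class; the numbers are illustrative assumptions.
llm_prior = {
    "spatial":  [0.05, 0.15, 0.60, 0.20],  # Head, Torso, Arms, Legs (sums to 1)
    "temporal": [0.20, 0.50, 0.30],        # Begin, Middle, End (sums to 1)
    "gamma":    0.40,                      # global vs. local trade-off in [0, 1]
}

g = llm_prior["gamma"]
# Order matches the cache key: [global, 4 spatial parts, 3 temporal phases].
w = np.array([g]
             + [(1 - g) * s for s in llm_prior["spatial"]]
             + [(1 - g) * t for t in llm_prior["temporal"]])
w_bar = w / w.sum()   # l1-normalize to obtain the per-class fusion weights
```

Since the spatial and temporal blocks each sum to 1, the unnormalized vector sums to $2 - \gamma_c$, so normalization only rescales the relative balance set by the LLM.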
5. End-to-End Inference and Adaptation Protocol
For a test sample, the full protocol can be summarized as:
- Extract the skeleton descriptors $q^{(1)}, \dots, q^{(M)}$.
- Compute the original SZAR logits $z_0$ and prediction entropy $H$.
- Update the cache block of the predicted class if its slot count is below $L$ or the new entry is higher confidence (lower $H$) than the worst stored entry.
- For each descriptor $m$, compute the affinity $A^{(m)}$ with all cache entries and the descriptor-wise logits $z^{(m)}$.
- Stack the descriptor logits as $Z = [z^{(1)}; \dots; z^{(M)}]$.
- Fuse using class-specific weights: $z_{\text{cache}}[c] = \sum_{m} \bar{w}_{c,m}\, z^{(m)}[c]$.
- Compute the final logit: $z = z_0 + \alpha\, z_{\text{cache}}$, where $\alpha$ is a fusion coefficient.
- Predict $\hat{y} = \arg\max_c z[c]$.
Lightly annotated pseudocode encapsulates this process, emphasizing the non-parametric, update-on-the-fly memory and the descriptor-weighted fusion.
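The protocol above can be sketched as follows; `update_cache` and `adapted_prediction` are hypothetical helper names, and all dimensions are toy values:

```python
import numpy as np

M, C, L = 8, 5, 3   # descriptors, unseen classes, cache slots per class (toy values)

def update_cache(cache, cls, key, H, slots=L):
    """Insert (key, H) into the predicted class's block when a slot is free,
    or evict the least confident entry (highest entropy H) if the new one beats it."""
    block = cache.setdefault(cls, [])
    if len(block) < slots:
        block.append((key, H))
    else:
        worst = max(range(slots), key=lambda i: block[i][1])
        if H < block[worst][1]:
            block[worst] = (key, H)

def adapted_prediction(z0, Z, W, alpha=1.0):
    """z0: (C,) backbone logits; Z: (M, C) descriptor-wise cache logits;
    W: (C, M) l1-normalized per-class fusion weights; alpha: fusion coefficient."""
    z_cache = np.einsum('cm,mc->c', W, Z)   # class-specific weighted fusion
    z = z0 + alpha * z_cache
    return int(np.argmax(z)), z
```

Note that the whole loop is gradient-free: adaptation happens only through cache writes and the weighted retrieval at prediction time.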
6. Empirical Results and Evaluations
Skeleton-Cache has been evaluated across three large-scale datasets: NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II. Two protocols are reported: zero-shot learning (ZSL), which tests on unseen classes only, and generalized zero-shot learning (GZSL), which tests on both seen and unseen classes and reports the harmonic mean (H) of their accuracies.
Main Results (selected):
| Backbone | ZSL↑ | ↑ vs base | GZSL H↑ | ↑ vs base |
|---|---|---|---|---|
| PURLS (55/5) | 79.22% | — | — | — |
| PURLS + Skeleton-Cache | 85.46% | +6.24pp | — | — |
| SA-DVAE (55/5) | 82.37% | — | 66.27% | — |
| SA-DVAE + Skeleton-Cache | 89.41% | +7.04pp | 71.15% | +4.88pp |

| Method | NTU60 ZSL↑ | GZSL H↑ | NTU120 ZSL↑ | GZSL H↑ | PKU-MMD ZSL↑ | GZSL H↑ |
|---|---|---|---|---|---|---|
| SA-DVAE | 84.20% | 75.27% | 50.67% | 47.54% | 66.54% | 54.72% |
| SA-DVAE + Skeleton-Cache | 89.86% (+5.66pp) | 80.21% (+4.94pp) | 56.18% (+5.51pp) | 51.94% (+4.40pp) | 71.05% (+4.51pp) | 58.49% (+3.77pp) |
Ablation studies confirm gains arise primarily from structured descriptor decompositions and LLM-guided fusion:
- LLM-guided weights outperform random or uniform weighting by 1.8–2.8 percentage points.
- Using all (global + spatial + temporal) descriptors yields a further gain of several percentage points over global-only fusion.
- Optimal hyperparameters were identified for the per-class cache size $L$, the similarity temperature $\beta$, and the fusion coefficient $\alpha$.
7. Analysis, Limitations, and Open Directions
Skeleton-Cache demonstrates that structured, non-parametric caches provide a readily adaptable auxiliary memory for test-time generalization in skeleton-based action recognition. The integration of LLM semantic priors improves interpretability and performance by aligning the decision process with human commonsense about the relevance of body parts and motion phases to specific actions.
Limitations include dependence on a high-quality, frozen SZAR backbone: poor encodings under severe occlusion or domain shift can degrade cache utility. While the current cache memory and class-specific prior strategies are effective for moderate vocabularies, extreme scaling may raise memory-efficiency concerns. LLM-derived weights are presently static per class, whereas compound or rare actions might benefit from contextually dynamic weighting or multimodal sources.
In summary, Skeleton-Cache establishes a fully training-free, non-parametric test-time adaptation paradigm that (1) decomposes skeleton inference into descriptor-based retrieval, (2) fuses predictions using LLM-extracted semantic priors, and (3) achieves consistent SZAR improvements across datasets and models with modest overhead (Zhu et al., 12 Dec 2025).