SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition

Published 31 Mar 2026 in cs.CV | (2603.29692v1)

Abstract: Zero-shot skeleton-based action recognition aims to recognize unseen actions by transferring knowledge from seen categories through semantic descriptions. Most existing methods typically align skeleton features with textual embeddings within a shared latent space. However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. Specifically, we introduce a Cross-Modal Context Prompt Module, which leverages a pretrained LLM to reconstruct masked contextual prompts under guidance derived from LLMs. This design effectively transfers linguistic context to the skeleton encoder for instance-level semantic grounding and improved cross-modal alignment. In addition, a Key-Part Decoupling Module is incorporated to decouple motion-relevant joint features, ensuring robust action understanding even in the absence of explicit object interactions. Extensive experiments on multiple benchmarks demonstrate that SkeletonContext achieves state-of-the-art performance under both conventional and generalized zero-shot settings, validating its effectiveness in reasoning about context and distinguishing fine-grained, visually similar actions.


Authors (8)

Summary

The paper introduces a novel context-aware framework that leverages language-driven cues to bridge the semantic gap in zero-shot skeleton-based action recognition.
It integrates a cross-modal context prompt module and a key-part decoupling module to enhance fine-grained motion discrimination and robust action understanding.
Experimental results on NTU and PKU benchmarks show significant gains, with ZSL accuracies up to 89.6% and balanced performance under GZSL protocols.

SkeletonContext: Context-Aware Zero-Shot Skeleton-Based Action Recognition

Motivation and Novelty

The SkeletonContext framework addresses the persistent semantic gap in zero-shot skeleton-based action recognition (ZSSAR). Traditional methods align skeletal features directly with textual descriptions but fail to account for the contextual cues required to distinguish visually similar actions. ZSSAR requires generalizing from source (seen) categories to target (unseen) categories using only semantic descriptions. Because skeleton representations encode only human joint coordinates, the absence of contextual information, such as the objects and environments involved in an interaction, limits their discriminability.

SkeletonContext enriches skeletal motion features with language-driven contextual semantics by leveraging pretrained LLMs. The design reconstructs masked context prompts (objects, environment, target) and injects the recovered context directly into the skeleton encoding pipeline to improve cross-modal alignment. Additionally, the Key-Part Decoupling (KPD) module supports robust understanding of context-independent actions by explicitly disentangling motion-relevant joint features.

Figure 1: Comparison between conventional ZSSAR approaches and SkeletonContext, illustrating context injection into skeleton representations for improved discrimination of similar actions.

Framework Overview

SkeletonContext is instantiated with two core modules:

Cross-Modal Context Prompt Module (CCPM): This module operates atop a pretrained LLM and a skeleton encoder, reconstructing masked context slots in semantic prompts under LLM guidance. Skeleton features are mapped into the latent language space via differential joint encoding and cross-modal attention, infusing contextual semantics into skeleton representations.
Key-Part Decoupling (KPD) Module: KPD learns to disentangle and emphasize motion-critical joints using importance maps driven by language priors, enhancing discriminative power for actions defined by localized motion.
Figure 2: SkeletonContext pipeline combining CCPM and KPD, with LLM-driven context guidance enabling precise cross-modal semantic grounding and alignment.
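
For orientation, the shared-latent-space alignment that ZSSAR methods rely on can be illustrated with a nearest-text-embedding classifier. The sketch below is a generic zero-shot inference step, not SkeletonContext's actual implementation; the function names, the toy two-dimensional embeddings, and the use of cosine similarity are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_unseen(skeleton_feature, class_text_embeddings):
    """Return the unseen class whose text embedding is most similar to the feature."""
    return max(
        class_text_embeddings,
        key=lambda name: cosine(skeleton_feature, class_text_embeddings[name]),
    )

# Hypothetical toy embeddings; real ones would come from the text and skeleton encoders.
unseen_classes = {"writing": [1.0, 0.0], "typing": [0.0, 1.0]}
print(predict_unseen([0.9, 0.1], unseen_classes))
```

When two classes have nearly identical text-side neighbours for a given motion, this step alone cannot separate them, which is the ambiguity that context injection targets.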

Context Description Generation

Context descriptions are structured via prompt engineering for LLMs to obtain compositional semantics associated with each action, e.g., "In office, hand uses pen to write on paper." Multiple context descriptions are generated for each action to expand intra-class diversity and facilitate contextual knowledge transfer between categories sharing similar semantics.

The CCPM reconstructs context prompts at progressively harder masking ratios using the Progressive Partial Masking (PPM) curriculum, enabling stable optimization and gradual semantic induction.
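
A masking curriculum of this kind can be sketched as follows. This summary does not give the actual PPM schedule, so the linear ramp, the start and end ratios, the `[MASK]` token, and the helper names are assumptions; only the slot types (environment, object, target) come from the text.

```python
import random

def masking_ratio(epoch, total_epochs, start=0.25, end=1.0):
    """Assumed linear schedule: the fraction of context slots to mask grows with the epoch."""
    if total_epochs <= 1:
        return end
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + (end - start) * progress

def mask_context(slots, ratio, rng, mask_token="[MASK]"):
    """Replace round(ratio * len(slots)) randomly chosen slot values (at least one) with mask_token."""
    n_mask = max(1, round(ratio * len(slots)))
    masked_keys = set(rng.sample(sorted(slots), n_mask))
    return {k: (mask_token if k in masked_keys else v) for k, v in slots.items()}

# Slots taken from the example description "In office, hand uses pen to write on paper."
slots = {"environment": "office", "object": "pen", "target": "paper"}
rng = random.Random(0)
for epoch in (0, 5, 9):
    ratio = masking_ratio(epoch, total_epochs=10)
    print(epoch, round(ratio, 2), mask_context(slots, ratio, rng))
```

Early epochs hide a single slot, so the model can lean on the remaining context; by the final epoch every slot is masked and the context must be inferred from the skeleton alone.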

Differential Joint Encoding and Semantic Grounding

Skeleton representations are enriched through the Differential Joint Encoder (DJE), which explicitly models fine-grained spatial dependencies. Inter-joint differences are used to infer likely context, and the result is refined through cross-attention with language tokens. The semantic context grounding (SCG) objective, implemented as the context reconstruction loss $\mathcal{L}_{ccr}$, enforces plausible context generation guided jointly by skeleton motion dynamics and LLM priors.
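
The idea of inter-joint differences can be made concrete with a small sketch. It computes only raw pairwise coordinate differences for one frame; the actual DJE architecture is not specified in this summary, so the function name, the data layout, and the toy joints are assumptions.

```python
def pairwise_joint_differences(frame):
    """Map each joint pair (i, j) with i < j to the coordinate difference joint_i - joint_j.

    `frame` is a list of (x, y, z) tuples, one per joint.
    """
    diffs = {}
    for i in range(len(frame)):
        for j in range(i + 1, len(frame)):
            diffs[(i, j)] = tuple(a - b for a, b in zip(frame[i], frame[j]))
    return diffs

# Hypothetical three-joint frame: head, right hand, left hand.
frame = [(0.0, 1.5, 0.0), (0.5, 1.0, 0.25), (-0.5, 1.0, 0.25)]
for pair, delta in pairwise_joint_differences(frame).items():
    print(pair, delta)
```

Relative offsets such as hand-to-head or hand-to-hand are invariant to where the person stands, which is one reason difference features are a natural input for inferring what the hands might be interacting with.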

Key-Part Decoupling: Motion-Critical Focus

KPD predicts joint-importance maps using action-specific language priors, calibrated via a key-part decoupling loss ($\mathcal{L}_{kpd}$). This module robustly highlights informative body regions for each action, generalizing to unseen classes sharing structural motion patterns.
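
One simple way to apply a joint-importance map is importance-weighted pooling of per-joint features. The sketch below assumes a softmax over per-joint scores followed by a weighted sum; how KPD actually produces and applies its maps is not detailed in this summary, and the names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def key_part_pool(joint_features, importance_scores):
    """Pool per-joint feature vectors into one vector, weighted by softmaxed importance."""
    weights = softmax(importance_scores)
    dim = len(joint_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, joint_features))
        for d in range(dim)
    ]
    return pooled, weights

# Hypothetical example: three joints with 2-D features; the first joint is scored highest.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
pooled, weights = key_part_pool(features, [2.0, 0.0, 0.0])
print([round(w, 3) for w in weights], [round(p, 3) for p in pooled])
```

With uniform scores this reduces to plain average pooling, so the importance map only changes the representation when some joints are scored as more relevant than others.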

Figure 3: KPD visualization with attention focused on motion-critical joints for each action, aligned to semantic priors.

Experimental Results

SkeletonContext was evaluated on NTU-RGB+D 60/120 and PKU-MMD benchmarks under both ZSL and GZSL protocols. The framework consistently achieves state-of-the-art performance with strong harmonic mean (H) scores indicative of balanced generalization between seen and unseen classes. Notably, SkeletonContext demonstrates robust transfer under random unseen-class splits and outperforms prior art in fine-grained action discrimination across ambiguous clusters.

Key numerical results:

NTU-60 (55/5 split): ZSL accuracy of 89.6%, GZSL harmonic mean 77.1%
NTU-120 (96/24 split): ZSL accuracy of 60.1%, GZSL harmonic mean 56.1%
PKU-MMD (46/5 split): ZSL accuracy of 73.5%, GZSL harmonic mean 71.4%
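
For reference, the GZSL harmonic mean is conventionally computed from seen-class and unseen-class accuracy as H = 2SU / (S + U). The snippet below is a minimal sketch of that standard formula; the input accuracies are hypothetical and are not taken from the paper, which this summary reports only through its H scores.

```python
def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean of seen and unseen accuracy, as used in GZSL evaluation."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# Hypothetical accuracies (in percent), for illustration only.
print(round(harmonic_mean(80.0, 60.0), 1))  # 68.6
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot score well on H by excelling on seen classes while failing on unseen ones.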

Ablation studies confirm the cumulative benefits of DJE, semantic context grounding (SCG), PPM, and KPD, with each component contributing to improved context induction and motion discrimination. Removing either the context reconstruction loss or the key-part decoupling loss results in a statistically significant degradation (a drop of $\sim$2-3%).

Fine-Grained Semantic Reasoning and Qualitative Analysis

SkeletonContext's context reconstruction is validated qualitatively: the model reliably distinguishes visually similar actions such as "typing" and "writing" by correctly inferring contextual objects (keyboard/tablet vs. pen/paper). Similarly, ambiguous clusters, e.g., "putting on glasses" vs. "putting on a hat", are resolved via accurate context prediction, demonstrating robust semantic grounding from skeleton-only input.

Figure 4: SkeletonContext qualitative results showing inferred contextual objects for visually ambiguous actions, supporting fine-grained discrimination.

Implications and Future Directions

SkeletonContext indicates that language-driven context injection can substantially narrow the semantic gap in skeleton-based action recognition, enabling robust cross-modal generalization and fine-grained action discrimination. Practically, this advances privacy-preserving action recognition for applications including intelligent surveillance, healthcare analytics, and HCI, with minimal dependency on appearance cues. Theoretically, SkeletonContext offers a unified approach to multimodal grounding in abstract visual domains, laying foundations for extensions to few-shot learning, video reasoning, and embodied AI.

Further research can investigate:

Extension to temporally complex or continuous action streams.
Integration with video-level or multi-modal cues for enhanced context induction.
Unsupervised or semi-supervised adaptation to real-world settings.

Conclusion

SkeletonContext introduces a principled, context-aware framework for ZSSAR, coupling language-driven contextual reconstruction with motion-centric key-part decoupling. Extensive quantitative and qualitative evidence substantiates its superiority in balancing performance across seen and unseen categories as well as distinguishing visually similar actions. The approach demonstrates practical relevance and theoretical potential for the broader AI community, especially in multimodal understanding and semantic grounding tasks.
