- The paper introduces a label-efficient active learning paradigm using an adversarial acquisition function to select informative examples for skeleton action recognition.
- It employs a stable bidirectional GCN framework with bi-Lipschitz continuity and orthonormal regularization to ensure robust latent space mapping.
- Experiments on SBU and FPHA datasets demonstrate significant accuracy gains over traditional sampling methods under constrained labeling budgets.
Label-Efficient Active Learning for GCN-Based Skeleton Action Recognition
Problem Motivation and Context
Skeleton-based action recognition leverages spatio-temporal skeletal data, typically represented as graph structures, to identify and classify human actions. Graph convolutional networks (GCNs) have established a new state-of-the-art in this domain by exploiting the inherent topology of skeleton joints. However, the performance of such models is predicated upon access to substantial volumes of labeled data—a resource that is expensive and labor-intensive to obtain, particularly for high-fidelity skeletal annotations. Conventional approaches to mitigating annotation scarcity, such as data augmentation, transfer, few-shot, or self-supervised learning, have limited effectiveness when direct annotation is budget-constrained. Consequently, this work proposes a label-efficient active learning paradigm specifically tailored to GCN-based action recognition.
Methodology: Adversarial Acquisition Function and Stable Bidirectional GCNs
Novel Acquisition Function via Probabilistic Display Model
The methodological centerpiece is a display model (the acquisition function that selects instances to be labeled) rooted in a probabilistic, adversarial formulation. Unlike classical sampling from a static unlabeled pool, the proposed model synthesizes candidate exemplars by minimizing an objective that simultaneously enforces representativeness, diversity, and uncertainty. Specifically, exemplars are constructed by optimizing a set of memberships μ, yielding display sets that are diverse with respect to previously labeled samples and maximally representative of the overall data distribution. The uncertainty term steers acquisition toward samples near decision boundaries, expediting convergence of the learned classifiers.
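The flavor of such an acquisition rule can be sketched as below. This is an illustrative simplification, not the paper's actual membership-based objective: the weighted sum of three scores (and the names `acquisition_scores`, `select_display_set`, and the `alpha/beta/gamma` weights) is a hypothetical stand-in for the optimization over memberships μ.

```python
import numpy as np

def acquisition_scores(latent, labeled_idx, probs, alpha=1.0, beta=1.0, gamma=1.0):
    """Score unlabeled samples by representativeness, diversity, and uncertainty.

    latent      : (N, d) latent representations of all samples
    labeled_idx : indices of already-labeled samples
    probs       : (N, K) current classifier posteriors
    The alpha/beta/gamma weighting is a simplification of the paper's
    membership-based objective, used here only for illustration.
    """
    # Representativeness: proximity to the data centroid (higher = more typical).
    centroid = latent.mean(axis=0)
    rep = -np.linalg.norm(latent - centroid, axis=1)

    # Diversity: distance to the nearest already-labeled sample.
    if len(labeled_idx) > 0:
        d = np.linalg.norm(latent[:, None, :] - latent[labeled_idx][None, :, :], axis=2)
        div = d.min(axis=1)
    else:
        div = np.zeros(len(latent))

    # Uncertainty: Shannon entropy of the classifier posteriors.
    unc = -(probs * np.log(probs + 1e-12)).sum(axis=1)

    score = alpha * rep + beta * div + gamma * unc
    score[labeled_idx] = -np.inf  # never re-select labeled samples
    return score

def select_display_set(latent, labeled_idx, probs, budget):
    """Greedy top-k selection of the next display (query) set."""
    scores = acquisition_scores(latent, labeled_idx, probs)
    return np.argsort(scores)[::-1][:budget]
```

In the paper's formulation the display set is *synthesized* by optimization rather than ranked from a fixed pool; the greedy top-k here only conveys how the three criteria trade off.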
Latent Space Design via Bidirectional Stable GCNs
To circumvent limitations posed by the nonlinear nature of ambient data representations, the acquisition function is applied in the latent space induced by stable, invertible, bidirectional GCNs. The bidirectionality is rigorously defined using bi-Lipschitz continuity, with provable guarantees that the mapping and its inverse remain robust to input perturbations, contingent on bounded condition numbers and activation function slopes. To operationalize this, GCN training is regularized towards orthonormal weight matrices, yielding networks whose latent spaces are tractable (e.g., Gaussian), stable under inversion, and amenable to efficient exemplar synthesis. Further stability is achieved through weight reparametrization, directly shifting eigenvalues to minimize the condition number, thus ensuring stable back-mapping of designed exemplars to the ambient space.
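The orthonormality regularization can be illustrated with the standard soft penalty that drives weight matrices toward orthonormal columns; this is a generic sketch of the idea (the exact regularizer and reparametrization in the paper may differ), since near-orthonormal weights keep singular values near 1 and hence keep the condition number, and with it the bi-Lipschitz constants, small.

```python
import numpy as np

def orthonormal_penalty(W):
    """Soft orthonormality regularizer ||W^T W - I||_F^2.

    Driving this penalty to zero pushes the columns of W toward
    orthonormality, which clusters singular values around 1 and
    thereby bounds the condition number of the layer.
    """
    k = W.shape[1]
    gram = W.T @ W
    return np.linalg.norm(gram - np.eye(k), ord='fro') ** 2

def condition_number(W):
    """Ratio of largest to smallest singular value of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return s.max() / s.min()
```

For a matrix with orthonormal columns the penalty is zero and the condition number is exactly 1, which is the regime in which the inverse mapping back to the ambient space stays stable.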
Experimental Analysis and Results
Benchmarking is performed on SBU Interaction and First Person Hand Action (FPHA) datasets, both characterized by complex spatio-temporal action taxonomies. Input skeletons are encoded as temporal-trajectory graphs with robust feature chunking. The display model—both in ambient and latent space—demonstrates superior selection of informative samples, leading to state-of-the-art classification accuracy under constrained labeling regimes.
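A minimal sketch of temporal chunking for trajectory encoding is given below. The paper's exact chunking scheme is not specified here, so this generic version (equal-width chunks with per-chunk averaging, and the helper name `chunked_trajectory`) is an assumption used only to illustrate how variable-length skeleton sequences become fixed-size graph signals.

```python
import numpy as np

def chunked_trajectory(skeleton_seq, n_chunks=8):
    """Aggregate a variable-length skeleton sequence into fixed-size chunks.

    skeleton_seq : (T, J, 3) array of T frames, J joints, 3D coordinates.
    Returns an (n_chunks, J, 3) descriptor obtained by averaging frames
    within each equal-width temporal chunk; assumes T >= n_chunks.
    """
    T = skeleton_seq.shape[0]
    assert T >= n_chunks, "sequence shorter than the number of chunks"
    bounds = np.linspace(0, T, n_chunks + 1).astype(int)
    chunks = [skeleton_seq[a:b].mean(axis=0) for a, b in zip(bounds[:-1], bounds[1:])]
    return np.stack(chunks)
```

Averaging within chunks makes the descriptor length-invariant and partially robust to frame-level jitter, which is the usual motivation for chunked trajectory encodings.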
Notable Results
- At 45% labeling rate on SBU, latent-space exemplar design achieves 93.84% accuracy versus 89.23% (random selection), 83.07% (diversity sampling), and 67.69% (uncertainty sampling).
- At a 15% labeling rate, the proposed latent display achieves 75.38% accuracy, far exceeding the uncertainty and diversity baselines.
- On FPHA, analogous gains are observed: 75.65% at 45% labeling with latent display (vs. 75.47% random, 70.26% diversity, 63.3% uncertainty).
- Regularization studies show orthonormality constraints yield the highest accuracy together with the lowest Fréchet Inception Distance (FID) and condition number (CN), confirming tight latent-to-ambient sample stability; the orthonormality (OR) regularizer alone achieves 93.84% on SBU and 75.65% on FPHA.
Theoretical and Practical Implications
The adversarial exemplar design in bidirectional latent spaces provides principled selection of informative samples beyond heuristic or static-pool approaches, enabling highly label-efficient training without auxiliary generative networks. The bi-Lipschitz stability of the network ensures interpretability and robustness of the designed display sets, advancing the theoretical understanding of invertible graph architectures for representation learning. Practically, the framework is domain-agnostic: it scales to other modalities and annotation-scarce tasks, with immediate relevance to surveillance, HCI, and robotics, where skeleton action datasets remain costly to annotate.
Prospects for Future Research
- Extension to multimodal action recognition: Integrating RGB-D and audio alongside skeleton graphs may leverage the display model for cross-modal information acquisition.
- Hybrid selection strategies: The proposed acquisition could be combined with reinforcement learning or dynamic budget allocation, potentially enhancing performance in open-set action taxonomies.
- Few-shot and domain adaptation: Bidirectional latent space design offers promise for unsupervised domain transfer and zero-shot action recognition, aligning with current directions in self-supervised representation learning.
Conclusion
This work formalizes a label-efficient active learning framework for GCN-based skeleton action recognition that outperforms existing methods, especially under stringent labeling budgets. Its contributions—principled adversarial exemplar selection, stable bidirectional graph mappings, and rigorous empirical validation—set a strong precedent for annotation-efficient architecture design and graph-based deep learning (2511.21625).