Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition

Published 27 Feb 2018 in cs.CV | (1802.09834v1)

Abstract: Variations of human body skeletons may be considered as dynamic graphs, which are generic data representation for numerous real-world applications. In this paper, we propose a spatio-temporal graph convolution (STGC) approach for assembling the successes of local convolutional filtering and sequence learning ability of autoregressive moving average. To encode dynamic graphs, the constructed multi-scale local graph convolution filters, consisting of matrices of local receptive fields and signal mappings, are recursively performed on structured graph data of temporal and spatial domain. The proposed model is generic and principled as it can be generalized into other dynamic models. We theoretically prove the stability of STGC and provide an upper-bound of the signal transformation to be learnt. Further, the proposed recursive model can be stacked into a multi-layer architecture. To evaluate our model, we conduct extensive experiments on four benchmark skeleton-based action datasets, including the large-scale challenging NTU RGB+D. The experimental results demonstrate the effectiveness of our proposed model and the improvement over the state-of-the-art.

Summary

The paper "Spatio-Temporal Graph Convolution for Skeleton-Based Action Recognition" by Chaolong Li et al. proposes a spatio-temporal graph convolution (STGC) approach for recognizing human actions from skeleton sequences. The approach combines local convolutional filtering with the sequence-learning ability of autoregressive moving average (ARMA) models, applied recursively.

Overview

The authors argue that traditional skeleton-based recognition methods often struggle with the dynamic and irregular structure of human skeletons. To address this, they model skeletal motion as dynamic graphs, which allows analysis over both the spatial and the temporal domain. The model is built from multi-scale local graph convolution filters, each consisting of matrices of local receptive fields and signal mappings, which are applied recursively to the structured graph data in both domains.

Methodology

Graph Representation: The paper discusses representing human skeletons as graphs where nodes represent joints and edges represent bones. The dynamic nature of human motion is captured by modeling sequences of these graphs.
Convolutional Filters: The proposed STGC employs multi-scale convolutional filters designed using polynomials of adjacency matrices. These filters operate at multiple scales to extract localized spatial and temporal information from skeletal data, potentially improving the robustness of action recognition.
Recursive Learning: The model incorporates recursive convolutional layers to account for motion dynamics, inspired by ARMA models. The recursive procedure ensures that temporal variations in skeletal motion are effectively encoded in the hidden states of the network.
Stability and Transformation: The authors theoretically prove the stability of STGC and provide an upper bound on the signal transformation to be learnt, which gives the recursive model a principled footing.
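
The first three steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 5-joint toy skeleton, the symmetric normalization of the adjacency matrix, the tanh nonlinearity, and all function and variable names are assumptions made for illustration, and the recursion keeps only a hidden-state update rather than the paper's full ARMA-inspired formulation.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (not taken from the paper):
# joint 1 is a hub connected to joints 0, 2, 3 and 4.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5


def build_adjacency(num_joints, edges):
    """Symmetric adjacency matrix: joints are nodes, bones are edges."""
    adj = np.zeros((num_joints, num_joints))
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    return adj


def normalize_adjacency(adj):
    """D^{-1/2} A D^{-1/2}; an assumed choice that keeps eigenvalues in [-1, 1]."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    inv_sqrt[mask] = deg[mask] ** -0.5
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]


def polynomial_filter(adj_norm, x, weights):
    """Multi-scale filter: sum over k of A^k @ X @ W_k, for k = 0 .. K-1.

    A^k reaches joints up to k hops away, so each W_k acts at one scale.
    """
    out = np.zeros((x.shape[0], weights[0].shape[1]))
    power = np.eye(adj_norm.shape[0])  # A^0
    for w_k in weights:
        out += power @ x @ w_k
        power = power @ adj_norm  # advance to the next power of A
    return out


def recursive_encode(frames, adj_norm, w_in, w_hid):
    """Assumed recursion: h_t = tanh(filter(x_t; w_in) + filter(h_{t-1}; w_hid))."""
    hidden = np.zeros((adj_norm.shape[0], w_hid[0].shape[1]))
    for x_t in frames:
        hidden = np.tanh(
            polynomial_filter(adj_norm, x_t, w_in)
            + polynomial_filter(adj_norm, hidden, w_hid)
        )
    return hidden


rng = np.random.default_rng(0)
adj_norm = normalize_adjacency(build_adjacency(NUM_JOINTS, EDGES))
frames = rng.normal(size=(8, NUM_JOINTS, 3))  # 8 frames of 3D joint coordinates
K, hidden_dim = 3, 4  # polynomial order and hidden width (arbitrary choices)
w_in = [0.1 * rng.normal(size=(3, hidden_dim)) for _ in range(K)]
w_hid = [0.1 * rng.normal(size=(hidden_dim, hidden_dim)) for _ in range(K)]
state = recursive_encode(frames, adj_norm, w_in, w_hid)
print(state.shape)  # (5, 4): one hidden vector per joint
```

The final hidden state holds one feature vector per joint and could be pooled and passed to a classifier. In this sketch the weights are random; in a trained model they would be learnt from labelled action sequences.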

Results

The STGC approach was evaluated on four benchmark skeleton-based action datasets: Florence 3D, HDM05, Large Scale Combined (LSC), and NTU RGB+D. The authors report improvements over the state of the art at the time of publication, including on the large-scale, challenging NTU RGB+D dataset.

Implications

The STGC approach advances skeleton-based action recognition by pairing a graph-based model of human motion with a stability guarantee. Potential application areas include video surveillance, robotics, and human-computer interaction.

Speculations on Future Developments

The proposed STGC model opens pathways for applying deep learning techniques on graph data structures in dynamic environments. Future research may focus on:

Enhancing computational efficiency and scalability to deal with even larger datasets and real-time applications.
Integrating STGC with other modalities such as audio or visual cues for richer multimodal action recognition frameworks.
Exploring deeper architectures and fusion techniques to further enhance performance in diverse application scenarios.

Overall, the paper provides a well-founded approach with real-world applications in dynamic human action recognition, representing a promising direction in AI research focused on graph-structured data.