- The paper presents a novel dynamic gesture recognition method using only standard RGB cameras by condensing temporal information into a single "star RGB" image processed by an ensemble of ResNet CNNs.
- This approach achieved high accuracy, reaching 94.58% on the challenging Montalbano dataset and over 98% on the GRIT dataset, surpassing previous methods that relied on multimodal data.
- The findings demonstrate the potential for cost-effective and practical dynamic gesture recognition in real-world settings using widely available standard RGB cameras, reducing the need for specialized depth or skeleton sensors.
Dynamic Gesture Recognition Using CNNs and Star RGB
In the paper titled "Dynamic Gesture Recognition by Using CNNs and star RGB: a Temporal Information Condensation," the authors present a novel approach to the problem of dynamic gesture recognition using standard RGB cameras. Unlike traditional methods that rely on multimodal data, such as depth and skeleton information, this work focuses exclusively on color data. This choice is based on the widespread availability of RGB cameras in public spaces, making the approach more practical and cost-effective.
Methodology
The core innovation of the paper is the "star RGB" technique, which condenses the temporal information of a dynamic gesture into a single RGB image. The process begins by splitting the video of a gesture into three parts: pre-stroke, stroke, and post-stroke, each capturing a distinct phase of the gesture. The star representation, traditionally computed as a grayscale motion-history image, is extended to the RGB color space: each temporal segment is mapped to one color channel, with cosine similarity measures used to encode additional temporal information.
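To make the condensation step concrete, below is a minimal NumPy sketch of the general idea. The equal-thirds split and the min-max normalization are simplifying assumptions for illustration only; the paper segments the video by gesture phase and defines its own per-channel computation, including the cosine-similarity weighting.

```python
import numpy as np

def star_channel(frames):
    """Accumulate absolute differences between consecutive grayscale
    frames into one 'star' image (motion-history style). Illustrative
    only; the paper's channel computation uses its own weighting."""
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        acc += np.abs(curr.astype(np.float64) - prev.astype(np.float64))
    if acc.max() > 0:                      # normalize to ordinary 8-bit range
        acc = 255.0 * acc / acc.max()
    return acc.astype(np.uint8)

def star_rgb(frames):
    """Condense a gesture clip into a single RGB image: split the frame
    sequence into three segments (stand-ins for pre-stroke, stroke, and
    post-stroke) and map each segment's star image to one color channel."""
    n = len(frames)
    segments = [frames[:n // 3], frames[n // 3:2 * n // 3], frames[2 * n // 3:]]
    return np.stack([star_channel(s) for s in segments], axis=-1)
```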
The star RGB image is then fed to a classifier built on two ResNet Convolutional Neural Networks (CNNs), ResNet-50 and ResNet-101. The two networks form an ensemble: the features extracted by each are weighted through a soft-attention mechanism before classification via fully connected layers and a softmax function. The ensemble leverages transfer learning from models pre-trained on ImageNet, avoiding the need for extensive training on massive gesture-specific datasets.
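The sketch below shows one way such an ensemble could be assembled in PyTorch. The frozen backbones, the shape of the attention layer, and the single fully connected head are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class StarRGBEnsemble(nn.Module):
    """Illustrative ensemble: ImageNet-pretrained ResNet-50 and ResNet-101
    as feature extractors, soft attention over the two feature vectors,
    and a fully connected classifier (softmax applied at loss time)."""
    def __init__(self, num_classes=20):
        super().__init__()
        r50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        r101 = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
        # Drop each network's final FC layer, keep the 2048-d pooled features
        self.backbone50 = nn.Sequential(*list(r50.children())[:-1])
        self.backbone101 = nn.Sequential(*list(r101.children())[:-1])
        for p in self.backbone50.parameters():
            p.requires_grad = False       # transfer learning: reuse ImageNet
        for p in self.backbone101.parameters():
            p.requires_grad = False       # features without retraining them
        self.attention = nn.Linear(2048, 1)        # scores each feature vector
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):                           # x: (B, 3, 224, 224) star RGB
        f50 = self.backbone50(x).flatten(1)         # (B, 2048)
        f101 = self.backbone101(x).flatten(1)       # (B, 2048)
        feats = torch.stack([f50, f101], dim=1)     # (B, 2, 2048)
        weights = torch.softmax(self.attention(feats), dim=1)  # soft attention
        fused = (weights * feats).sum(dim=1)        # attention-weighted fusion
        return self.classifier(fused)               # class logits
```

With frozen backbones, only the attention and classifier parameters are trained, which is one way to exploit ImageNet pre-training without a massive gesture-specific dataset.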
Results
Experiments were conducted on two datasets: the Montalbano dataset, comprising roughly 13,000 anthropological gestures distributed across 20 classes, and the GRIT dataset, which contains gestures designed for human-robot interaction. The proposed approach achieved a recognition accuracy of 94.58% on the challenging Montalbano dataset, outperforming previous methods that rely on multimodal data. On the GRIT dataset, which is more specialized and contains fewer classes, the system achieved above 98% accuracy, recall, precision, and F1-score.
Implications and Future Work
The findings of the paper highlight the potential for RGB-based dynamic gesture recognition in real-world applications, particularly in environments equipped with standard surveillance cameras. The approach reduces the need for specialized equipment like Kinect or Intel RealSense, broadening the accessibility and adaptability of gesture recognition technologies.
Further, the paper suggests that future work could address scenarios where hand shape plays a critical role in distinguishing gestures, an area not deeply explored in this work. It also points to compensating for camera motion, for instance through homography estimation, as a way to make the method robust in settings with moving cameras; a generic sketch of that idea follows.
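As one possible realization of that last suggestion, camera motion between consecutive frames could be undone before computing the star image. The OpenCV sketch below estimates a homography from ORB feature matches and warps the current frame into the previous frame's coordinates; this is a generic stabilization recipe, not a method from the paper.

```python
import cv2
import numpy as np

def stabilize_frame(prev_gray, curr_gray):
    """Estimate the homography induced by camera motion between two frames
    and warp the current frame into the previous frame's coordinates, so
    residual frame differences reflect gesture motion, not ego-motion."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:200]
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = prev_gray.shape
    return cv2.warpPerspective(curr_gray, H, (w, h))
```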
Overall, the proposed star RGB representation, combined with the CNN-based recognition framework, offers a robust solution for dynamic gesture recognition, paving the way for more inclusive and flexible human-machine interaction systems.