
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

Published 12 May 2019 in cs.CV (arXiv:1905.04757v2)

Abstract: Research on depth-based human activity analysis achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition. The existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, realistic number of distinct class categories, diversity in camera views, varied environmental conditions, and variety of human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset, and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework is proposed for this task, which yields promising results for recognition of the novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding. [The dataset is available at: http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]

Citations (1,144)

Summary

  • The paper presents NTU RGB+D 120, a large-scale dataset with over 114,000 videos across 120 diverse action categories.
  • It leverages Microsoft Kinect v2 to capture four data modalities, enhancing cross-view and cross-environment activity recognition.
  • The study introduces benchmark evaluations and the APSR framework, advancing one-shot 3D action recognition with semantic relevance.

NTU RGB+D 120: A Comprehensive Dataset for 3D Human Activity Understanding

Introduction

The paper "NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding" (arXiv:1905.04757) presents a significant contribution to the field of 3D human activity analysis. The NTU RGB+D 120 dataset addresses critical shortcomings of existing datasets by providing a large-scale, diverse, and comprehensive resource designed to advance research in RGB+D vision and 3D action recognition.

Dataset Overview

The NTU RGB+D 120 dataset is distinguished by its scale and diversity. It includes over 114,000 video samples collected from 106 distinct subjects, encompassing over 8 million frames. The dataset is organized into 120 action categories, which are divided into daily activities, mutual activities, and health-related activities. This extensive categorization and large subject pool are crucial for capturing realistic intra-class variations and ensuring the dataset's applicability to a wide range of real-world scenarios.

Figure 1: Sample frames of the NTU RGB+D 120 dataset illustrating diverse subjects, camera views, environmental conditions, and data modalities.

Technical Specifications

The dataset is acquired using Microsoft Kinect v2 sensors, capturing four data modalities: RGB, depth, 3D skeletal joints, and infrared (IR) sequences. This multimodal approach ensures a comprehensive representation of human activities, combining visual, geometric, and depth information. The dataset includes 155 distinct camera viewpoints and varies significantly in camera distances, heights, and environments, enriching the possibilities for cross-view and cross-environment evaluations.

Figure 2: Illustration of the configuration of 25 body joints in the dataset.
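Samples in the NTU RGB+D datasets are conventionally named with encoded metadata fields (setup, camera, performer, replication, action class), e.g. `S032C001P008R002A120`. The following sketch parses that published naming convention; the example sample name is illustrative, not a reference to a specific file.

```python
import re

# NTU-style sample names: SsssCcccPpppRrrrAaaa, where each field is a
# zero-padded 3-digit integer (setup, camera, performer, replication, action).
PATTERN = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_sample_name(name):
    """Extract metadata fields from an NTU-style sample name."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"not an NTU-style sample name: {name}")
    setup, camera, performer, replication, action = map(int, m.groups())
    return {"setup": setup, "camera": camera, "performer": performer,
            "replication": replication, "action": action}

# Illustrative: subject 8 performing action class 120 in setup 32, camera 1.
info = parse_sample_name("S032C001P008R002A120")
```

Parsing these fields is the usual first step when building the evaluation splits described below.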

Benchmark Evaluations

The paper rigorously evaluates contemporary 3D activity recognition methods on this dataset using defined cross-subject and cross-setup evaluation criteria, providing a robust benchmark for future research. The results highlight the efficacy of deep learning techniques in leveraging the dataset's richness, demonstrating that recognition performance significantly improves with data fusion across RGB, depth, and skeletal data modalities.
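The two protocols can be sketched as simple split rules over each sample's metadata. The cross-setup rule (even-numbered setups for training, odd for testing) follows the paper; the cross-subject training-ID set shown here is an abbreviated, illustrative subset only, as the full list of 53 training subjects is given in the paper.

```python
# Illustrative subset of cross-subject training IDs -- NOT the full list of 53.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16}

def assign_split(sample, protocol):
    """Return 'train' or 'test' for one parsed sample under a protocol.

    `sample` is a dict with integer 'setup' and 'performer' fields.
    """
    if protocol == "cross-setup":
        # Even setup IDs are used for training, odd setup IDs for testing.
        return "train" if sample["setup"] % 2 == 0 else "test"
    if protocol == "cross-subject":
        return "train" if sample["performer"] in TRAIN_SUBJECTS else "test"
    raise ValueError(f"unknown protocol: {protocol}")
```

Keeping the split logic this explicit makes it easy to verify that no training subject or setup leaks into the test partition.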

One-Shot 3D Action Recognition

A novel aspect explored in this work is the one-shot 3D activity recognition problem. The authors introduce the Action-Part Semantic Relevance-aware (APSR) framework, which utilizes semantic embeddings to emphasize relevant body parts based on the action context. The APSR framework shows promising results by efficiently generalizing to novel action classes using semantic relevance scores derived from pre-trained word embeddings.

Figure 3: Estimating semantic relevance score between action and body part text descriptions using Word2Vec.
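The relevance score can be illustrated as a cosine similarity between embeddings of the action description and a body-part name. This is a minimal sketch: `EMBED` is a toy stand-in for a pre-trained Word2Vec lookup (the vectors and vocabulary here are invented for illustration), and averaging word vectors over multi-word descriptions follows the general approach depicted in Figure 3.

```python
import math

# Toy 3-d "embeddings" standing in for a pre-trained Word2Vec model.
EMBED = {
    "drink": [0.9, 0.1, 0.2], "hand": [0.8, 0.2, 0.1],
    "kick":  [0.1, 0.9, 0.3], "foot": [0.2, 0.8, 0.2],
}

def embed(text):
    """Average the word vectors of a whitespace-split description."""
    vecs = [EMBED[w] for w in text.split()]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def relevance(action, body_part):
    """Cosine similarity between action and body-part embeddings."""
    a, b = embed(action), embed(body_part)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With these toy vectors, "drink" scores higher against "hand" than "foot".
```

In the actual framework such scores would weight the contribution of each body part's features when recognizing a novel action class.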

Implications and Future Directions

The availability of the NTU RGB+D 120 dataset opens several research avenues. It provides a fertile ground for testing hypotheses concerning data modality fusion, cross-view invariance, and robust human action classification in diverse environments. Its scale allows for deep learning models to be pre-trained effectively, potentially enhancing performance on smaller datasets and specific tasks like one-shot learning and early action recognition.

Practical implications include improvements in surveillance systems, human-computer interaction, and assistive technologies, where recognizing diverse and nuanced human actions in real time is crucial. The use of semantic relevance could further support tasks that bridge visual data and natural language.

Conclusion

The NTU RGB+D 120 dataset extends the frontier of 3D human activity understanding. By providing a large-scale, richly annotated, multimodal resource, it is set to catalyze advancements across computational vision and AI methods for human action recognition. The dataset's comprehensive nature addresses key limitations in the field, paving the way for innovations in both theoretical research and practical applications.
