
Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning

Published 25 Jan 2021 in cs.CV (arXiv:2101.10030v3)

Abstract: Anomaly detection with weakly supervised video-level labels is typically formulated as a multiple instance learning (MIL) problem, in which we aim to identify snippets containing abnormal events, with each video represented as a bag of video snippets. Although current methods show effective detection performance, their recognition of the positive instances, i.e., rare abnormal snippets in the abnormal videos, is largely biased by the dominant negative instances, especially when the abnormal events are subtle anomalies that exhibit only small differences compared with normal events. This issue is exacerbated in many methods that ignore important video temporal dependencies. To address this issue, we introduce a novel and theoretically sound method, named Robust Temporal Feature Magnitude learning (RTFM), which trains a feature magnitude learning function to effectively recognise the positive instances, substantially improving the robustness of the MIL approach to the negative instances from abnormal videos. RTFM also adapts dilated convolutions and self-attention mechanisms to capture long- and short-range temporal dependencies to learn the feature magnitude more faithfully. Extensive experiments show that the RTFM-enabled MIL model (i) outperforms several state-of-the-art methods by a large margin on four benchmark data sets (ShanghaiTech, UCF-Crime, XD-Violence and UCSD-Peds) and (ii) achieves significantly improved subtle anomaly discriminability and sample efficiency. Code is available at https://github.com/tianyu0207/RTFM.

Citations (249)

Summary

  • The paper introduces RTFM, which enhances anomaly detection by learning temporal feature magnitudes through weak supervision.
  • It leverages pyramid dilated convolutions and temporal self-attention to capture both local and global temporal dependencies in video data.
  • RTFM achieves superior AUC results on benchmarks like ShanghaiTech and UCF-Crime, even with limited abnormal training data.


Video anomaly detection plays a crucial role in surveillance and security applications, with the primary goal being the identification of anomalous sequences within video data. The paper "Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning" introduces a novel approach named Robust Temporal Feature Magnitude (RTFM) learning to enhance anomaly detection using weakly-supervised methods.

Problem Statement

Traditional anomaly detection methods often struggle with the identification of rare abnormal instances within videos that predominantly contain normal events. These methods are biased due to the overwhelming presence of negative instances and the subtle differences between abnormal and normal snippets. The paper addresses these challenges by proposing a robust mechanism to distinguish anomalous snippets more effectively.

RTFM Overview

RTFM improves the recognition of positive instances by learning a feature magnitude function that separates abnormal snippets from normal ones while modelling their temporal dependencies.

  • Feature Magnitude Learning: The method leverages the magnitude of feature vectors, using ℓ2 norms to distinguish between normal and abnormal snippets.
  • Temporal Dependencies: Dilated convolutions and self-attention mechanisms are incorporated to capture temporal dependencies, both long-range and short-range, across video sequences.
  • Loss Functions: Two loss functions are used: one maximises the separability of abnormal and normal video features, and the other performs snippet classification.

    Figure 1: RTFM trains a feature magnitude learning function to improve the robustness of MIL approaches to normal snippets from abnormal videos, and to detect abnormal snippets more effectively.
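The magnitude-based selection and separability objective can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the top-k value `k` and the margin are hypothetical hyperparameters, and the hinge form is a simplified stand-in for the paper's separability loss.

```python
import numpy as np

def topk_magnitude(feats, k=3):
    """Mean l2-norm of the k largest-magnitude snippet features.

    feats: (T, D) array of snippet features for one video.
    """
    mags = np.linalg.norm(feats, axis=1)  # per-snippet l2 norms
    topk = np.sort(mags)[-k:]             # k largest magnitudes
    return topk.mean()

def separability_loss(abn_feats, nor_feats, k=3, margin=100.0):
    """Hinge loss pushing the top-k magnitudes of an abnormal video
    above those of a normal video by at least `margin`."""
    s_abn = topk_magnitude(abn_feats, k)
    s_nor = topk_magnitude(nor_feats, k)
    return max(0.0, margin - s_abn + s_nor)
```

Selecting only the top-k snippets is what makes the objective robust: the many normal snippets inside an abnormal video do not dominate the video-level score.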

Architecture of RTFM

The architecture involves extracting temporal features using a multi-scale temporal network composed of pyramid dilated convolutions (PDC) and temporal self-attention (TSA).

  • Pyramid Dilated Convolutions (PDC): Applied to capture various temporal scales without losing resolution.
  • Temporal Self-Attention (TSA): Models temporal context by correlating snippets across the video sequence.

The output is a concatenation of features from the PDC and TSA branches, refined using skip connections.

Figure 2: Our proposed MTN consists of two modules. The module on the left uses pyramid dilated convolutions to capture local consecutive snippets dependency over different temporal scales. The module on the right relies on a self-attention network to compute global temporal correlations.
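The two modules described above can be sketched in a few lines. This is a simplified, hedged illustration rather than the paper's implementation: the kernel weights, dilation rates, and the per-branch skip connection are assumptions chosen for clarity.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Temporal conv (kernel size 3, 'same' zero padding), shared across
    feature dims. x: (T, D) snippet features; w: (3,) kernel weights."""
    T = x.shape[0]
    pad = dilation
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return (w[0] * xp[:T]                 # left neighbour at distance `dilation`
            + w[1] * xp[pad:pad + T]      # centre snippet
            + w[2] * xp[2 * pad:2 * pad + T])  # right neighbour

def self_attention(x):
    """Global temporal self-attention (queries = keys = values = x)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ x

def mtn(x, dilations=(1, 2, 4)):
    """Concatenate pyramid dilated conv branches and a TSA branch,
    each with a skip connection back to the input."""
    w = np.array([0.25, 0.5, 0.25])  # illustrative shared kernel
    branches = [dilated_conv1d(x, w, d) for d in dilations]
    branches.append(self_attention(x))
    return np.concatenate([b + x for b in branches], axis=1)
```

The dilated branches see progressively wider local neighbourhoods at the same temporal resolution, while the attention branch correlates every snippet with every other one, matching the local/global split described in Figure 2.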

Experimental Results

The RTFM method is evaluated on four benchmark datasets: ShanghaiTech, UCF-Crime, XD-Violence, and UCSD-Peds, showing significant improvements over existing methods. Notable results include:

  • ShanghaiTech: RTFM achieves an AUC of 97.21%, significantly outperforming prior weakly-supervised methods and demonstrating robust feature magnitude learning.
  • UCF-Crime: The method results in an 84.30% AUC, exceeding previous benchmarks by enhancing sample efficiency and effective snippet separation.
  • Sample Efficiency: RTFM remains robust even with reduced abnormal training data, outperforming methods trained on the full datasets.

    Figure 3: AUC w.r.t. the number of abnormal training videos.

    Figure 4: AUC results w.r.t. individual classes on UCF-Crime.
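For readers unfamiliar with the metric, the AUC figures above are the area under the ROC curve over frame-level scores. A minimal sketch, using the rank-based (Mann-Whitney) formulation rather than any particular evaluation script:

```python
import numpy as np

def frame_auc(scores, labels):
    """ROC AUC: probability that a random anomalous frame scores
    higher than a random normal frame (ties count half).

    scores: (N,) anomaly scores; labels: (N,) 1 = anomalous, 0 = normal.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

An AUC of 97.21% thus means the model ranks an anomalous frame above a normal one in about 97% of such pairs.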

Implications and Future Work

RTFM presents a scalable approach to video anomaly detection by ensuring that abnormal snippets are distinguished by their feature magnitude, thus providing a stronger learning signal and enabling improved detection across various datasets. Future work will aim to further optimize temporal feature learning frameworks and explore extensions into real-time applications for video processing.

Conclusion

The RTFM approach significantly enhances anomaly detection by integrating robust temporal feature magnitude learning. This paradigm shift not only improves detection accuracy across multiple datasets but also ensures better discriminability of subtle anomalies while maintaining computational efficiency. With strong empirical and theoretical support, RTFM sets a new standard in weakly-supervised video anomaly detection.
