
Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM

Published 21 Feb 2021 in cs.CV and cs.LG | (2102.10590v3)

Abstract: Automatically detecting violence from surveillance footage is a subset of activity recognition that deserves special attention because of its wide applicability in unmanned security monitoring systems, internet video filtration, etc. In this work, we propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet where one stream takes in background suppressed frames as inputs and other stream processes difference of adjacent frames. We employed simple and fast input pre-processing techniques that highlight the moving objects in the frames by suppressing non-moving backgrounds and capture the motion in-between frames. As violent actions are mostly characterized by body movements these inputs help produce discriminative features. SepConvLSTM is constructed by replacing convolution operation at each gate of ConvLSTM with a depthwise separable convolution that enables producing robust long-range Spatio-temporal features while using substantially fewer parameters. We experimented with three fusion methods to combine the output feature maps of the two streams. Evaluation of the proposed methods was done on three standard public datasets. Our model outperforms the accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin while matching state-of-the-art results on the smaller datasets. Our experiments lead us to conclude, the proposed models are superior in terms of both computational efficiency and detection accuracy.

Citations (36)

Summary

  • The paper introduces an efficient two-stream architecture combining CNN and SepConvLSTM to extract spatio-temporal features for violence detection.
  • The study employs robust pre-processing techniques like background suppression and frame differencing to enhance feature extraction and reduce noise.
  • Experimental results on benchmark datasets reveal state-of-the-art performance with a significant reduction in computational requirements.

The paper "Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM" (2102.10590) presents a novel approach for violence detection in video footage using an efficient two-stream deep learning architecture. The method combines CNN and SepConvLSTM modules to produce robust spatio-temporal features for classifying video clips as violent or non-violent.

Introduction

Violence detection is a crucial task within human activity recognition, aimed at identifying aggressive behaviors such as fighting or rioting in surveillance footage. The paper introduces a two-stream architecture that leverages background suppression and frame differencing to enhance detection. The method uses simple, fast pre-processing to highlight moving objects and employs Separable Convolutional LSTM (SepConvLSTM) to reduce computational overhead while maintaining performance (Figure 1).

Figure 1: A schematic overview of the proposed method for violence detection.

Separable Convolutional LSTM

SepConvLSTM is an integral component of the proposed architecture, designed to handle long-range spatio-temporal features effectively. It uses depthwise separable convolutions at each gate instead of traditional convolutions, reducing the parameter count and computational requirements. This approach preserves the spatial features learned by the CNN and is particularly suited for tasks requiring efficient temporal encoding.
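The parameter savings of the depthwise separable substitution can be seen with simple arithmetic. The channel counts and 3x3 kernel below are illustrative assumptions, not the paper's exact gate configuration:

```python
def standard_conv_params(c_in, c_out, k):
    # Standard conv: every output channel mixes all input channels
    # with its own k x k kernel.
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    # Depthwise separable: one k x k kernel per input channel,
    # then a 1x1 pointwise conv that mixes channels.
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)   # 36864
sep = separable_conv_params(c_in, c_out, k)  # 4672
print(std, sep, round(std / sep, 1))         # roughly 7.9x fewer parameters
```

Applied at all four LSTM gates, this multiplicative saving is what makes the recurrent part of the network lightweight.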

Mathematical Formulation

The operations inside SepConvLSTM are expressed as follows:

$$
\begin{aligned}
i_t &= \sigma\big( {}_{1 \times 1}W_i^x * (W_i^x \circledast x_t) + {}_{1 \times 1}W_i^h * (W_i^h \circledast h_{t-1}) + b_i \big) \\
f_t &= \sigma\big( {}_{1 \times 1}W_f^x * (W_f^x \circledast x_t) + {}_{1 \times 1}W_f^h * (W_f^h \circledast h_{t-1}) + b_f \big) \\
\tilde{c}_t &= \tau\big( {}_{1 \times 1}W_c^x * (W_c^x \circledast x_t) + {}_{1 \times 1}W_c^h * (W_c^h \circledast h_{t-1}) + b_c \big) \\
o_t &= \sigma\big( {}_{1 \times 1}W_o^x * (W_o^x \circledast x_t) + {}_{1 \times 1}W_o^h * (W_o^h \circledast h_{t-1}) + b_o \big) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t \\
h_t &= o_t \otimes \tau(c_t)
\end{aligned}
$$

Here, $\circledast$ denotes the depthwise convolution, $*$ the pointwise $1 \times 1$ convolution (with weights ${}_{1 \times 1}W$), $\otimes$ the Hadamard product, $\sigma$ the sigmoid activation, and $\tau$ the tanh activation.
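The gate equations above can be sketched in a few dozen lines of NumPy. This is a minimal illustrative implementation, not the paper's code: the naive loop convolution, 'same' padding, and tensor shapes are all assumptions made for clarity.

```python
import numpy as np

def depthwise_separable_conv(x, w_dw, w_pw):
    """x: (C_in, H, W); w_dw: (C_in, k, k) depthwise kernels;
    w_pw: (C_out, C_in) pointwise 1x1 weights. 'Same' padding."""
    c_in, h, w = x.shape
    k = w_dw.shape[1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    dw = np.zeros_like(x)
    for c in range(c_in):                  # depthwise: each channel filtered alone
        for i in range(h):
            for j in range(w):
                dw[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * w_dw[c])
    return np.einsum('oc,chw->ohw', w_pw, dw)   # pointwise conv mixes channels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sepconvlstm_step(x_t, h_prev, c_prev, params):
    """One SepConvLSTM step. params holds one tuple per gate, in order
    (i, f, c~, o): (w_dw_x, w_pw_x, w_dw_h, w_pw_h, b)."""
    gates = []
    for w_dw_x, w_pw_x, w_dw_h, w_pw_h, b in params:
        gates.append(depthwise_separable_conv(x_t, w_dw_x, w_pw_x)
                     + depthwise_separable_conv(h_prev, w_dw_h, w_pw_h) + b)
    i_t, f_t = sigmoid(gates[0]), sigmoid(gates[1])
    c_tilde, o_t = np.tanh(gates[2]), sigmoid(gates[3])
    c_t = f_t * c_prev + i_t * c_tilde     # Hadamard products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

Unrolling `sepconvlstm_step` over the frames of a clip yields the temporal feature maps that each stream passes on to fusion.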

Pre-processing Techniques

The proposed model employs efficient input pre-processing techniques, background suppression and frame differencing, to focus on relevant movement patterns while suppressing static background information (Figure 2).

Figure 2: Input pre-processing for the proposed model.
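Both pre-processing steps are cheap per-pixel operations. The sketch below uses average-frame subtraction as one simple way to suppress a static background; the paper's exact normalization and color handling may differ:

```python
import numpy as np

def preprocess_clip(frames):
    """frames: (T, H, W) grayscale clip as float.
    Returns background-suppressed frames (one stream's input)
    and adjacent-frame differences (the other stream's input)."""
    background = frames.mean(axis=0)          # estimate of the static background
    suppressed = np.abs(frames - background)  # highlight moving objects
    diffs = frames[1:] - frames[:-1]          # motion between adjacent frames
    return suppressed, diffs
```

On a perfectly still clip both outputs are zero everywhere, which is exactly the property that lets the network focus on body movement.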

Network Architecture

The network is composed of two separate streams with similar architecture. Each stream passes its input frames through a truncated MobileNet module to extract spatial features, followed by a SepConvLSTM that encodes their temporal evolution; the output feature maps of the two streams are then combined by a fusion layer for classification. The architecture ensures that both appearance-based and motion-based information are accounted for when classifying video clips as violent or non-violent (Figure 3).

Figure 3: The proposed model architecture consisting of two CNN-LSTM streams.
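The fusion layer itself can be as simple as an elementwise or channel-stacking operation. The paper experiments with three fusion methods; the three variants below (concatenation, addition, multiplication) are illustrative stand-ins, not necessarily the paper's exact choices:

```python
import numpy as np

def fuse(a, b, method="concat"):
    """Combine two streams' output feature maps of shape (C, H, W)."""
    if method == "concat":
        return np.concatenate([a, b], axis=0)  # stack channels -> (2C, H, W)
    if method == "add":
        return a + b                           # elementwise sum -> (C, H, W)
    if method == "multiply":
        return a * b                           # elementwise product -> (C, H, W)
    raise ValueError(f"unknown fusion method: {method}")
```

Concatenation preserves both streams' features at the cost of doubling the channel count, while the elementwise variants keep the feature map size fixed.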

Experimental Evaluation

The proposed models were evaluated on three standard benchmark datasets for violence detection: RWF-2000, Hockey, and Movies. They surpassed previous methods in accuracy on the larger, more challenging RWF-2000 dataset by more than 2% while matching state-of-the-art results on the two smaller datasets, all with a significant reduction in parameters compared to prior approaches.

Efficiency Analysis

The paper demonstrates the efficiency of the SepConvLSTM module and its integration within the architecture compared to other models. The reduction in parameter count and the resulting savings in floating-point operations (FLOPs) make the model suitable for deployment in resource-constrained environments.

Qualitative Results

Qualitative results on the RWF-2000 dataset highlight the model's capability to correctly predict violence even in ambiguous scenarios. However, some failure cases were observed with occluded persons or poor-quality footage, reinforcing the need for robust environmental adaptation (Figure 4).

Figure 4: Qualitative results of the proposed model in different video scenarios.

Conclusion

The paper presents an innovative approach to violence detection that combines efficient architectural components and novel pre-processing techniques to achieve high detection accuracy. Future work could explore deeper SepConvLSTM layers or the integration of object-level features to further enhance performance. The proposed network provides a solid foundation for practical deployments in surveillance systems and other real-time applications.
