- The paper introduces an efficient two-stream architecture combining CNN and SepConvLSTM to extract spatio-temporal features for violence detection.
- The study employs robust pre-processing techniques like background suppression and frame differencing to enhance feature extraction and reduce noise.
- Experimental results on benchmark datasets reveal state-of-the-art performance with a significant reduction in computational requirements.
Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM
The paper "Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM" (arXiv:2102.10590) presents an efficient two-stream deep learning architecture for violence detection in video footage. The method combines a CNN backbone with SepConvLSTM modules to produce robust spatio-temporal features for classifying video clips as violent or non-violent.
Introduction
Violence detection is a crucial task within human activity recognition, aimed at identifying aggressive behaviors such as fighting or rioting in surveillance footage. The paper introduces a two-stream architecture that leverages background suppression and frame differencing to enhance detection capabilities. The proposed method uses these pre-processing techniques to highlight moving objects, and employs Separable Convolutional LSTM (SepConvLSTM) to reduce computational overhead while maintaining performance.
Figure 1: A schematic overview of the proposed method for violence detection.
Separable Convolutional LSTM
SepConvLSTM is an integral component of the proposed architecture, designed to handle long-range spatio-temporal dependencies effectively. It replaces the traditional convolutions at each LSTM gate with depthwise separable convolutions, reducing the parameter count and computational requirements. This approach preserves the spatial features learned by the CNN backbone and is particularly suited to tasks requiring efficient temporal encoding.
The operations inside SepConvLSTM are expressed as follows:
$$
\begin{aligned}
i_t &= \sigma\big(W^{1\times1}_{ix} * (W_{ix} \circledast x_t) + W^{1\times1}_{ih} * (W_{ih} \circledast h_{t-1}) + b_i\big) \\
f_t &= \sigma\big(W^{1\times1}_{fx} * (W_{fx} \circledast x_t) + W^{1\times1}_{fh} * (W_{fh} \circledast h_{t-1}) + b_f\big) \\
\tilde{c}_t &= \tau\big(W^{1\times1}_{cx} * (W_{cx} \circledast x_t) + W^{1\times1}_{ch} * (W_{ch} \circledast h_{t-1}) + b_c\big) \\
o_t &= \sigma\big(W^{1\times1}_{ox} * (W_{ox} \circledast x_t) + W^{1\times1}_{oh} * (W_{oh} \circledast h_{t-1}) + b_o\big) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes \tilde{c}_t \\
h_t &= o_t \otimes \tau(c_t)
\end{aligned}
$$
Here, ⊛ denotes depthwise convolution, * denotes the subsequent pointwise (1×1) convolution with weights W^{1×1}, ⊗ is the Hadamard (element-wise) product, σ is the sigmoid activation, and τ is the tanh activation.
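To make the gate computation concrete, here is a minimal NumPy sketch of a single gate pre-activation, in which each channel is first convolved with its own depthwise kernel and the channels are then mixed by a 1×1 pointwise convolution. The helper names, shapes, and random weights are illustrative assumptions, not the paper's code.

```python
import numpy as np

def depthwise_conv(x, w):
    """Depthwise 3x3 convolution with zero ('same') padding.
    x: (H, W, C) feature map; w: (3, 3, C), one kernel per channel."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            # each channel is convolved with its own 3x3 kernel, no mixing
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w, axis=(0, 1))
    return out

def pointwise_conv(x, w):
    """1x1 convolution mixing channels. w: (C_in, C_out)."""
    return x @ w

def separable_gate(x, h_prev, wx_dw, wx_pw, wh_dw, wh_pw, b, act):
    """One SepConvLSTM gate: act( PW(DW(x)) + PW(DW(h_prev)) + b )."""
    return act(pointwise_conv(depthwise_conv(x, wx_dw), wx_pw)
               + pointwise_conv(depthwise_conv(h_prev, wh_dw), wh_pw)
               + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H = W = 4
Cin, Ch = 3, 5                        # input channels, hidden channels
x = rng.standard_normal((H, W, Cin))  # current input frame features
h = rng.standard_normal((H, W, Ch))   # previous hidden state
i_t = separable_gate(
    x, h,
    rng.standard_normal((3, 3, Cin)), rng.standard_normal((Cin, Ch)),
    rng.standard_normal((3, 3, Ch)), rng.standard_normal((Ch, Ch)),
    0.0, sigmoid)
print(i_t.shape)  # (4, 4, 5): sigmoid gate values in (0, 1)
```

The forget, cell, and output gates follow the same pattern with their own weights, exactly as in the equations above.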
Pre-processing Techniques
The proposed model employs efficient input pre-processing techniques, namely background suppression and frame differencing, to focus on relevant movement patterns while suppressing static background information.
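As a hedged illustration of these two inputs, the sketch below computes adjacent-frame differences and a simple background suppression that subtracts the temporal mean frame; the paper's exact background-estimation scheme may differ from this mean-frame assumption.

```python
import numpy as np

def frame_difference(frames):
    """Absolute difference of adjacent frames: highlights motion,
    discards static scenery. frames: (T, H, W) -> (T-1, H, W)."""
    return np.abs(np.diff(frames.astype(float), axis=0))

def background_suppress(frames):
    """Suppress static background by subtracting the temporal mean
    frame (a simple background estimate, assumed for illustration)."""
    background = frames.mean(axis=0, keepdims=True)
    return np.abs(frames.astype(float) - background)

# toy clip: uniform background with a bright patch moving one pixel per frame
T, H, W = 5, 8, 8
clip = np.full((T, H, W), 10.0)
for t in range(T):
    clip[t, 2, 1 + t] = 200.0

diff = frame_difference(clip)
supp = background_suppress(clip)
print(diff.shape, supp.shape)  # (4, 8, 8) (5, 8, 8)
# static pixels vanish, moving patch survives in both representations
print(supp[0, 0, 0], diff[0].max())  # 0.0 190.0
```

Both outputs are large only where something moved, which is exactly the signal the two streams are meant to consume.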
Figure 2: Input pre-processing for the proposed model.
Network Architecture
The network is composed of two parallel streams with similar architectures: one processes background-suppressed frames and the other processes frame differences. Each stream passes its input through a truncated MobileNet module and a SepConvLSTM to generate spatio-temporal features, which are then combined by a fusion layer for classification. The architecture ensures that both appearance-based and motion-based information are accounted for when classifying video clips as violent or non-violent.
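The fusion step can be sketched as follows. The paper evaluates more than one fusion variant; the concatenation-plus-pooling head below is one plausible choice, and all names, shapes, and weights here are illustrative assumptions rather than the paper's exact classifier.

```python
import numpy as np

def fuse_and_classify(feat_frames, feat_motion, w, b):
    """Illustrative fusion head: concatenate the two streams' final
    feature maps channel-wise, global-average-pool, then apply a
    single logistic unit for the binary violent/non-violent decision."""
    fused = np.concatenate([feat_frames, feat_motion], axis=-1)  # (H, W, 2C)
    pooled = fused.mean(axis=(0, 1))                             # (2C,)
    logit = pooled @ w + b
    return 1.0 / (1.0 + np.exp(-logit))  # P(violent)

rng = np.random.default_rng(1)
C = 16
feat_frames = rng.standard_normal((7, 7, C))  # from the frames stream
feat_motion = rng.standard_normal((7, 7, C))  # from the frame-difference stream
p = fuse_and_classify(feat_frames, feat_motion,
                      rng.standard_normal(2 * C), 0.0)
print(0.0 < p < 1.0)  # True: a valid probability
```

Element-wise addition or multiplication of the two feature maps would be a drop-in alternative to the concatenation shown here.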
Figure 3: The proposed model architecture consisting of two CNN-LSTM streams.
Experimental Evaluation
The proposed models were evaluated on three standard benchmark datasets for violence detection: RWF-2000, Hockey, and Movies. They surpassed previous methods in accuracy on RWF-2000 and matched state-of-the-art results on the two smaller datasets, while using significantly fewer parameters than prior approaches.
Efficiency Analysis
The paper demonstrates the efficiency of the SepConvLSTM module and its integration within the architecture relative to other models. The reduction in parameter count and in floating-point operations (FLOPs) makes the model suitable for deployment in resource-constrained environments.
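The source of the savings is the standard depthwise-separable factorization: a k×k convolution needs k·k·C_in·C_out weights, while the depthwise-plus-pointwise pair needs only k·k·C_in + C_in·C_out. A quick back-of-the-envelope check (the layer sizes below are illustrative, not taken from the paper):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases omitted)."""
    return k * k * c_in * c_out

def sepconv_params(k, c_in, c_out):
    """Depthwise (k x k per input channel) plus pointwise (1 x 1) weights."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 64
std = conv_params(k, c_in, c_out)     # 3*3*64*64 = 36864
sep = sepconv_params(k, c_in, c_out)  # 9*64 + 64*64 = 4672
print(std, sep, round(std / sep, 1))  # 36864 4672 7.9
```

Since the same factorization is applied at every gate of the ConvLSTM, the roughly 8x per-layer saving compounds across the recurrent cell.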
Qualitative Results
Qualitative results on the RWF-2000 dataset highlight the model's capability to correctly predict violence even in ambiguous scenarios. However, some failure cases were observed in situations with occluded persons or poor-quality footage, reinforcing the need for robust environmental adaptation.
Figure 4: Qualitative results of the proposed model in different video scenarios.
Conclusion
The paper presents an innovative approach to violence detection that combines efficient architectural components and novel pre-processing techniques to achieve high detection accuracy. Future work could explore deeper SepConvLSTM layers or the integration of object-level features to further enhance performance. The proposed network provides a solid foundation for practical deployments in surveillance systems and other real-time applications.