VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Published 22 Aug 2023 in cs.CV and cs.MM | (2308.11681v3)

Abstract: The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.

Summary

The paper presents VadCLIP, a system that uses a dual-branch design to combine coarse-grained visual classification with fine-grained vision-language alignment.
It reports the best performance among compared methods, achieving 84.51% AP on XD-Violence and 88.02% AUC on UCF-Crime, without any further pre-training or fine-tuning of CLIP.
The approach integrates a Local-Global Temporal Adapter and MIL-Align mechanism to effectively capture temporal relationships and optimize anomaly detection under weak supervision.

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

This paper presents VadCLIP, a paradigm that applies the pre-trained CLIP model to weakly supervised video anomaly detection (WSVAD). The authors address the challenge of transferring the capabilities of vision-language models, originally trained on image-text pairs, to the video domain and to the task of anomaly detection.

The key innovation of VadCLIP lies in its dual-branch structure, which supports both coarse-grained and fine-grained detection. One branch uses visual features alone for conventional binary classification, while the other uses vision-language alignment to exploit semantic associations between video content and textual class descriptions. The approach keeps CLIP frozen, reusing its learned knowledge without further pre-training or fine-tuning. This departs from conventional WSVAD methods, which feed extracted features directly into a weakly supervised classifier for frame-level binary classification.
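
To make the two outputs concrete, here is a minimal sketch of what each branch produces for a short clip. It is an illustration under stated assumptions, not the authors' implementation: the linear scorer, the cosine-similarity alignment, the temperature of 0.07, and the toy 2-dimensional features and class embeddings are all hypothetical stand-ins for CLIP features and the paper's actual layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_branch(frame_feats, w, b):
    """Binary branch: one anomaly confidence per frame, visual features only.
    The linear scorer (w, b) is a hypothetical placeholder."""
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, f)) + b) for f in frame_feats]

def fine_branch(frame_feats, class_embeds, temperature=0.07):
    """Alignment branch: per-frame distribution over class text embeddings.
    The temperature value is an assumption, not taken from the source."""
    return [
        softmax([cosine(f, c) / temperature for c in class_embeds])
        for f in frame_feats
    ]

# Toy data: three frames, two classes (hypothetically "normal" and "fighting").
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
classes = [[1.0, 0.0], [0.0, 1.0]]

scores = coarse_branch(frames, w=[-2.0, 2.0], b=0.0)  # coarse: anomalous or not
probs = fine_branch(frames, classes)                   # fine: which class
```

The coarse branch answers whether a frame is anomalous; the fine branch answers which category it most resembles, which is what enables fine-grained detection.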

Empirical results support the effectiveness of VadCLIP. It achieves an average precision (AP) of 84.51% on XD-Violence and an area under the curve (AUC) of 88.02% on UCF-Crime, surpassing state-of-the-art methods by a large margin. The summary attributes these gains over both weakly supervised and semi-supervised techniques to the use of cross-modal associations.
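
For readers unfamiliar with the two metrics, the sketch below computes AUC and AP from per-frame anomaly scores and binary ground-truth labels. It assumes frame-level evaluation and uses simple textbook definitions (pairwise ranking for AUC, precision averaged at each positive for AP); the labels and scores are made-up toy values, and benchmark evaluation code may handle ties and interpolation differently.

```python
def roc_auc(labels, scores):
    """Probability that a random anomalous frame outscores a random normal one.
    Ties count as half a win. Assumes both classes are present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at each anomalous frame,
    walking frames from highest to lowest score. Assumes at least one positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example: 1 marks an anomalous frame, 0 a normal frame.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

AUC rewards ranking anomalous frames above normal ones regardless of threshold, while AP puts more weight on the highest-scored frames, so the two metrics are not directly comparable to each other.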

Architecturally, VadCLIP adapts an image-text model to video, where temporal dependencies and semantic alignment both matter. Key components include the Local-Global Temporal Adapter (LGT-Adapter), which captures temporal relations across frames, and two prompt mechanisms that narrow the gap between the visual and textual domains. A learnable prompt and an anomaly-focused visual prompt refine the class embeddings with contextual information, improving the model's ability to discriminate between anomaly classes.

The MIL-Align mechanism optimizes vision-language alignment under weak supervision: it trains the alignment branch from video-level labels alone, without frame-level annotations. This extends CLIP to the video domain and suggests a template for similar adaptations in other modalities.
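
A rough sketch of how a multiple-instance alignment objective of this kind can work is shown below. It assumes a top-K pooling rule, as is common in MIL-style training: for each class, the K highest frame-to-class similarities are averaged into a video-level score, and a cross-entropy loss compares those scores with the video-level label. The function names, K, the temperature, and the toy alignment map are hypothetical; consult the released code for the exact formulation.

```python
import math

def mil_align_scores(align_map, k):
    """align_map[t][c] is the similarity between frame t and class c.
    Returns one video-level score per class: the mean of the top-k
    frame similarities for that class (assumed pooling rule)."""
    num_classes = len(align_map[0])
    scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in align_map), reverse=True)
        top = column[:k]
        scores.append(sum(top) / len(top))
    return scores

def softmax(xs, temperature=1.0):
    scaled = [x / temperature for x in xs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mil_align_loss(align_map, video_label, k, temperature=0.07):
    """Cross-entropy between pooled class scores and the video-level label.
    Only the video label is needed, never per-frame labels."""
    probs = softmax(mil_align_scores(align_map, k), temperature)
    return -math.log(probs[video_label])

# Toy alignment map: 4 frames x 3 classes (class indices are hypothetical).
align_map = [
    [0.30, 0.10, 0.05],
    [0.25, 0.40, 0.10],
    [0.20, 0.45, 0.05],
    [0.30, 0.05, 0.10],
]

video_scores = mil_align_scores(align_map, k=2)
loss_if_class_1 = mil_align_loss(align_map, video_label=1, k=2)
loss_if_class_0 = mil_align_loss(align_map, video_label=0, k=2)
```

Because only the strongest frames for each class contribute to the pooled score, the gradient concentrates on the frames most likely to contain the labeled event, which is how frame-level alignment can be learned from a video-level label.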

Looking ahead, this work opens new avenues for improving video anomaly detection by integrating state-of-the-art vision-language models. Such advances could contribute to intelligent surveillance and video analysis systems with higher detection accuracy and less dependence on extensively labeled datasets.

Future research could explore the implications of leveraging multi-modal data in open-set conditions or incorporating additional modalities, such as audio, for a more holistic understanding of video contexts necessary for precise anomaly detection. This line of investigation will be crucial for further advancing the potential of pre-trained models in complex, real-world anomaly detection scenarios.