Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly

Published 30 Apr 2024 in cs.CV and cs.AI | (2405.00181v2)

Abstract: Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos, thereby enabling applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks concentrate primarily on anomaly detection and localization, we focus on practical use, which prompts us to raise three crucial questions: "what anomaly occurred?", "why did it happen?", and "how severe is this abnormal event?". To answer them, we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of the proposed benchmark involves three sets of human annotations that indicate the "what", "why" and "how" of an anomaly: 1) anomaly type, start and end times, and event descriptions, 2) natural language explanations for the cause of the anomaly, and 3) free text reflecting the effect of the abnormality. In addition, we introduce MMEval, a novel evaluation metric designed to better align with human preferences for CUVA, making it possible to measure how well existing LLMs comprehend the underlying cause and corresponding effect of video anomalies. Finally, we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. Our code and dataset are available at https://github.com/fesvhtr/CUVA.
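
The three annotation sets described in the abstract can be pictured as one record per anomaly. The following is a minimal sketch under stated assumptions, not the schema used in the CUVA repository: the class name `AnomalyAnnotation`, all field names, the use of seconds for timestamps, and the validation rule are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnomalyAnnotation:
    """Hypothetical record for one annotated anomaly (not the official CUVA schema)."""

    # "What": anomaly type, temporal extent, and event description.
    anomaly_type: str
    start_sec: float  # assumed unit: seconds from the start of the video
    end_sec: float
    description: str
    # "Why": natural language explanation of the cause of the anomaly.
    cause: str
    # "How": free text reflecting the effect of the abnormality.
    effect: str

    def __post_init__(self) -> None:
        # Reject segments that start before zero or end before they start.
        if self.start_sec < 0 or self.end_sec < self.start_sec:
            raise ValueError("expected 0 <= start_sec <= end_sec")

    @property
    def duration_sec(self) -> float:
        """Length of the annotated anomaly segment in seconds."""
        return self.end_sec - self.start_sec
```

Grouping the "what", "why" and "how" fields in a single immutable record keeps each explanation tied to the time span it describes. The actual dataset layout should be taken from the repository linked in the abstract.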
