Video Anomaly Understanding (VAU)
- Video Anomaly Understanding (VAU) is a framework that detects, localizes, interprets, and explains anomalous events in video streams using causal reasoning.
- It leverages advanced multimodal models, hierarchical annotations, and prompt-based methodologies to achieve fine-grained semantic and temporal analysis.
- VAU systems provide practical insights for risk assessment and mitigation by integrating multi-stage reasoning and explicit explanation protocols.
Video Anomaly Understanding (VAU) encompasses the detection, localization, interpretation, and causal explanation of anomalous events in video streams. VAU systems aim not only to flag anomalous segments but also to answer human-relevant questions such as "what happened?", "why did it occur?", and "how severe is it?", which substantially extends traditional video anomaly detection (VAD). Current research leverages multimodal LLMs (MLLMs), hierarchical and context-aware benchmarks, and explicit causation-oriented evaluation protocols to move VAU beyond shallow frame-wise scoring toward explainable, risk-aware scene comprehension.
1. Scope and Objectives of Video Anomaly Understanding
VAU formalizes a set of intertwined challenges in video analysis that extend the scope of VAD:
- Detection: Identifying temporal intervals where anomalous events occur.
- Localization: Determining spatial or spatiotemporal extents of events within frames or segments.
- Semantic Interpretation: Generating natural-language descriptions of anomalies, participating entities, and contextual factors.
- Causal Reasoning: Explaining why an anomaly occurred, including the underlying triggers, object interactions, and causal chains.
- Risk Assessment: Judging the severity and potential impact of observed anomalies, and optionally proposing mitigation or action recommendations.
Benchmarks such as CUVA and ECVA organize VAU as answering What (anomaly and event type with temporal bounds and descriptions), Why (natural-language causal explanation), and How (severity or effect) for each annotated anomalous event (Du et al., 2024).
Contemporary task definitions further include unified protocols that require models to simultaneously ground anomaly intervals and provide fine-grained semantic answers, as in VAGU (Gao et al., 29 Jul 2025), FineVAU (Pereira et al., 24 Jan 2026), and VAR (Huang et al., 15 Jan 2026).
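To make the What/Why/How decomposition concrete, the following is a minimal sketch of how one annotated event might be represented in code. The class names, field names, and helper methods are hypothetical and chosen for illustration; they are not the file format of CUVA, ECVA, or any other benchmark cited here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnomalyEvent:
    """One annotated anomalous event (hypothetical schema for illustration)."""

    anomaly_type: str              # What: anomaly or event category
    interval: Tuple[float, float]  # What: (start, end) temporal bounds in seconds
    description: str               # What: free-text description of the event
    cause: str                     # Why: natural-language causal explanation
    severity: str                  # How: severity or effect of the event

    def __post_init__(self) -> None:
        start, end = self.interval
        if not 0 <= start < end:
            raise ValueError(f"invalid interval: {self.interval}")


@dataclass
class VideoAnnotation:
    """All annotated events for one video; an empty list means a normal video."""

    video_id: str
    events: List[AnomalyEvent] = field(default_factory=list)

    def is_anomalous(self) -> bool:
        return bool(self.events)

    def anomalous_duration(self) -> float:
        # Sums event lengths; overlapping events are counted more than once.
        return sum(e.interval[1] - e.interval[0] for e in self.events)


example = VideoAnnotation(
    video_id="demo_0001",
    events=[
        AnomalyEvent(
            anomaly_type="traffic_collision",
            interval=(12.0, 16.5),
            description="A car runs a red light and hits a cyclist.",
            cause="The driver did not stop at the signal.",
            severity="high",
        )
    ],
)
```

A unified protocol of the kind described above would score a model on all of these fields together, rather than on the interval or the text alone.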
2. Dataset Resources and Annotation Paradigms
The progression from VAD to VAU is driven by the introduction of richly annotated, multi-granular datasets that enable structured evaluation:
- CUVA: 1,000 real-world videos with 42 anomaly subcategories; each video annotated for anomaly type and temporal interval, cause description, and effect/severity (Du et al., 2024).
- HIVAU-70k: Over 70,000 hierarchical instruction annotations (clip, event, video) spanning UCF-Crime and XD-Violence, supporting judgment, descriptive, and causal analysis at variable granularity (Zhang et al., 2024).
- CueBench: 2,950 real-world videos annotated with an event-centric, hierarchical taxonomy (absolute and conditional anomalies), scenes, and attributes, supporting tasks from recognition through anticipation (Yu et al., 1 Nov 2025).
- FineVAU: Builds on automatic enrichment of human annotations to provide fine-grained labeling of What (events), Who (entities), and Where (locations), resulting in 1,544 clips with 17,813 events, 59,392 entities, and 7,669 location attributes (Pereira et al., 24 Jan 2026).
- VAGU: Annotations for anomaly category, semantic explanation, precise temporal intervals, and multiple-choice QA covering 21 categories and >7,500 clips (Gao et al., 29 Jul 2025).
- Pistachio: Synthetic video benchmark engineered for domain balance, temporal complexity, and fine-grained storyline annotation (41 s per clip, 31 anomaly types) to surface new VAU challenges (Li et al., 22 Nov 2025).
Annotation protocols combine manual segmentation, LLM-generated free-form annotations, structured rationales, and instruction pools to yield datasets suitable for comprehensive VAU assessment.
3. Methodologies and Model Architectures
Recent VAU methodologies synthesize advances in spatiotemporal modeling, vision-language pretraining, prompt engineering, and reinforcement learning:
- Baseline Generative Methods: Early frameworks include ConvLSTM-β-VAEs trained on normal video for unsupervised detection via reconstruction error (Waseem et al., 2022).
- Foundation MLLM Pipelines: VAU now leverages off-the-shelf MLLMs (e.g., Video-LLaMA, BLIP-2, Otter), often enhanced with domain-specific prompt engineering or LoRA adapters for efficient domain alignment (Zhang et al., 2024, Lin et al., 2 Nov 2025, Yu et al., 1 Nov 2025).
- Prompt-Based Baselines: "Hard" prompts (task-specific system/user messages) and "soft" prompts (segment selection, temporal gating) focus model attention on critical video segments for improved descriptive and causal accuracy (Du et al., 2024, Zhang et al., 2024).
- Relation-Aware Modeling: VADER incorporates object-centric relation encoding via a COntrastive Relation Encoder (CORE) and explicit context-aware event sampling to enable causal reasoning over object interactions (Cheng et al., 10 Nov 2025).
- Hierarchical and Contextual Sampling: Holmes-VAU’s Anomaly-focused Temporal Sampler (ATS) and VADTree’s Granularity-Aware Tree sample anomaly-rich and structurally significant segments, respectively, ensuring efficiency and contextual coverage (Zhang et al., 2024, Li et al., 26 Oct 2025).
- Reinforcement Fine-Tuning: VAU-R1, Cue-R1, and Vad-R1-Plus apply group-relative policy optimization to align multistage outputs (reasoning chains, temporal spans, labels) with verifiable task-specific rewards (format, structure, hierarchy, risk accuracy) (Zhu et al., 29 May 2025, Yu et al., 1 Nov 2025, Huang et al., 15 Jan 2026).
- Chain-of-Thought and Multi-Stage Reasoning: VAR and VAU-R1 employ multi-level reasoning templates and output structured Perception–Cognition–Action traces before final answers, facilitating explainable and risk-aware anomaly diagnosis (Huang et al., 15 Jan 2026, Zhu et al., 29 May 2025).
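The reinforcement fine-tuning idea in the list above can be illustrated with a small sketch of the group-relative step: several outputs are sampled for the same video and question, each receives a verifiable reward, and the rewards are standardized within the group to give per-sample advantages. The `<answer>` tag convention, the reward weights, and the function names below are assumptions made for illustration; the cited systems define their own reward terms (format, structure, hierarchy, risk accuracy).

```python
import re
from statistics import mean, pstdev
from typing import List

# Hypothetical output convention: the final label is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def verifiable_reward(output: str, gold_label: str) -> float:
    """Format reward (0.5 for a well-formed answer tag) plus accuracy reward (1.0)."""
    match = ANSWER_RE.search(output)
    if match is None:
        return 0.0  # malformed output earns neither reward
    predicted = match.group(1).strip().lower()
    accuracy = 1.0 if predicted == gold_label.strip().lower() else 0.0
    return 0.5 + accuracy


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group (mean 0, roughly unit variance)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled outputs for a single (video, question) pair.
group = [
    "<think>two people exchange blows</think><answer>Fighting</answer>",
    "<think>people are walking</think><answer>Normal</answer>",
    "The video shows a fight.",  # missing answer tag
]
rewards = [verifiable_reward(o, "fighting") for o in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate value network is needed, and a group in which every output earns the same reward yields zero advantage for all samples.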
4. Evaluation Metrics and Protocols
The complexity of VAU demands multimodal, task-aligned, and human-aligned evaluation:
| Metric | Scope | Features |
|---|---|---|
| BLEU/ROUGE | Text similarity | N-gram overlap; limited factual sensitivity |
| MMEval (Du et al., 2024) | Multimodal, causal alignment | Evaluates description, cause, and effect; video+text prompts; higher human consistency |
| FV-Score (Pereira et al., 24 Jan 2026) | Fine-grained, human-aligned | Judgement on explicit What/Who/Where tuples with learnable semantic match |
| JeAUG (Gao et al., 29 Jul 2025) | Joint semantic and temporal | Combines LLM-rated semantic explanation and interval IoU, video-length adjusted |
| Hierarchy/Structure (Yu et al., 1 Nov 2025) | Event-centric taxonomic similarity | Structure F1, semantic score, hierarchy path length, temporal IoU |
| Reasoning Quality (Huang et al., 15 Jan 2026) | Perception–Cognition–Action chain | Stage-wise accuracy, risk assessment, semantic consistency |
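The corresponding benchmark studies report that FV-Score and MMEval agree with human preference more closely than n-gram metrics or open-ended LLM ratings, while JeAUG penalizes outputs in which explanation quality and temporal localization diverge (a correct explanation attached to the wrong interval, or the reverse).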
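As an illustration of joint semantic and temporal scoring, the sketch below computes the standard temporal IoU between a predicted and a ground-truth interval and multiplies it by a semantic score in [0, 1]. The product form and the function `joint_score` are assumptions for illustration only; this is not the published JeAUG formula, and the video-length adjustment mentioned in the table is omitted.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def joint_score(semantic: float, pred: Interval, gt: Interval) -> float:
    """Hypothetical joint score: semantic rating in [0, 1] times temporal IoU."""
    if not 0.0 <= semantic <= 1.0:
        raise ValueError("semantic score must lie in [0, 1]")
    return semantic * temporal_iou(pred, gt)
```

For example, `temporal_iou((10, 20), (15, 25))` is 5 / 15, about 0.333. With a multiplicative combination, a perfect explanation paired with a non-overlapping interval scores 0, which is one simple way to penalize a mismatch between explanation and localization.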
5. Empirical Findings and Diagnostic Insights
Findings from VAU benchmarks and method evaluations include:
- Causal and Contextual Reasoning: Models augmented with explicit prompt engineering, relation encoding, and hierarchical sampling yield superior causal explanation scores (VADER: MMEval up to 66.3 on CUVA (Cheng et al., 10 Nov 2025); Holmes-VAU BLEU up to 0.804 at event level (Zhang et al., 2024)).
- Annotation and Task Granularity: Systems instruction-tuned or evaluated over hierarchical data (clip/event/video, context triplets) outperform those assessed at a single granularity, indicating the necessity for multi-scale reasoning (Yu et al., 1 Nov 2025, Zhang et al., 2024).
- Fine-Grained Understanding Gaps: The open-source LVLMs evaluated exhibit normalcy bias, struggle with brief or subtle anomalies, and frequently hallucinate non-existent events in fine-grained What scoring (event accuracy ≈12% in FineVAU (Pereira et al., 24 Jan 2026)).
- Prompt and Sampling Efficiency: Training-free test-time reasoning (e.g., VADTree, PrismVAU) achieves competitive AUC with substantial computational savings, but its causal explanation quality remains below that of fully instruction-tuned systems (Erregue et al., 6 Jan 2026, Li et al., 26 Oct 2025).
- Human-Alignment: Discordance between traditional metrics and human judges persists except where explicit grounding in event/entity/location or causal chain is enforced (Pereira et al., 24 Jan 2026).
- Multi-Modal Fusion and Decision Making: By integrating multi-modal and risk-aware reasoning, the VAR and VAU-R1 frameworks achieve the highest open-ended and task QA scores (Vad-R1-Plus MCQ accuracy 96.4% on VAR (Huang et al., 15 Jan 2026)).
6. Open Challenges and Directions
Persistent challenges include:
- Domain Generalization: Model robustness across scene domains, video sources, and anomaly types is limited; most evaluation and training regimes remain surveillance-focused (Zhang et al., 2024, Li et al., 22 Nov 2025).
- Causal and Risk Understanding: Current systems discretize risk (low/medium/high), with little support for continuous severity assessment or operationalized action recommendations (Huang et al., 15 Jan 2026).
- Annotation Cost: Hierarchical, context-rich annotation pipelines remain costly; future efforts aim for scalable semi-automatic labeling (Yu et al., 1 Nov 2025, Zhang et al., 2024).
- Real-Time Processing and Streaming: Efficient, context-aware sampling and memory-augmented architectures are needed for deployment on live streams (Li et al., 26 Oct 2025, Zhang et al., 2024).
- Fine-Grained Detection: Models universally underperform on fleeting, multi-actor, or ambiguous events; targeted fine-tuning and new curriculum strategies are active research frontiers (Pereira et al., 24 Jan 2026).
- Evaluation Standardization: New metrics steadily approach human alignment, but universal standards for open-ended VAU evaluation are still in flux (Du et al., 2024, Pereira et al., 24 Jan 2026).
Advancing VAU thus requires co-evolution of benchmarks, model architectures, and evaluation protocols that couple precise temporal/spatial grounding with explicit, interpretable, and causally structured semantic reasoning.