Complementary advantages and training strategies for hybrid VAD architectures

Determine effective methods for achieving complementary advantages when CNN-based, Transformer-based, and Mamba-based frameworks are combined into a hybrid architecture for unsupervised video anomaly detection, and develop training strategies for such hybrid models that exploit each framework's strengths while maintaining efficiency.

Background

The paper surveys three major architecture families used in unsupervised video anomaly detection: CNN-based models for local spatial feature extraction, Transformer-based models for global dependency modeling, and Mamba-based state space models for efficient long-sequence processing.
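To make the three families concrete, the following is a minimal toy sketch (not the paper's VADMamba++ architecture, and with purely illustrative kernels and fusion): each branch processes a sequence of per-frame feature vectors, with a local temporal convolution standing in for the CNN branch, softmax self-attention for the Transformer branch, and a linear recurrence scanned in O(T) for the state-space branch.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 8  # toy sequence length and feature dimension

def local_conv(x, k=3):
    """CNN-style branch: a depthwise temporal convolution captures local patterns."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    w = np.ones((k, 1)) / k  # fixed averaging kernel, for illustration only
    return np.stack([(xp[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

def self_attention(x):
    """Transformer-style branch: softmax attention models global dependencies
    across all T frames at once (O(T^2) pairwise scores)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ x

def ssm_scan(x, decay=0.9):
    """Mamba-style branch (simplified): a linear recurrence
    h_t = decay * h_{t-1} + x_t, computed with a single O(T) scan --
    the efficiency argument for state-space models on long sequences."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

x = rng.normal(size=(T, D))
# Naive additive fusion of the three branches; a real hybrid would learn
# how to route or weight them, which is exactly the open question here.
fused = local_conv(x) + self_attention(x) + ssm_scan(x)
print(fused.shape)  # (16, 8)
```

The additive fusion is the crudest possible combination; the open problem is precisely that no principled recipe exists for deciding how such branches should be composed or trained jointly.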

While VADMamba++ proposes a specific hybrid backbone integrating Transformer, Mamba, and CNN components, the authors explicitly acknowledge that, at a field level, it remains unresolved how to systematically combine these frameworks to harness their complementary strengths and how to devise optimal training strategies for such hybrids. This unresolved issue motivates broader investigation into principled model design and training protocols for hybrid VAD architectures.

References

"Despite these architectural advances, how to achieve complementary advantages across different frameworks and explore optimal training strategies for hybrid architectures remains an open challenge."

VADMamba++: Efficient Video Anomaly Detection via Hybrid Modeling in Grayscale Space (2604.00360 - Lyu et al., 1 Apr 2026), Related Work, end of section