FTDMamba: UAV Video Anomaly Detection Architecture
- FTDMamba is a video anomaly detection architecture that separates UAV-induced global motion from object-centric motion using dual-path frequency and temporal modules.
- The network employs an encoder–decoder framework augmented by the Frequency Decoupled Spatiotemporal Correlation Module (FDSCM) and Temporal Dilation Mamba Module (TDMM) to capture multi-scale features.
- Validated on the MUVAD dataset, FTDMamba achieves state-of-the-art robustness in dynamic aerial scenes by integrating FFT-based analysis and state-space sequence modeling.
The Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network is a video anomaly detection (VAD) architecture explicitly designed to address the challenges presented by unmanned aerial vehicle (UAV) video with dynamic backgrounds. It resolves the difficulties posed by coupled global (UAV-induced) and local (object) motion through parallelized frequency analysis and multi-scale temporal modeling, setting state-of-the-art (SOTA) performance benchmarks for both static and non-static aerial scenes (Liu et al., 16 Jan 2026).
1. Architectural Overview
FTDMamba implements an encoder–decoder prediction framework, augmented between the encoder and decoder by two parallel, complementary modules:
- Frequency Decoupled Spatiotemporal Correlation Module (FDSCM)
- Temporal Dilation Mamba Module (TDMM)
A four-stage Pyramid Vision Transformer encodes the input clip into hierarchical features $F$. These features are processed in parallel by FDSCM and TDMM, producing $F_{\text{freq}}$ and $F_{\text{temp}}$, which are concatenated channel-wise, projected back to the original channel dimension, and passed to a U-Net-like decoder consisting of up-convolutional blocks with skip connections. The decoder predicts the future frame $\hat{I}_{t+1}$. This dual-path strategy combines global-local motion disentanglement (via frequency-domain methods) with fine-to-coarse temporal modeling (via state-space sequence modeling).
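The fusion step between the two branches can be sketched at the shape level in NumPy. All shapes, the stand-in branch outputs, and the random projection matrix are illustrative assumptions, not the paper's learned parameters:

```python
import numpy as np

# Sketch of FTDMamba's dual-path fusion, assuming encoder features F of
# shape (B, T, C, H, W); the two branch outputs F_freq and F_temp keep
# that shape. All names and shapes here are hypothetical.
rng = np.random.default_rng(0)
B, T, C, H, W = 2, 6, 32, 8, 8
F = rng.standard_normal((B, T, C, H, W)).astype(np.float32)

F_freq = F  # stand-in for the FDSCM output
F_temp = F  # stand-in for the TDMM output

# Channel-wise concatenation doubles the channel dimension ...
fused = np.concatenate([F_freq, F_temp], axis=2)       # (B, T, 2C, H, W)

# ... then a 1x1 projection (here a random matrix) maps 2C back to C.
W_proj = rng.standard_normal((2 * C, C)).astype(np.float32) / np.sqrt(2 * C)
out = np.einsum('btchw,cd->btdhw', fused, W_proj)      # (B, T, C, H, W)
print(out.shape)
```

The projected tensor is what the U-Net-like decoder consumes; the `einsum` stands in for a learned pointwise convolution.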
2. Frequency Decoupled Spatiotemporal Correlation Module (FDSCM)
FDSCM leverages 1D and 2D Fast Fourier Transforms (FFT) at two levels:
2.1 Temporal Frequency Decoupling
Given features $F \in \mathbb{R}^{B \times T \times C \times H \times W}$ over batch $B$, time $T$, channels $C$, and spatial dimensions $H \times W$:
- Normalized frequency coordinates: $f_k = k/T$, $k = 0, \dots, T-1$
- 1D FFT along time: $\mathcal{F}_t(F)[k] = \sum_{t=0}^{T-1} F[t]\, e^{-2\pi i k t / T}$
- Amplitude spectrum: $A[k] = |\mathcal{F}_t(F)[k]|$
- Frequency-dependent weighting: each temporal band is modulated by a weight $w(f_k)$, giving $\tilde{F}[k] = w(f_k)\, \mathcal{F}_t(F)[k]$
- Inverse FFT for denoised features: $F_{\text{den}} = \mathcal{F}_t^{-1}(\tilde{F})$
This process emphasizes frequency bands that best separate global, UAV-induced motion from object-centric motion.
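The decoupling steps above can be sketched in NumPy. The low-pass band weighting `w` is a hypothetical stand-in for the module's frequency-dependent weights, and all shapes are illustrative:

```python
import numpy as np

# Temporal frequency decoupling: 1D FFT along the time axis, band
# weighting, inverse FFT. The hard low-pass weight is an assumption
# standing in for the paper's frequency-dependent weighting.
rng = np.random.default_rng(0)
B, T, C, H, W = 1, 6, 4, 8, 8
F = rng.standard_normal((B, T, C, H, W))

spec = np.fft.fft(F, axis=1)                  # 1D FFT along time
amp = np.abs(spec)                            # amplitude spectrum
freqs = np.fft.fftfreq(T)                     # normalized frequencies in [-0.5, 0.5)

w = (np.abs(freqs) <= 0.25).astype(float)     # hypothetical band weighting
spec_w = spec * w[None, :, None, None, None]  # reweight each temporal band
F_denoised = np.fft.ifft(spec_w, axis=1).real
print(F_denoised.shape)
```

With all-ones weights the round trip recovers the input exactly; the weighting is what suppresses the bands attributed to UAV-induced global motion.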
2.2 Spatiotemporal Correlation Modeling
Spatial dimensions are flattened as $N = H \times W$, giving $F \in \mathbb{R}^{B \times C \times T \times N}$.
- 2D FFT over (time, space): $\hat{F} = \mathcal{F}_{2\mathrm{D}}(F)$
- Power spectral density (PSD): $P = |\hat{F}|^2$
- Autocorrelation via inverse 2D FFT (Wiener–Khinchin): $R = \mathcal{F}_{2\mathrm{D}}^{-1}(P)$
- Attentioned feature composition: the normalized autocorrelation map reweights the features to form the module output
This yields features that capture joint global spatiotemporal dependencies, supporting effective separation of scene and object motion.
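The PSD-to-autocorrelation step rests on the Wiener–Khinchin relation, which is easy to verify numerically on a single (time, space) slice; shapes are illustrative:

```python
import numpy as np

# Autocorrelation via the Wiener-Khinchin relation: the inverse FFT of
# the power spectral density equals the circular autocorrelation.
# Shapes are illustrative (T time steps, N = H*W flattened pixels).
rng = np.random.default_rng(0)
T, N = 6, 16
x = rng.standard_normal((T, N))

spec = np.fft.fft2(x)              # 2D FFT over (time, space)
psd = np.abs(spec) ** 2            # power spectral density
R = np.fft.ifft2(psd).real         # circular autocorrelation

# Sanity check at zero lag: R[0, 0] equals the total signal energy.
print(np.allclose(R[0, 0], np.sum(x ** 2)))
```

The zero-lag entry equals the signal energy, and off-diagonal entries measure joint temporal-spatial self-similarity, which is what the module turns into an attention-like weighting.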
3. Temporal Dilation Mamba Module (TDMM)
TDMM exploits the Mamba structured state-space model, applying multi-scale and multi-scan strategies to extract temporal patterns across both short and long-range contexts.
3.1 Spatiotemporal Mamba (STMamba) Core
- Feature projection and normalization generate the input sequence $x$ and a gating signal $z$.
- Hybrid scan families:
  - Pixel-wise temporal-first: processes each pixel’s $T$-length temporal sequence
  - Patch-wise spatial-first: divides each frame into patches and tracks the spatiotemporal evolution of each patch
- Scan implementation: each of the $6$ forward + $6$ backward scan sequences is processed independently by a selective state-space (Mamba) recurrence.
- Gated summation and skip connection: the per-scan outputs are modulated by the gate $z$, summed, and added residually to the input.
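The two scan orderings amount to different flattenings of the feature tensor before the SSM recurrence. A minimal NumPy sketch, with a hypothetical tensor layout `(T, H, W, C)` and patch size `p`:

```python
import numpy as np

# Illustrative flattenings for the two scan families. The layout
# (T, H, W, C) and the patch size p are assumptions for the sketch.
T, H, W, C, p = 4, 8, 8, 3, 4
F = np.arange(T * H * W * C).reshape(T, H, W, C)

# Pixel-wise temporal-first: each pixel contributes a T-length sequence.
pix_seq = F.transpose(1, 2, 0, 3).reshape(H * W, T, C)

# Patch-wise spatial-first: split each frame into p x p patches and
# track each patch's evolution across time.
patches = F.reshape(T, H // p, p, W // p, p, C)
patch_seq = patches.transpose(1, 3, 0, 2, 4, 5).reshape(
    (H // p) * (W // p), T, p * p * C)

# A backward scan is simply the time-reversed sequence.
pix_seq_bwd = pix_seq[:, ::-1]
print(pix_seq.shape, patch_seq.shape)
```

Each resulting sequence (and its time-reversed copy) would be fed to the Mamba recurrence independently, then gated and summed.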
3.2 Multi-scale Temporal Dilation
TDMM applies STMamba over multiple temporal dilation rates $r$:
- Reversible reshaping extracts and temporally subsamples sequences at rate $r$.
- Dilation-aggregated processing: the STMamba outputs at each rate are merged into a single multi-scale temporal representation.

This combination yields representations sensitive to both slow, UAV-induced global changes (large $r$) and fast, object-centric local changes (small $r$), enhancing discrimination between normal and anomalous events in dynamic videos.
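The reversible reshaping is an interleaved subsampling: a length-$T$ sequence splits into $r$ strided subsequences and can be re-interleaved losslessly afterwards. A small sketch (the rate and sequence here are illustrative):

```python
import numpy as np

# Reversible temporal dilation: split a length-T sequence into r
# interleaved subsequences (frames t, t+r, t+2r, ...) and invert it.
def dilate(x, r):
    T = x.shape[0]
    assert T % r == 0, "sketch assumes T divisible by r"
    return np.stack([x[k::r] for k in range(r)])   # (r, T // r, ...)

def undilate(y):
    # Inverse of dilate: re-interleave the r subsequences.
    r, Tr = y.shape[:2]
    out = np.empty((r * Tr,) + y.shape[2:], dtype=y.dtype)
    for k in range(r):
        out[k::r] = y[k]
    return out

x = np.arange(8)
y = dilate(x, 2)          # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(np.array_equal(undilate(y), x))
```

Each strided subsequence sees the video at a coarser temporal sampling, which is what lets large rates emphasize slow global motion.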
4. Training Objectives and Optimization Strategy
4.1 Loss Functions
FTDMamba uses a weighted sum of three loss terms:
- Intensity loss: $L_{\mathrm{int}} = \lVert \hat{I} - I \rVert_2^2$, penalizing per-pixel differences between the predicted frame $\hat{I}$ and ground truth $I$.
- Gradient loss: measures discrepancies in horizontal and vertical image gradients, $L_{\mathrm{gd}} = \sum_{i,j} \big\lvert\, |\nabla_x \hat{I}| - |\nabla_x I| \,\big\rvert + \big\lvert\, |\nabla_y \hat{I}| - |\nabla_y I| \,\big\rvert$.
- Structural similarity loss $L_{\mathrm{ssim}}$: computed at multiple resolutions.
- Total weighted loss: $L = \lambda_{\mathrm{int}} L_{\mathrm{int}} + \lambda_{\mathrm{gd}} L_{\mathrm{gd}} + \lambda_{\mathrm{ssim}} L_{\mathrm{ssim}}$.
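The three terms can be sketched in NumPy. The multi-resolution SSIM is simplified here to a single global-window SSIM, and the weights `lam_*` are hypothetical, not the paper's values:

```python
import numpy as np

# Sketch of the three prediction losses on a predicted vs. ground-truth
# frame. Global single-window SSIM stands in for the paper's
# multi-resolution SSIM; the weights lam_* are hypothetical.
def intensity_loss(pred, gt):
    return np.mean((pred - gt) ** 2)

def gradient_loss(pred, gt):
    # Discrepancy of horizontal and vertical image gradient magnitudes.
    gx = np.abs(np.diff(pred, axis=1)) - np.abs(np.diff(gt, axis=1))
    gy = np.abs(np.diff(pred, axis=0)) - np.abs(np.diff(gt, axis=0))
    return np.mean(np.abs(gx)) + np.mean(np.abs(gy))

def ssim_global(pred, gt, c1=1e-4, c2=9e-4):
    mp, mg = pred.mean(), gt.mean()
    vp, vg = pred.var(), gt.var()
    cov = ((pred - mp) * (gt - mg)).mean()
    return ((2 * mp * mg + c1) * (2 * cov + c2)) / (
        (mp ** 2 + mg ** 2 + c1) * (vp + vg + c2))

rng = np.random.default_rng(0)
gt = rng.random((16, 16))
pred = gt + 0.01 * rng.standard_normal((16, 16))

lam_int, lam_gd, lam_ssim = 1.0, 1.0, 1.0   # hypothetical weights
total = (lam_int * intensity_loss(pred, gt)
         + lam_gd * gradient_loss(pred, gt)
         + lam_ssim * (1.0 - ssim_global(pred, gt)))
print(total)
```

SSIM enters the total as a dissimilarity ($1 - \mathrm{SSIM}$) so that all three terms are minimized together.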
4.2 Training Protocol
- Input: Six consecutive frames as context to predict the seventh.
- Preprocessing: frames are resized to a fixed resolution and pixel intensities normalized.
- Optimization: AdamW with a cosine-annealing learning-rate schedule, 200 epochs, batch size $8$ on two RTX 3090 GPUs; dataset-specific base learning rates for Drone-Anomaly/MUVAD and for UIT-ADrone.
- TDMM configuration: STMamba depth $1$, with a fixed patch size and a set of temporal dilation rates.
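For reference, the cosine-annealing schedule named above follows the standard form; `base_lr` and `eta_min` here are illustrative values, not the paper's:

```python
import math

# Standard cosine-annealing learning-rate schedule over 200 epochs.
# base_lr and eta_min are illustrative, not the paper's values.
def cosine_lr(epoch, total_epochs=200, base_lr=1e-4, eta_min=0.0):
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0), cosine_lr(100), cosine_lr(200))
```

The rate starts at `base_lr`, halves at the midpoint, and decays smoothly to `eta_min` by the final epoch.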
5. Moving UAV VAD (MUVAD) Dataset
A large-scale dataset, MUVAD, is introduced to address the lack of suitable dynamic-background UAV VAD data.
| Split | Clips | Frames | Anomalies (Events, Types) |
|---|---|---|---|
| Train | 46 | 126,254 | 0 (only normal) |
| Test | 72 | 96,482 | 240 (12 types) |
- FPS: 30, dense frame-level binary annotation of anomaly presence for test set.
- Anomaly types (Table I): Illegal lane change (21), Emergency lane violation (39), Wrong-way driving (15), Pedestrian intrusion (41), among others.
- Annotation: Multi-annotator cross-validation.
- Preprocessing: Filtering to exclude blurred, edited, non-UAV sources; resizing; normalization.
6. Empirical Performance and Analysis
6.1 Quantitative Results
FTDMamba outperforms existing methods by significant margins:
| Dataset | Micro-AUC | Macro-AUC | EER | SOTA Margin |
|---|---|---|---|---|
| Drone-Anomaly | 71.6% | 72.3% | 0.336 | +4% |
| UIT-ADrone | 70.7% | 69.5% | 0.368 | — |
| MUVAD | 71.4% | 68.4% | 0.372 | — |
FTDMamba consistently surpasses ground-surveillance (e.g., MA-PDM, VAD-Mamba) and UAV-specific baselines (ANDT, ASTT, HSTforU) in both static and dynamic scenarios.
6.2 Ablation and Component Analysis
- FDSCM: Addition increases Micro-AUC by +3.1% (UIT-ADrone), +5.8% (MUVAD). Omitting temporal frequency decoupling or spatiotemporal correlation causes 2–4% drops.
- TDMM: adding STMamba on top of FDSCM yields +5.8%/+7.2% (UIT-ADrone/MUVAD); multi-scale temporal (MST) dilation contributes a further +5.4%/+3.8%.
- Scan strategies: Hybrid pixel-temporal + patch-spatial superior to single-mode.
- STMamba depth: Depth $1$ chosen; deeper layers yield negligible accuracy gains with halved throughput.
- Parallelism: Parallel FDSCM+TDMM outperforms cascaded variants by 2.0–4.2%.
6.3 Robustness
- Gaussian noise (σ up to 100): <2% drop; σ=250 yields 63.9% AUC.
- Random occlusion (up to 30% missing frames): <3.5% performance loss; 50% missing frames: AUC 62.5–64.7%.
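The two robustness perturbations can be sketched as simple corruption functions; zeroing dropped frames and the specific parameter values are assumptions of this sketch:

```python
import numpy as np

# Sketch of the robustness perturbations: additive Gaussian noise of a
# given sigma on [0, 255] frames, and random frame dropout at a given
# missing ratio. Zeroing dropped frames is an assumption.
rng = np.random.default_rng(0)

def add_gaussian_noise(frames, sigma):
    noisy = frames + rng.normal(0.0, sigma, frames.shape)
    return np.clip(noisy, 0, 255)

def drop_frames(frames, ratio):
    T = frames.shape[0]
    drop = rng.choice(T, size=int(round(ratio * T)), replace=False)
    out = frames.copy()
    out[drop] = 0.0
    return out, drop

frames = rng.uniform(0, 255, size=(10, 4, 4))
noisy = add_gaussian_noise(frames, sigma=100)
occluded, dropped = drop_frames(frames, ratio=0.3)
print(noisy.shape, len(dropped))
```

Sweeping `sigma` and `ratio` over the ranges reported above reproduces the evaluation axes of the robustness study.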
6.4 Qualitative Insights
- Anomaly scores: rise sharply at true anomalous events and remain stable even under heavy UAV motion jitter.
- Error maps: Highlight anomalous spatiotemporal regions matching ground truth annotations.
7. Relevance and Research Implications
FTDMamba demonstrates effective integration of physics-inspired frequency decomposition with advanced state-space sequence modeling. The parallel use of FDSCM and TDMM allows explicit disentanglement of background and object motion while capturing a hierarchy of temporal correlations. The introduction of MUVAD addresses a gap in benchmarking VAD under realistic UAV dynamics, enabling future research into more generalizable models. The robust empirical results establish FTDMamba as a reference method for aerial VAD in both academic and applied contexts (Liu et al., 16 Jan 2026).