FTDMamba: UAV Video Anomaly Detection Architecture
- FTDMamba is a video anomaly detection architecture that separates UAV-induced global motion from object-centric motion using dual-path frequency and temporal modules.
- The network employs an encoder–decoder framework augmented by the Frequency Decoupled Spatiotemporal Correlation Module (FDSCM) and Temporal Dilation Mamba Module (TDMM) to capture multi-scale features.
- Validated on the MUVAD dataset, FTDMamba achieves state-of-the-art robustness in dynamic aerial scenes by integrating FFT-based analysis and state-space sequence modeling.
The Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network is a video anomaly detection (VAD) architecture explicitly designed to address the challenges presented by unmanned aerial vehicle (UAV) video with dynamic backgrounds. It resolves the difficulties posed by coupled global (UAV-induced) and local (object) motion through parallelized frequency analysis and multi-scale temporal modeling, setting state-of-the-art (SOTA) performance benchmarks for both static and non-static aerial scenes (Liu et al., 16 Jan 2026).
1. Architectural Overview
FTDMamba implements an encoder–decoder prediction framework, augmented between the encoder and decoder by two parallel, complementary modules:
- Frequency Decoupled Spatiotemporal Correlation Module (FDSCM)
- Temporal Dilation Mamba Module (TDMM)
A four-stage Pyramid Vision Transformer encodes the input clip into hierarchical features $F$. These features are processed in parallel by FDSCM and TDMM, producing $F_{\text{freq}}$ and $F_{\text{temp}}$, which are concatenated channel-wise, projected back to the original channel dimension, and passed to a U-Net-like decoder consisting of up-convolutional blocks with skip connections. The decoder predicts the future frame $\hat{I}_{t+1}$. This dual-path strategy combines global-local motion disentanglement (via frequency-domain methods) with fine-to-coarse temporal modeling (via state-space sequence modeling).
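The fusion step between the two branches can be sketched at the shape level in NumPy. All shapes, the stand-in branch outputs, and the random projection matrix are illustrative assumptions, not the paper's learned parameters:

```python
import numpy as np

# Sketch of FTDMamba's dual-path fusion, assuming encoder features F of
# shape (B, T, C, H, W); the two branch outputs F_freq and F_temp keep
# that shape. All names and shapes here are hypothetical.
rng = np.random.default_rng(0)
B, T, C, H, W = 2, 6, 32, 8, 8
F = rng.standard_normal((B, T, C, H, W)).astype(np.float32)

F_freq = F  # stand-in for the FDSCM output
F_temp = F  # stand-in for the TDMM output

# Channel-wise concatenation doubles the channel dimension ...
fused = np.concatenate([F_freq, F_temp], axis=2)       # (B, T, 2C, H, W)

# ... then a 1x1 projection (here a random matrix) maps 2C back to C.
W_proj = rng.standard_normal((2 * C, C)).astype(np.float32) / np.sqrt(2 * C)
out = np.einsum('btchw,cd->btdhw', fused, W_proj)      # (B, T, C, H, W)
print(out.shape)
```

The projected tensor is what the U-Net-like decoder consumes; the `einsum` stands in for a learned pointwise convolution.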
2. Frequency Decoupled Spatiotemporal Correlation Module (FDSCM)
FDSCM leverages 1D and 2D Fast Fourier Transforms (FFT) at two levels:
2.1 Temporal Frequency Decoupling
Given features $F \in \mathbb{R}^{B \times T \times C \times H \times W}$ over batch $B$, time $T$, channels $C$, and spatial dimensions $H \times W$:
- Normalized frequency coordinates: $f_k = k/T$, $k = 0, \dots, T-1$
- 1D FFT along time: $\mathcal{F}_t(F)[k] = \sum_{t=0}^{T-1} F[t]\, e^{-2\pi i k t / T}$
- Amplitude spectrum: $A[k] = |\mathcal{F}_t(F)[k]|$
- Frequency-dependent weighting: each temporal band is modulated by a weight $w(f_k)$, giving $\tilde{F}[k] = w(f_k)\, \mathcal{F}_t(F)[k]$
- Inverse FFT for denoised features: $F_{\text{den}} = \mathcal{F}_t^{-1}(\tilde{F})$
This process emphasizes frequency bands that best separate global, UAV-induced motion from object-centric motion.
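The decoupling steps above can be sketched in NumPy. The low-pass band weighting `w` is a hypothetical stand-in for the module's frequency-dependent weights, and all shapes are illustrative:

```python
import numpy as np

# Temporal frequency decoupling: 1D FFT along the time axis, band
# weighting, inverse FFT. The hard low-pass weight is an assumption
# standing in for the paper's frequency-dependent weighting.
rng = np.random.default_rng(0)
B, T, C, H, W = 1, 6, 4, 8, 8
F = rng.standard_normal((B, T, C, H, W))

spec = np.fft.fft(F, axis=1)                  # 1D FFT along time
amp = np.abs(spec)                            # amplitude spectrum
freqs = np.fft.fftfreq(T)                     # normalized frequencies in [-0.5, 0.5)

w = (np.abs(freqs) <= 0.25).astype(float)     # hypothetical band weighting
spec_w = spec * w[None, :, None, None, None]  # reweight each temporal band
F_denoised = np.fft.ifft(spec_w, axis=1).real
print(F_denoised.shape)
```

With all-ones weights the round trip recovers the input exactly; the weighting is what suppresses the bands attributed to UAV-induced global motion.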
2.2 Spatiotemporal Correlation Modeling
Spatial dimensions are flattened as $N = H \times W$, giving $F \in \mathbb{R}^{B \times C \times T \times N}$.
- 2D FFT over (time, space): $\hat{F} = \mathcal{F}_{2\mathrm{D}}(F)$
- Power spectral density (PSD): $P = |\hat{F}|^2$
- Autocorrelation via inverse 2D FFT (Wiener–Khinchin): $R = \mathcal{F}_{2\mathrm{D}}^{-1}(P)$
- Attentioned feature composition: the normalized autocorrelation map reweights the features to form the module output
This yields features that capture joint global spatiotemporal dependencies, supporting effective separation of scene and object motion.
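The PSD-to-autocorrelation step rests on the Wiener–Khinchin relation, which is easy to verify numerically on a single (time, space) slice; shapes are illustrative:

```python
import numpy as np

# Autocorrelation via the Wiener-Khinchin relation: the inverse FFT of
# the power spectral density equals the circular autocorrelation.
# Shapes are illustrative (T time steps, N = H*W flattened pixels).
rng = np.random.default_rng(0)
T, N = 6, 16
x = rng.standard_normal((T, N))

spec = np.fft.fft2(x)              # 2D FFT over (time, space)
psd = np.abs(spec) ** 2            # power spectral density
R = np.fft.ifft2(psd).real         # circular autocorrelation

# Sanity check at zero lag: R[0, 0] equals the total signal energy.
print(np.allclose(R[0, 0], np.sum(x ** 2)))
```

The zero-lag entry equals the signal energy, and off-diagonal entries measure joint temporal-spatial self-similarity, which is what the module turns into an attention-like weighting.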
3. Temporal Dilation Mamba Module (TDMM)
TDMM exploits the Mamba structured state-space model, applying multi-scale and multi-scan strategies to extract temporal patterns across both short and long-range contexts.
3.1 Spatiotemporal Mamba (STMamba) Core
- Feature projection and normalization generate the input sequence $x$ and a gating signal $z$.
- Hybrid scan families:
  - Pixel-wise temporal-first: processes each pixel’s $T$-length temporal sequence
  - Patch-wise spatial-first: divides each frame into patches and tracks the spatiotemporal evolution of each patch
- Scan implementation: each of the $6$ forward + $6$ backward scan sequences is processed independently by a selective state-space (Mamba) recurrence.
- Gated summation and skip connection: the per-scan outputs are modulated by the gate $z$, summed, and added residually to the input.
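The two scan orderings amount to different flattenings of the feature tensor before the SSM recurrence. A minimal NumPy sketch, with a hypothetical tensor layout `(T, H, W, C)` and patch size `p`:

```python
import numpy as np

# Illustrative flattenings for the two scan families. The layout
# (T, H, W, C) and the patch size p are assumptions for the sketch.
T, H, W, C, p = 4, 8, 8, 3, 4
F = np.arange(T * H * W * C).reshape(T, H, W, C)

# Pixel-wise temporal-first: each pixel contributes a T-length sequence.
pix_seq = F.transpose(1, 2, 0, 3).reshape(H * W, T, C)

# Patch-wise spatial-first: split each frame into p x p patches and
# track each patch's evolution across time.
patches = F.reshape(T, H // p, p, W // p, p, C)
patch_seq = patches.transpose(1, 3, 0, 2, 4, 5).reshape(
    (H // p) * (W // p), T, p * p * C)

# A backward scan is simply the time-reversed sequence.
pix_seq_bwd = pix_seq[:, ::-1]
print(pix_seq.shape, patch_seq.shape)
```

Each resulting sequence (and its time-reversed copy) would be fed to the Mamba recurrence independently, then gated and summed.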
3.2 Multi-scale Temporal Dilation
TDMM applies STMamba over multiple temporal dilation rates $r$:
- Reversible reshaping extracts and temporally subsamples sequences at rate $r$.
- Dilation-aggregated processing: the STMamba outputs at each rate are merged into a single multi-scale temporal representation.

This combination yields representations sensitive to both slow, UAV-induced global changes (large $r$) and fast, object-centric local changes (small $r$), enhancing discrimination between normal and anomalous events in dynamic videos.
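The reversible reshaping is an interleaved subsampling: a length-$T$ sequence splits into $r$ strided subsequences and can be re-interleaved losslessly afterwards. A small sketch (the rate and sequence here are illustrative):

```python
import numpy as np

# Reversible temporal dilation: split a length-T sequence into r
# interleaved subsequences (frames t, t+r, t+2r, ...) and invert it.
def dilate(x, r):
    T = x.shape[0]
    assert T % r == 0, "sketch assumes T divisible by r"
    return np.stack([x[k::r] for k in range(r)])   # (r, T // r, ...)

def undilate(y):
    # Inverse of dilate: re-interleave the r subsequences.
    r, Tr = y.shape[:2]
    out = np.empty((r * Tr,) + y.shape[2:], dtype=y.dtype)
    for k in range(r):
        out[k::r] = y[k]
    return out

x = np.arange(8)
y = dilate(x, 2)          # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(np.array_equal(undilate(y), x))
```

Each strided subsequence sees the video at a coarser temporal sampling, which is what lets large rates emphasize slow global motion.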
4. Training Objectives and Optimization Strategy
4.1 Loss Functions
FTDMamba uses a weighted sum of three loss terms:
- Intensity loss: $L_{\mathrm{int}} = \lVert \hat{I} - I \rVert_2^2$, penalizing per-pixel differences between the predicted frame $\hat{I}$ and ground truth $I$.
- Gradient loss: measures discrepancies in horizontal and vertical image gradients, $L_{\mathrm{gd}} = \sum_{i,j} \big\lvert\, |\nabla_x \hat{I}| - |\nabla_x I| \,\big\rvert + \big\lvert\, |\nabla_y \hat{I}| - |\nabla_y I| \,\big\rvert$.
- Structural similarity loss $L_{\mathrm{ssim}}$: computed at multiple resolutions.
- Total weighted loss: $L = \lambda_{\mathrm{int}} L_{\mathrm{int}} + \lambda_{\mathrm{gd}} L_{\mathrm{gd}} + \lambda_{\mathrm{ssim}} L_{\mathrm{ssim}}$.
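The three terms can be sketched in NumPy. The multi-resolution SSIM is simplified here to a single global-window SSIM, and the weights `lam_*` are hypothetical, not the paper's values:

```python
import numpy as np

# Sketch of the three prediction losses on a predicted vs. ground-truth
# frame. Global single-window SSIM stands in for the paper's
# multi-resolution SSIM; the weights lam_* are hypothetical.
def intensity_loss(pred, gt):
    return np.mean((pred - gt) ** 2)

def gradient_loss(pred, gt):
    # Discrepancy of horizontal and vertical image gradient magnitudes.
    gx = np.abs(np.diff(pred, axis=1)) - np.abs(np.diff(gt, axis=1))
    gy = np.abs(np.diff(pred, axis=0)) - np.abs(np.diff(gt, axis=0))
    return np.mean(np.abs(gx)) + np.mean(np.abs(gy))

def ssim_global(pred, gt, c1=1e-4, c2=9e-4):
    mp, mg = pred.mean(), gt.mean()
    vp, vg = pred.var(), gt.var()
    cov = ((pred - mp) * (gt - mg)).mean()
    return ((2 * mp * mg + c1) * (2 * cov + c2)) / (
        (mp ** 2 + mg ** 2 + c1) * (vp + vg + c2))

rng = np.random.default_rng(0)
gt = rng.random((16, 16))
pred = gt + 0.01 * rng.standard_normal((16, 16))

lam_int, lam_gd, lam_ssim = 1.0, 1.0, 1.0   # hypothetical weights
total = (lam_int * intensity_loss(pred, gt)
         + lam_gd * gradient_loss(pred, gt)
         + lam_ssim * (1.0 - ssim_global(pred, gt)))
print(total)
```

SSIM enters the total as a dissimilarity ($1 - \mathrm{SSIM}$) so that all three terms are minimized together.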
4.2 Training Protocol
- Input: Six consecutive frames as context to predict the seventh.
- Preprocessing: frames are resized to a fixed resolution and pixel intensities normalized.
- Optimization: AdamW with a cosine-annealing learning-rate schedule, 200 epochs, batch size $8$ on two RTX 3090 GPUs; dataset-specific base learning rates for Drone-Anomaly/MUVAD and for UIT-ADrone.
- TDMM configuration: STMamba depth $1$, with a fixed patch size and a set of temporal dilation rates.
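For reference, the cosine-annealing schedule named above follows the standard form; `base_lr` and `eta_min` here are illustrative values, not the paper's:

```python
import math

# Standard cosine-annealing learning-rate schedule over 200 epochs.
# base_lr and eta_min are illustrative, not the paper's values.
def cosine_lr(epoch, total_epochs=200, base_lr=1e-4, eta_min=0.0):
    return eta_min + 0.5 * (base_lr - eta_min) * (
        1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_lr(0), cosine_lr(100), cosine_lr(200))
```

The rate starts at `base_lr`, halves at the midpoint, and decays smoothly to `eta_min` by the final epoch.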
5. Moving UAV VAD (MUVAD) Dataset
A large-scale dataset, MUVAD, is introduced to address the lack of suitable dynamic-background UAV VAD data.
| Split | Clips | Frames | Anomalies (Events, Types) |
|---|---|---|---|
| Train | 46 | 126,254 | 0 (only normal) |
| Test | 72 | 96,482 | 240 (12 types) |
- FPS: 30, dense frame-level binary annotation of anomaly presence for test set.
- Anomaly types (Table I): Illegal lane change (21), Emergency lane violation (39), Wrong-way driving (15), Pedestrian intrusion (41), among others.
- Annotation: Multi-annotator cross-validation.
- Preprocessing: Filtering to exclude blurred, edited, non-UAV sources; resizing; normalization.
6. Empirical Performance and Analysis
6.1 Quantitative Results
FTDMamba outperforms existing methods by significant margins:
| Dataset | Micro-AUC | Macro-AUC | EER | SOTA Margin |
|---|---|---|---|---|
| Drone-Anomaly | 71.6% | 72.3% | 0.336 | +4% |
| UIT-ADrone | 70.7% | 69.5% | 0.368 | — |
| MUVAD | 71.4% | 68.4% | 0.372 | — |
FTDMamba consistently surpasses ground-surveillance (e.g., MA-PDM, VAD-Mamba) and UAV-specific baselines (ANDT, ASTT, HSTforU) in both static and dynamic scenarios.
6.2 Ablation and Component Analysis
- FDSCM: Addition increases Micro-AUC by +3.1% (UIT-ADrone), +5.8% (MUVAD). Omitting temporal frequency decoupling or spatiotemporal correlation causes 2–4% drops.
- TDMM: adding STMamba on top of FDSCM yields +5.8%/+7.2% (UIT-ADrone/MUVAD); multi-scale temporal (MST) dilation contributes a further +5.4%/+3.8%.
- Scan strategies: Hybrid pixel-temporal + patch-spatial superior to single-mode.
- STMamba depth: Depth $1$ chosen; deeper layers yield negligible accuracy gains with halved throughput.
- Parallelism: Parallel FDSCM+TDMM outperforms cascaded variants by 2.0–4.2%.
6.3 Robustness
- Gaussian noise (σ up to 100): <2% drop; σ=250 yields 63.9% AUC.
- Random occlusion (up to 30% missing frames): <3.5% performance loss; 50% missing frames: AUC 62.5–64.7%.
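The two robustness perturbations can be sketched as simple corruption functions; zeroing dropped frames and the specific parameter values are assumptions of this sketch:

```python
import numpy as np

# Sketch of the robustness perturbations: additive Gaussian noise of a
# given sigma on [0, 255] frames, and random frame dropout at a given
# missing ratio. Zeroing dropped frames is an assumption.
rng = np.random.default_rng(0)

def add_gaussian_noise(frames, sigma):
    noisy = frames + rng.normal(0.0, sigma, frames.shape)
    return np.clip(noisy, 0, 255)

def drop_frames(frames, ratio):
    T = frames.shape[0]
    drop = rng.choice(T, size=int(round(ratio * T)), replace=False)
    out = frames.copy()
    out[drop] = 0.0
    return out, drop

frames = rng.uniform(0, 255, size=(10, 4, 4))
noisy = add_gaussian_noise(frames, sigma=100)
occluded, dropped = drop_frames(frames, ratio=0.3)
print(noisy.shape, len(dropped))
```

Sweeping `sigma` and `ratio` over the ranges reported above reproduces the evaluation axes of the robustness study.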
6.4 Qualitative Insights
- Anomaly scores: rise sharply at true anomalous events and remain stable even under heavy UAV motion jitter.
- Error maps: Highlight anomalous spatiotemporal regions matching ground truth annotations.
7. Relevance and Research Implications
FTDMamba demonstrates effective integration of physics-inspired frequency decomposition with advanced state-space sequence modeling. The parallel use of FDSCM and TDMM allows explicit disentanglement of background and object motion while capturing a hierarchy of temporal correlations. The introduction of MUVAD addresses a gap in benchmarking VAD under realistic UAV dynamics, enabling future research into more generalizable models. The robust empirical results establish FTDMamba as a reference method for aerial VAD in both academic and applied contexts (Liu et al., 16 Jan 2026).