
FTDMamba: UAV Video Anomaly Detection Architecture

Updated 23 January 2026
  • FTDMamba is a video anomaly detection architecture that separates UAV-induced global motion from object-centric motion using dual-path frequency and temporal modules.
  • The network employs an encoder–decoder framework augmented by the Frequency Decoupled Spatiotemporal Correlation Module (FDSCM) and Temporal Dilation Mamba Module (TDMM) to capture multi-scale features.
  • Validated on the MUVAD dataset, FTDMamba achieves state-of-the-art robustness in dynamic aerial scenes by integrating FFT-based analysis and state-space sequence modeling.

The Frequency-Assisted Temporal Dilation Mamba (FTDMamba) network is a video anomaly detection (VAD) architecture explicitly designed to address the challenges presented by unmanned aerial vehicle (UAV) video with dynamic backgrounds. It resolves the difficulties posed by coupled global (UAV-induced) and local (object) motion through parallelized frequency analysis and multi-scale temporal modeling, setting state-of-the-art (SOTA) performance benchmarks for both static and non-static aerial scenes (Liu et al., 16 Jan 2026).

1. Architectural Overview

FTDMamba implements an encoder–decoder prediction framework, augmented between the encoder and decoder by two parallel, complementary modules:

  • Frequency Decoupled Spatiotemporal Correlation Module (FDSCM)
  • Temporal Dilation Mamba Module (TDMM)

A four-stage Pyramid Vision Transformer encodes video input into hierarchical features $f_i \in \mathbb{R}^{B \times T \times C \times H \times W}$. These features are processed in parallel by FDSCM and TDMM, producing $\bar f_i$ and $\tilde f_i$, which are then concatenated channel-wise, projected back to the original dimension, and passed to a U-Net-like decoder consisting of up-convolutional blocks with skip connections. The decoder predicts the future frame $\hat Y$. This dual-path strategy combines global-local motion disentanglement (via frequency-domain methods) with fine-to-coarse temporal modeling (via state-space sequence modeling).
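The bottleneck fusion step can be sketched in NumPy as channel-wise concatenation followed by a projection back to $C$ channels. This is a minimal illustration only: the random $1{\times}1$ projection stands in for the learned layer, and all shapes and names here are our own, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, C, H, W = 1, 4, 8, 16, 16
f_bar = rng.standard_normal((B, T, C, H, W))    # FDSCM branch output
f_tilde = rng.standard_normal((B, T, C, H, W))  # TDMM branch output

# Channel-wise concatenation of the two parallel branches ...
fused = np.concatenate([f_bar, f_tilde], axis=2)      # (B, T, 2C, H, W)

# ... followed by a projection back to C channels; a random matrix
# stands in for the learned projection layer
P = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)
out = np.einsum("oc,btchw->btohw", P, fused)          # (B, T, C, H, W)
```

The projected features `out` would then be handed to the U-Net-like decoder.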

2. Frequency Decoupled Spatiotemporal Correlation Module (FDSCM)

FDSCM leverages 1D and 2D Fast Fourier Transforms (FFT) at two levels:

2.1 Temporal Frequency Decoupling

Given features $f(b,t,c,h,w)$ over batch $B$, time $T$, channels $C$, and spatial dimensions $H \times W$:

  1. Normalized frequency coordinates:

$$l_k = \begin{cases} \tfrac{k}{T}, & 0 \le k \le \lfloor T/2 \rfloor \\ \tfrac{k-T}{T}, & \lfloor T/2 \rfloor < k < T \end{cases}$$

  2. 1D FFT along time:

$$\hat f_k(b,c,h,w) = \sum_{t=0}^{T-1} f(b,t,c,h,w)\, e^{-j 2\pi k t / T}$$

  3. Amplitude spectrum:

$$A_k(b,c,h,w) = |\hat f_k| = \sqrt{\Re(\hat f_k)^2 + \Im(\hat f_k)^2}$$

  4. Frequency-dependent weighting:

$$w_k(b,c,h,w) = l_k^2\, A_k(b,c,h,w)^2, \qquad \hat f'_k = \hat f_k \cdot w_k$$

  5. Inverse FFT for denoised features:

$$f'(b,t,c,h,w) = \frac{1}{T} \sum_{k=0}^{T-1} \hat f'_k(b,c,h,w)\, e^{+j 2\pi k t / T}$$

This process emphasizes frequency bands that best separate global, UAV-induced motion from object-centric motion.
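The five steps above can be sketched in NumPy. The function name is our own, and the frequency coordinates come from `np.fft.fftfreq`, whose output differs from the paper's $l_k$ only in the sign of the Nyquist term, which the squaring in step 4 removes.

```python
import numpy as np

def temporal_frequency_decouple(f):
    """Frequency-dependent reweighting along the time axis, a sketch of
    FDSCM's temporal decoupling. `f` has shape (B, T, C, H, W)."""
    T = f.shape[1]
    # Step 2: 1D FFT along time
    f_hat = np.fft.fft(f, axis=1)
    # Step 1: normalized frequency coordinates l_k (k/T, or (k-T)/T past T/2)
    l = np.fft.fftfreq(T)
    # Step 3: amplitude spectrum
    A = np.abs(f_hat)
    # Step 4: frequency-dependent weighting w_k = l_k^2 * A_k^2
    w = (l.reshape(1, T, 1, 1, 1) ** 2) * A ** 2
    # Step 5: inverse FFT back to the time domain (imaginary part is ~0)
    return np.fft.ifft(f_hat * w, axis=1).real

# Toy input: batch 1, 8 frames, 1 channel, 4x4 spatial grid
f = np.random.default_rng(0).standard_normal((1, 8, 1, 4, 4))
f_prime = temporal_frequency_decouple(f)
```

Note that $l_0 = 0$ zeroes the DC component, so the reweighted features have zero temporal mean; static background content is suppressed by construction.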

2.2 Spatiotemporal Correlation Modeling

Spatial dimensions are flattened as $s = 0, \ldots, S-1$, with $S = H \cdot W$.

  1. 2D FFT over (time, space):

$$\hat F'(f_t, f_s) = \sum_{t=0}^{T-1} \sum_{s=0}^{S-1} f'(t,s)\, e^{-j 2\pi (f_t t / T + f_s s / S)}$$

  2. Power spectral density (PSD):

$$S_{f'}(f_t, f_s) = |\hat F'(f_t, f_s)|^2$$

  3. Autocorrelation via inverse 2D FFT:

$$R_{f'}(t,s) = \Re\Bigl\{ \frac{1}{TS} \sum_{f_t=0}^{T-1} \sum_{f_s=0}^{S-1} S_{f'}(f_t, f_s)\, e^{+j 2\pi (f_t t / T + f_s s / S)} \Bigr\}$$

  4. Attentioned feature composition:

$$\bar f = f' + R_{f'} \odot f'$$

This yields features that capture joint global spatiotemporal dependencies, supporting effective separation of scene and object motion.
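A minimal NumPy sketch of this correlation step, with batch and channel axes dropped for clarity (the function name is ours). The inverse FFT of the PSD yields the autocorrelation by the Wiener-Khinchin theorem, exactly as in steps 2-3 above.

```python
import numpy as np

def spatiotemporal_correlation(f_prime):
    """Autocorrelation-weighted attention, a sketch of FDSCM's correlation
    modeling. `f_prime` has shape (T, S), spatial dims already flattened."""
    # Step 1: 2D FFT over (time, space)
    F = np.fft.fft2(f_prime)
    # Step 2: power spectral density
    S_f = np.abs(F) ** 2
    # Step 3: autocorrelation via inverse 2D FFT (Wiener-Khinchin)
    R = np.fft.ifft2(S_f).real
    # Step 4: attentioned composition f_bar = f' + R ⊙ f'
    return f_prime + R * f_prime

rng = np.random.default_rng(1)
fp = rng.standard_normal((8, 16))   # T=8 frames, S=4*4 flattened pixels
f_bar = spatiotemporal_correlation(fp)
```

As a sanity check, the zero-lag autocorrelation $R_{f'}(0,0)$ equals the total signal energy $\sum f'^2$ (Parseval's theorem), so the attention map is largest where features repeat coherently across time and space.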

3. Temporal Dilation Mamba Module (TDMM)

TDMM exploits the Mamba structured state-space model, applying multi-scale and multi-scan strategies to extract temporal patterns across both short and long-range contexts.

3.1 Spatiotemporal Mamba (STMamba) Core

  1. Feature projection and normalization generate input $X$ and gating $Z$.
  2. Hybrid scan families:
    • Pixel-wise temporal-first: processes each pixel's $T$-length temporal sequence.
    • Patch-wise spatial-first: divides each frame into $P \times P$ patches and tracks the spatiotemporal evolution of each.
  3. Scan implementation: each of the 6 forward + 6 backward scan sequences $S_i$ is processed as

$$\bar X_i = \mathrm{reshape}\bigl(\mathrm{SSM}(\mathrm{Conv1d}(S_i))\bigr)$$

  4. Gated summation and skip connection:

$$\mathrm{out} = \mathrm{Linear}\Bigl(\sum_{i=1}^{12} \sigma(Z) \odot \bar X_i\Bigr) + f$$

where $\sigma = \mathrm{SiLU}$.
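A toy sketch of the gated multi-scan summation. A simple linear recurrence stands in for the Mamba SSM, and identity maps replace the learned $X$/$Z$ projections and the Conv1d/Linear layers, so this illustrates only the scan-reorder-gate-sum pattern, not the actual STMamba.

```python
import numpy as np

def ssm_scan(x, a=0.9):
    """Placeholder for the selective SSM: causal recurrence h_t = a*h_{t-1} + x_t."""
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = a * h + x_t
        out.append(h)
    return np.stack(out)

def silu(z):
    return z / (1.0 + np.exp(-z))

def stmamba_core(f, scans):
    """Gated summation over multiple scan orderings. `scans` is a list of
    index permutations; real STMamba uses 6 forward + 6 backward pixel-wise
    and patch-wise scans plus learned projections."""
    X, Z = f, f                             # identity stand-ins for projections
    acc = np.zeros_like(f)
    for order in scans:
        S_i = X[order]                      # reorder into the scan sequence
        X_bar = np.empty_like(S_i)
        X_bar[order] = ssm_scan(S_i)        # scan, then undo the reordering
        acc += silu(Z) * X_bar              # gated summation
    return acc + f                          # skip connection (Linear omitted)

T = 6
f = np.random.default_rng(2).standard_normal((T, 4))
fwd = np.arange(T)
out = stmamba_core(f, [fwd, fwd[::-1]])     # one forward + one backward scan
```

Running forward and backward scans over the same sequence gives every timestep both past and future context, which is why the paper pairs each of its 6 scan orders with its reverse.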

3.2 Multi-scale Temporal Dilation

TDMM applies STMamba over multiple temporal dilation rates $\eta \in \{1, 2, 3\}$:

  1. Reversible reshaping $\Phi_\eta$ extracts temporally subsampled subsequences at rate $\eta$.
  2. Dilation-aggregated processing:

$$\tilde f = \mathrm{STMamba}\Bigl(\sum_{\eta \in \{1,2,3\}} \Phi_\eta^{-1}\bigl(\mathrm{STMamba}(\Phi_\eta(f))\bigr)\Bigr)$$

This combination yields representations sensitive to both slow, UAV-induced global changes (large $\eta$) and fast, object-centric local changes (small $\eta$), enhancing discrimination between normal and anomalous events in dynamic videos.
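The reversible reshaping $\Phi_\eta$ can be illustrated as strided splitting and interleaving. `phi` and `phi_inv` are our own names for this sketch; in the real module, STMamba is applied to each subsequence between the two, per the aggregation formula above.

```python
import numpy as np

def phi(f, eta):
    """Φ_η sketch: split a length-T sequence into eta subsequences of
    stride eta (temporal subsampling at rate eta)."""
    T = f.shape[0]
    assert T % eta == 0, "T must be divisible by the dilation rate"
    return [f[r::eta] for r in range(eta)]

def phi_inv(subseqs, eta):
    """Inverse of phi: interleave the subsequences back into one sequence."""
    T = sum(s.shape[0] for s in subseqs)
    out = np.empty((T,) + subseqs[0].shape[1:])
    for r, s in enumerate(subseqs):
        out[r::eta] = s
    return out

# Reversibility check for each dilation rate used by TDMM
f = np.arange(12.0).reshape(12, 1)
for eta in (1, 2, 3):
    assert np.array_equal(phi_inv(phi(f, eta), eta), f)
```

At $\eta = 3$, each subsequence sees every third frame, so the same STMamba effectively operates at one third of the frame rate, capturing the slower, UAV-induced motion.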

4. Training Objectives and Optimization Strategy

4.1 Loss Functions

FTDMamba uses a weighted sum of three loss terms:

  • Intensity loss:

$$L_{int}(\hat Y, Y) = \|\hat Y - Y\|_2^2$$

  • Gradient loss: measures discrepancies in horizontal and vertical image gradients:

$$L_{grl}(\hat Y, Y) = \sum_{i,j} \bigl\lVert\, |\hat Y_{i,j} - \hat Y_{i-1,j}| - |Y_{i,j} - Y_{i-1,j}| \,\bigr\rVert_1 + \sum_{i,j} \bigl\lVert\, |\hat Y_{i,j} - \hat Y_{i,j-1}| - |Y_{i,j} - Y_{i,j-1}| \,\bigr\rVert_1$$

  • Structural similarity loss $L_{ssim}$: computed at multiple resolutions.
  • Total weighted loss:

$$L = \alpha\, L_{int} + \beta\, L_{grl} + \gamma\, L_{ssim}$$

4.2 Training Protocol

  • Input: six consecutive frames as context to predict the seventh.
  • Preprocessing: resizing to $256 \times 256$, pixel normalization to $[-1, 1]$.
  • Optimization: AdamW with a cosine-annealing learning-rate schedule, 200 epochs, batch size 8 on two RTX 3090 GPUs. Learning rates: $5 \times 10^{-5}$ (Drone-Anomaly, MUVAD), $1 \times 10^{-4}$ (UIT-ADrone).
  • TDMM: STMamba depth 1, patch size $P = 4$, dilations $\{1, 2, 3\}$.
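The pixel-normalization step above can be written directly as a one-line mapping from uint8 pixels to $[-1, 1]$ (a sketch; resizing to $256 \times 256$ is left to an image library such as OpenCV or Pillow):

```python
import numpy as np

def normalize_frame(frame_u8):
    """Map uint8 pixels in [0, 255] to float32 values in [-1, 1],
    matching the stated preprocessing."""
    return frame_u8.astype(np.float32) / 127.5 - 1.0

frame = np.array([[0, 128], [255, 64]], dtype=np.uint8)
norm = normalize_frame(frame)
```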

5. Moving UAV VAD (MUVAD) Dataset

A large-scale dataset, MUVAD, is introduced to address the lack of suitable dynamic-background UAV VAD data.

| Split | Clips | Frames  | Anomalies (Events, Types) | Resolution |
|-------|-------|---------|---------------------------|------------|
| Train | 46    | 126,254 | 0 (only normal)           | 852×480    |
| Test  | 72    | 96,482  | 240 (12 types)            | 852×480    |
  • FPS: 30, dense frame-level binary annotation of anomaly presence for test set.
  • Anomaly types (Table I): Illegal lane change (21), Emergency lane violation (39), Wrong-way driving (15), Pedestrian intrusion (41), among others.
  • Annotation: Multi-annotator cross-validation.
  • Preprocessing: Filtering to exclude blurred, edited, non-UAV sources; resizing; normalization.

6. Empirical Performance and Analysis

6.1 Quantitative Results

FTDMamba outperforms existing methods by significant margins:

| Dataset       | Micro-AUC | Macro-AUC | EER   | SOTA Margin |
|---------------|-----------|-----------|-------|-------------|
| Drone-Anomaly | 71.6%     | 72.3%     | 0.336 | +4%         |
| UIT-ADrone    | 70.7%     | 69.5%     | 0.368 |             |
| MUVAD         | 71.4%     | 68.4%     | 0.372 |             |

FTDMamba consistently surpasses ground-surveillance (e.g., MA-PDM, VAD-Mamba) and UAV-specific baselines (ANDT, ASTT, HSTforU) in both static and dynamic scenarios.

6.2 Ablation and Component Analysis

  • FDSCM: Addition increases Micro-AUC by +3.1% (UIT-ADrone), +5.8% (MUVAD). Omitting temporal frequency decoupling or spatiotemporal correlation causes 2–4% drops.
  • TDMM: STMamba (with FDSCM) adds +5.8%/+7.2%. Multi-scale temporal (MST) dilation provides further +5.4%/+3.8%.
  • Scan strategies: Hybrid pixel-temporal + patch-spatial superior to single-mode.
  • STMamba depth: depth 1 chosen; deeper layers yield negligible accuracy gains with halved throughput.
  • Parallelism: Parallel FDSCM+TDMM outperforms cascaded variants by 2.0–4.2%.

6.3 Robustness

  • Gaussian noise (σ up to 100): <2% drop; σ=250 yields 63.9% AUC.
  • Random occlusion (up to 30% missing frames): <3.5% performance loss; 50% missing frames: AUC 62.5–64.7%.

6.4 Qualitative Insights

  • Anomaly scores: Correlated sharply with actual anomalies, robust even under heavy UAV motion jitter.
  • Error maps: Highlight anomalous spatiotemporal regions matching ground truth annotations.

7. Relevance and Research Implications

FTDMamba demonstrates effective integration of physics-inspired frequency decomposition with advanced state-space sequence modeling. The parallel use of FDSCM and TDMM allows explicit disentanglement of background and object motion while capturing a hierarchy of temporal correlations. The introduction of MUVAD addresses a gap in benchmarking VAD under realistic UAV dynamics, enabling future research into more generalizable models. The robust empirical results establish FTDMamba as a reference method for aerial VAD in both academic and applied contexts (Liu et al., 16 Jan 2026).
