Dual-Dynamic Tracking (UAV-Anti-UAV)

Updated 22 January 2026
  • Dual-Dynamic Tracking is a framework that addresses the challenge of real-time localization and pursuit when both the observer and target UAVs are dynamically maneuvering.
  • The approach incorporates transformer-based spatial-temporal modeling, evidential detection, and hybrid hardware strategies to manage occlusion, rapid motion, and distractor-rich environments.
  • Extensive datasets and benchmarks have been developed to assess tracking performance, highlighting current limitations and guiding future research in UAV anti-drone defense systems.

Dual-Dynamic Tracking (UAV-Anti-UAV) seeks to address the challenge of robust, real-time localization and pursuit of adversarial aerial targets when both the observing platform (pursuer UAV) and the target (evasive UAV) are dynamically maneuvering. Unlike classical ground-based anti-UAV settings or standard single-dynamic tracking, dual-dynamic tracking requires algorithms and systems to withstand complex, non-stationary backgrounds, abrupt viewpoint and scale changes, rapid motion, occlusion, and distractor-rich scenarios. Recent research formalizes the UAV-Anti-UAV problem, benchmarks its unique challenges via large-scale datasets, and proposes architectural innovations—for both learning-based and embedded hardware approaches—that jointly reason about temporal, spatial, and semantic context under adversarial aerial dynamics (Zhang et al., 8 Dec 2025, Zhu et al., 2023, Lu et al., 12 Dec 2025).

1. Formal Problem Definition and Unique Challenges

Let $\{I_t\}_{t=1}^T$ be a video sequence acquired from a pursuing UAV, and let $\mathbf{z}_1 = [x_1, y_1, w_1, h_1]$ denote the initial ground-truth bounding box of the adversarial target UAV. The task is to estimate, for each $t \geq 2$, a bounding box $\widehat{\mathbf{z}}_t$ that maintains tight spatial overlap (high IoU) with the ground truth as both observer and target undergo rapid, unpredictable 3D motion.

Distinguishing characteristics:

  • Dual Agent Dynamics: Both the sensing UAV and the adversarial UAV exhibit non-linear, unconstrained flight paths, inducing severe compounded ego-motion, multidimensional parallax, and scene contextual drift.
  • Small, Fast Targets: The target UAV typically occupies <1% of the frame in 30% of samples and exhibits extreme velocity (relative speed μ=0.79 image diagonals/frame).
  • Frequent Occlusions and Out-of-View: Full/partial occlusion, target reappearance, and viewpoint changes are common; re-detection with no fixed template frame is compulsory.
  • Clutter, Distractors, and Unaligned Modalities: Backgrounds vary from urban to rural, with similar distractors and multi-modal sensor streams (RGB, IR, language).
  • Real-Time, Resource-Constrained Requirements: Tracking must operate at ≥30fps, typically without per-frame template re-initialization or excessive compute.
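
The "image diagonals/frame" velocity statistic quoted above can be computed from per-frame box centers; a minimal sketch (the helper name and inputs are illustrative, not from the cited papers):

```python
import math

def relative_speed(centers, width, height):
    """Mean per-frame target displacement, normalized by the image diagonal.

    centers: list of (x, y) bounding-box centers, one per frame.
    Returns speed in image diagonals per frame, matching the unit used to
    report relative target velocity.
    """
    diag = math.hypot(width, height)
    steps = [
        math.hypot(x2 - x1, y2 - y1) / diag
        for (x1, y1), (x2, y2) in zip(centers, centers[1:])
    ]
    return sum(steps) / len(steps) if steps else 0.0
```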

Formally, tracking is cast as a spatio-temporal sequence estimation in which both input and target state distributions are non-stationary:

$$\widehat{\mathbf{z}}_t = f_{\text{track}}(I_t, \mathbf{H}_{t-1}, L; w)$$

where $\mathbf{H}_{t-1}$ is accumulated temporal memory (motion/appearance), $L$ is the sequence-level language prompt describing the scenario, and $w$ are the network parameters (Zhang et al., 8 Dec 2025).
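
As a minimal sketch of this interface, with a stand-in `model` callable taking the place of the learned network $w$ (all names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class TrackerState:
    """Accumulated temporal memory H_{t-1} (e.g., motion/appearance tokens)."""
    memory: list = field(default_factory=list)

def f_track(frame, state, prompt, model):
    """One step of z_t = f_track(I_t, H_{t-1}, L; w).

    The model predicts a box [x, y, w, h] from the current frame, the
    accumulated memory, and the sequence-level language prompt; the
    prediction is then folded back into the memory for the next step.
    """
    box = model(frame, state.memory, prompt)
    state.memory.append(box)
    return box, state
```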

2. Datasets and Task Benchmarking

UAV-Anti-UAV benchmark (Zhang et al., 8 Dec 2025):

  • 1,810 video sequences (≈1.05 million frames; 9.85 hours)
  • RGB frames with contiguous bounding-box annotations
  • One sequence-level language prompt per video
  • 15 per-sequence/per-frame attributes: Camera Motion (CM), Fast Motion (FM), Small Object (SO), Similar Distractor (SD), Out-of-View (OV), Occlusion (PO/FO), Illumination Variation (IV), Scale/Aspect-Ratio Variation (SV/ARV), Motion Blur (MB), etc.
  • High diversity: average sequence length of 578 frames (max 17,740), with a pronounced statistical skew toward fast, small, partially or fully occluded targets

AntiUAV600 (Zhu et al., 2023):

  • 600 thermal IR videos (640×512, 25 fps, >723,000 annotated frames)
  • Per-frame challenge tags: Out-of-View, Occlusion, Fast Motion, Scale Variation, etc.
  • No fixed appearance template; the target may disappear and reappear at arbitrary frames

Anti-UAV RGB/IR benchmark (Jiang et al., 2021):

  • 318 RGB/IR video pairs at 25 fps; 585,900 annotated boxes
  • Binary frame/sequence attributes; highly challenging for robust generalization

Benchmarking Metrics

  • IoU-based AUC (area under IoU-threshold curve) and mean accuracy (mACC)
  • Precision/Success: Center-error thresholds, area-under-curve
  • State Accuracy (SA) and “Acc” for dual-dynamic anti-UAV: combine per-frame IoU on target-present frames with credit for correctly reported absences, penalizing missed detections
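
The metrics above can be sketched in a few lines; the absence-credit convention in `state_accuracy` follows the description above, while function names and exact conventions are illustrative:

```python
def iou(a, b):
    """IoU of two boxes in [x, y, w, h] format."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(ious, thresholds=None):
    """Area under the success curve: fraction of frames with IoU >= tau,
    averaged over thresholds tau in [0, 1]."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(101)]
    return sum(
        sum(v >= t for v in ious) / len(ious) for t in thresholds
    ) / len(thresholds)

def state_accuracy(preds, gts):
    """Per-frame score: IoU when the target is present; full credit when the
    tracker correctly reports absence (both pred and gt are None); zero for
    a missed detection or a false report of absence."""
    scores = []
    for p, g in zip(preds, gts):
        if g is None:
            scores.append(1.0 if p is None else 0.0)
        else:
            scores.append(iou(p, g) if p is not None else 0.0)
    return sum(scores) / len(scores)
```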

3. Dual-Dynamic Tracking Architectures

MambaSTS (Zhang et al., 8 Dec 2025):

  • Visual Backbone: Transformer-based model (HiViT) hierarchically encodes both template and search regions
  • Temporal Modeling: Discrete linear time-invariant state-space model (SSM) (Mamba) evolves a hidden state vector:

$$\mathbf{h}_t = A \mathbf{h}_{t-1} + B \mathbf{u}_t, \qquad \mathbf{y}_t = C \mathbf{h}_t + D \mathbf{v}_t$$

  • Language Integration: Mamba-encoded representations from language prompts propagate scenario-level semantic priors through SSM scanning
  • Temporal Token Propagation: Past search features serialized via unidirectional scan inform per-frame context in an autoregressive manner
  • Multi-Head Tracking Head: Outputs class, offset, and size branches with a composite loss (focal classification, $L_1$, and generalized IoU)
  • No online fine-tuning in inference; purely “in-distribution” tracking
EDTC (Zhu et al., 2023):

  • Detection–Tracking Alternation: YOLOv5s global detector runs on every frame; on a positive hypothesis, a local transformer tracker is initialized
  • Tracker: Siamese-style backbone with Relevance Decoupling Modules (RDM); cross- and self-attention fuse instance and appearance context
  • Uncertainty Quantification (“Evidential Head”): After tracking, evidence is pooled; Dirichlet-distributed probabilities yield per-frame target/background likelihoods and a global uncertainty $u = K/S$
  • Evidential Loop: Tracking continues if confidence $\hat{p}_{\text{target}} \ge 0.5$ and $u \le \theta_{\text{eh}}$; otherwise, revert to detection
  • Dempster–Shafer Belief Formalism: Detection and tracking operate as independent evidence sources, enabling robust recovery under drift, occlusion, and re-entry
Hybrid hardware tracker (Lu et al., 12 Dec 2025):

  • Frame-Based and Event-Driven Modes: Adaptive switching on Region Proposal (RP) area and velocity; frame mode for periodic broad detection, event mode for low-latency fast targets
  • Fast Object Tracking Unit (FOTU): Parallelized trajectory monitors for high-velocity targets; enforces bounding-box update adaptivity via

$$\text{TH} = b + W_a \cdot \text{Area} + W_s \cdot \text{Speed}$$

  • Neural Processing Unit (NPU): Custom 16×16 PE array, zero-skipping MACs, dual support for image patch and trajectory inference
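
The discrete LTI state-space recurrence and unidirectional scan described above can be sketched as follows (matrix shapes and the NumPy implementation are illustrative, not the papers' code):

```python
import numpy as np

def ssm_step(h, u, v, A, B, C, D):
    """One step of the discrete LTI state-space recurrence:
    h_t = A h_{t-1} + B u_t,  y_t = C h_t + D v_t."""
    h = A @ h + B @ u
    y = C @ h + D @ v
    return h, y

def scan(us, vs, A, B, C, D, h0):
    """Unidirectional scan over a token sequence, autoregressively
    propagating the hidden state (as in temporal token propagation)."""
    h, ys = h0, []
    for u, v in zip(us, vs):
        h, y = ssm_step(h, u, v, A, B, C, D)
        ys.append(y)
    return h, ys
```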

4. Algorithmic and System Comparisons

Architecture | Modalities | Tracking Model | Headline Result | Notable Strengths
MambaSTS (Zhang et al., 8 Dec 2025) | RGB + language | Transformer + SSM + prompt | AUC=0.437, mAcc=0.443 | Long-sequence modeling, semantic context
EDTC (Zhu et al., 2023) | Thermal IR | YOLOv5s + transformer | Acc=0.486 (AntiUAV600) | Uncertainty-based switching, re-detection
Hybrid hardware (Lu et al., 12 Dec 2025) | AER events + grayscale | Frame/event-adaptive tracking | Prec@IoU≥0.5=94.8% | Ultra-low power, high speed, energy efficiency

MambaSTS delivers +6.6 percentage point mean accuracy gain on UAV-Anti-UAV over the next best baseline on dual-dynamic tracking, but even the best trackers only reach ≈44% AUC, highlighting the severity of compounded motion and complex scene factors (Zhang et al., 8 Dec 2025). EDTC demonstrates state-of-the-art performance on absence/reappearance sequences by integrating confidence-driven mode switching; ablations reveal the criticality of evidential heads for robust adaptation (Zhu et al., 2023). Hybrid hardware approaches achieve Pareto-optimal energy/tracking tradeoffs necessary for embedded UAV platforms and extremely fast-moving targets (Lu et al., 12 Dec 2025).
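
The evidential switching rule used by EDTC can be sketched as follows, assuming evidence index 0 is the target class and treating the threshold value for $\theta_{\text{eh}}$ as a placeholder (its published value is not stated here):

```python
def dirichlet_uncertainty(evidence):
    """Subjective-logic quantities from K-class evidence e_k >= 0:
    Dirichlet strength S = sum_k (e_k + 1), class probabilities
    (e_k + 1)/S, and global uncertainty u = K/S."""
    K = len(evidence)
    S = sum(e + 1.0 for e in evidence)
    probs = [(e + 1.0) / S for e in evidence]
    return probs, K / S

def keep_tracking(evidence, theta_eh=0.5):
    """Evidential loop: continue local tracking only if the target-class
    probability is >= 0.5 and uncertainty u <= theta_eh; otherwise fall
    back to global detection. theta_eh here is a placeholder value."""
    probs, u = dirichlet_uncertainty(evidence)
    return probs[0] >= 0.5 and u <= theta_eh
```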

5. Semantic and Multimodal Consistency Mechanisms

Recent advances utilize semantic “flows” and multi-level modulation to mitigate appearance shift and distractor confusion:

  • Dual-Flow Semantic Consistency (DFSC) (Jiang et al., 2021): Class-level modulation enforces inter-sequence UAV-category consistency (via cross-sequence ROI features), while instance-level modulation sharpens same-sequence discrimination; the composite loss combines RPN and Fast R-CNN objectives.
  • Multi-Modal Sensing: IR, RGB, language, and—prospectively—radar/LiDAR features are integrated as independent evidence sources, consistent with Dempster–Shafer fusion (Zhu et al., 2023, Zhang et al., 8 Dec 2025).
  • Dynamic Radar Networks (Guerra et al., 2020): Teams of UAV-based active sensors fuse range, bearing, Doppler, and velocity information via decentralized EKFs, optimizing 3D formation to maximize D-optimality of tracking information, and achieving sub-meter accuracy in environments intractable to fixed ground radars.
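
The Dempster–Shafer fusion of independent evidence sources referenced above can be illustrated with Dempster's combination rule for two mass functions (the class labels and mass values below are made up for illustration):

```python
def ds_combine(m1, m2):
    """Dempster's rule of combination for two mass functions whose focal
    elements are frozensets of hypotheses. Mass assigned to the empty
    intersection (conflict) is removed and the rest renormalized."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    norm = 1.0 - conflict
    return {k: v / norm for k, v in combined.items()}
```

For example, fusing a detector that assigns mass 0.6 to {target} (rest uncommitted) with a tracker assigning 0.7 to {target} concentrates 0.88 of the combined mass on {target}.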

6. Performance Evaluations and Limitations

Quantitative performance in dual-dynamic scenarios remains limited:

  • MambaSTS (UAV-Anti-UAV test set): AUC=0.437; average mACC over 50 modern deep tracking algorithms is 0.272 (Zhang et al., 8 Dec 2025).
  • EDTC (AntiUAV600 test set): Acc=0.486; YOLOv5s-only detection yields Acc≈0.392; state-of-the-art trackers without evidential switching reach only Acc between 0.14 and 0.33.
  • Hardware Hybrid System: 94.8% average tracking precision at 0.096 nJ/frame/pixel, up to 400 m and 80 px/s speed (Lu et al., 12 Dec 2025).

Limitations are pronounced under full occlusion, extreme illumination variation, or repeated out-of-view transitions; no current method can reliably re-detect under these compounded conditions. Multi-object and multi-modal extensions remain underexplored (Zhang et al., 8 Dec 2025, Zhu et al., 2023).

7. Open Research Directions

Challenges explicitly identified include:

  • Multi-sensor and multi-object extensions: True fusion of RGB, IR, RF, and language cues for drone swarm tracking or coordinated defense
  • Online adaptation and domain transfer: Rapid model updating to accommodate adversarial maneuvers and unseen environments
  • Nonlinear and bidirectional state-space modeling: Richer representations to encode context spanning extended temporal windows and abrupt regime changes
  • Hardware-efficiency: End-to-end quantization, energy-matched model design for embedded platforms
  • Advanced attention and re-detection: Learned mechanisms to recover from drift, occlusion, and out-of-view via dynamic attention or memory-based detection
  • Cooperative active sensing: Autonomous reconfiguration of sensor geometry (e.g., via distributed D-optimality) for maximally informative 3D tracking

The UAV-Anti-UAV framework and its associated datasets, architectural baselines, and metrics now form the empirical foundation for the next generation of robust aerial anti-UAV perception and pursuit systems, both in civilian safety and defense contexts (Zhang et al., 8 Dec 2025, Zhu et al., 2023, Lu et al., 12 Dec 2025, Jiang et al., 2021, Guerra et al., 2020).
