Dual-Dynamic Tracking (UAV-Anti-UAV)
- Dual-Dynamic Tracking is a framework that addresses the challenge of real-time localization and pursuit when both the observer and target UAVs are dynamically maneuvering.
- The approach incorporates transformer-based spatial-temporal modeling, evidential detection, and hybrid hardware strategies to manage occlusion, rapid motion, and distractor-rich environments.
- Extensive datasets and benchmarks have been developed to assess tracking performance, highlighting current limitations and guiding future research in UAV anti-drone defense systems.
Dual-Dynamic Tracking (UAV-Anti-UAV) seeks to address the challenge of robust, real-time localization and pursuit of adversarial aerial targets when both the observing platform (pursuer UAV) and the target (evasive UAV) are dynamically maneuvering. Unlike classical ground-based anti-UAV settings or standard single-dynamic tracking, dual-dynamic tracking requires algorithms and systems to withstand complex, non-stationary backgrounds, abrupt viewpoint and scale changes, rapid motion, occlusion, and distractor-rich scenarios. Recent research formalizes the UAV-Anti-UAV problem, benchmarks its unique challenges via large-scale datasets, and proposes architectural innovations—for both learning-based and embedded hardware approaches—that jointly reason about temporal, spatial, and semantic context under adversarial aerial dynamics (Zhang et al., 8 Dec 2025, Zhu et al., 2023, Lu et al., 12 Dec 2025).
1. Formal Problem Definition and Unique Challenges
Let $\{I_t\}_{t=1}^{T}$ be a video sequence acquired from a pursuing UAV, and let $b_1$ denote the initial ground-truth bounding box of the adversarial target UAV. The task is to estimate, for each frame $t$, a bounding box $\hat{b}_t$ that maintains tight spatial overlap (high IoU) with the ground truth as both observer and target undergo rapid, unpredictable 3D motion.
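The overlap objective can be made concrete with a standard IoU computation (a minimal sketch; the `(x, y, w, h)` box format is an assumption for illustration, not necessarily the datasets' native format):

```python
def iou(box_a, box_b):
    """Intersection-over-Union for axis-aligned boxes in (x, y, w, h) format."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Overlap extents, clamped at zero when boxes are disjoint
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```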
Distinguishing characteristics:
- Dual Agent Dynamics: Both the sensing UAV and the adversarial UAV exhibit non-linear, unconstrained flight paths, inducing severe compounded ego-motion, multidimensional parallax, and scene contextual drift.
- Small, Fast Targets: The target UAV typically occupies <1% of the frame in 30% of samples and exhibits extreme velocity (relative speed μ=0.79 image diagonals/frame).
- Frequent Occlusions and Out-of-View: Full/partial occlusion, target reappearance, and viewpoint changes are common; re-detection with no fixed template frame is compulsory.
- Clutter, Distractors, and Unaligned Modalities: Backgrounds vary from urban to rural, with similar distractors and multi-modal sensor streams (RGB, IR, language).
- Real-Time, Resource-Constrained Requirements: Tracking must operate at ≥30fps, typically without per-frame template re-initialization or excessive compute.
Formally, tracking is cast as spatio-temporal sequence estimation in which both the input and target-state distributions are non-stationary:

$$\hat{b}_t = f_{\theta}\left(I_t, M_{t-1}, \ell\right),$$

where $M_{t-1}$ is the accumulated temporal memory (motion/appearance), $\ell$ is the sequence-level language prompt describing the scenario, and $\theta$ are the network parameters (Zhang et al., 8 Dec 2025).
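Under this formulation, inference reduces to an autoregressive loop over frames. The sketch below assumes a hypothetical `tracker` object exposing `init`/`step` and carrying temporal memory internally; the actual model interfaces from the cited papers are not reproduced:

```python
def track_sequence(frames, init_box, prompt, tracker):
    """Autoregressive dual-dynamic tracking loop.

    `tracker` is a hypothetical stand-in for the learned model:
    it encodes the template and language prompt on init, then
    updates its internal temporal memory on every step.
    """
    tracker.init(frames[0], init_box, prompt)  # encode template + prompt
    boxes = [init_box]
    for frame in frames[1:]:
        box = tracker.step(frame)  # uses memory from previous frames
        boxes.append(box)
    return boxes
```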
2. Datasets and Task Benchmarking
UAV-Anti-UAV Dataset (Zhang et al., 8 Dec 2025)
- 1,810 video sequences (≈1.05 million frames; 9.85 hours)
- RGB frames with contiguous bounding box annotations
- One sequence-level language prompt per video
- 15 per-sequence/per-frame attributes: Camera Motion (CM), Fast Motion (FM), Small Object (SO), Similar Distractor (SD), Out-of-View (OV), Occlusion (PO/FO), Illumination Variation (IV), Scale/Aspect Ratio Variation (SV/ARV), Motion Blur (MB), etc.
- High diversity: average sequence length 578 frames (max 17,740); the overall distribution skews toward fast, small, and partially or fully occluded targets
AntiUAV600 (Zhu et al., 2023)
- 600 thermal IR videos (640×512, 25fps, >723,000 annotated frames)
- Per-frame challenge tags: Out-of-View, Occlusion, Fast Motion, Scale-Variation, etc.
- No fixed appearance template; random target absence/reappearance
Anti-UAV (Jiang et al., 2021)
- 318 RGB/IR video pairs at 25fps; 585,900 boxes
- Binary frame/sequence attributes; highly challenging for robust generalization
Benchmarking Metrics
- IoU-based AUC (area under the IoU-threshold success curve) and mean accuracy (mAcc)
- Precision/Success: Center-error thresholds, area-under-curve
- State Accuracy (SA) and "Acc" for dual-dynamic anti-UAV: combine IoU on present frames with credit for correctly predicted target absence, while penalizing missed detections
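An SA-style score can be sketched as a per-frame average that credits overlap when the target is visible and correct "empty" predictions when it is not (a simplified sketch; the exact weighting and penalty terms vary between benchmarks):

```python
def state_accuracy(ious, gt_visible, pred_empty):
    """Mean per-frame score: IoU when the target is present,
    full credit for correctly predicted absence, zero otherwise."""
    total = 0.0
    for iou_t, visible, empty in zip(ious, gt_visible, pred_empty):
        if visible:
            total += iou_t                    # overlap on present frames
        else:
            total += 1.0 if empty else 0.0    # "correct absence" credit
    return total / len(ious)
```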
3. Dual-Dynamic Tracking Architectures
Spatial-Temporal-Semantic Modeling (MambaSTS) (Zhang et al., 8 Dec 2025)
- Visual Backbone: Transformer-based model (HiViT) hierarchically encodes both template and search regions
- Temporal Modeling: A discrete linear time-invariant state-space model (SSM) in the Mamba style evolves a hidden state vector, $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, with readout $y_t = C h_t$
- Language Integration: Mamba-encoded representations from language prompts propagate scenario-level semantic priors through SSM scanning
- Temporal Token Propagation: Past search features serialized via unidirectional scan inform per-frame context in an autoregressive manner
- Multi-Head Tracking Head: Outputs class, offset, and size branches with a composite loss (focal classification, $\ell_1$ regression, and generalized IoU)
- No online fine-tuning in inference; purely “in-distribution” tracking
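The SSM recurrence at the core of the temporal model can be illustrated with a plain unidirectional scan (a generic LTI sketch; MambaSTS's selective, input-dependent parameterization is not reproduced here):

```python
def ssm_scan(A, B, C, xs, h0=None):
    """Unidirectional scan of a discrete LTI state-space model:
    h_t = A @ h_{t-1} + B @ x_t ;  y_t = C @ h_t  (plain-list matmuls)."""
    def matvec(M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    h = h0 if h0 is not None else [0.0] * len(A)
    ys = []
    for x in xs:
        # State update: decayed previous state plus projected input
        h = [a + b for a, b in zip(matvec(A, h), matvec(B, x))]
        ys.append(matvec(C, h))  # per-step readout
    return ys
```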
Evidential Detection and Tracking Collaboration (EDTC) (Zhu et al., 2023)
- Detection-Tracking Alternation: YOLOv5s global detector runs on every frame; on positive hypothesis, a local transformer tracker is initialized
- Tracker: Siamese-style backbone with Relevance Decoupling Modules (RDM); cross- and self-attention fuse instance and appearance context
- Uncertainty Quantification (“Evidential Head”): After tracking, evidence is pooled; Dirichlet-distributed probabilities compute per-frame target/bkg likelihood and global uncertainty
- Evidential Loop: Tracking continues while target confidence stays high and evidential uncertainty stays low (both judged against preset thresholds); otherwise the system reverts to global detection
- Dempster–Shafer Belief Formalism: Detection and tracking operate as independent evidence sources, enabling robust recovery under drift, occlusion, and re-entry
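The Dirichlet parameterization behind the evidential head follows the standard evidential deep learning recipe, where non-negative per-class evidence yields both class probabilities and a global uncertainty mass (a sketch under that assumption; EDTC's exact evidence pooling is omitted):

```python
def dirichlet_belief(evidence):
    """Map non-negative per-class evidence e_k to Dirichlet parameters
    alpha_k = e_k + 1, expected class probabilities alpha_k / S, and a
    global uncertainty mass u = K / S (high when total evidence is low)."""
    K = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    S = sum(alpha)
    probs = [a / S for a in alpha]
    uncertainty = K / S
    return probs, uncertainty
```

With no evidence at all, the distribution is uniform and uncertainty is maximal, which is exactly the condition under which the evidential loop hands control back to the global detector.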
Hybrid Embedded/Hardware Tracking (Lu et al., 12 Dec 2025)
- Frame-Based and Event-Driven Modes: Adaptive switching on Region Proposal (RP) area and velocity; frame mode for periodic broad detection, event mode for low-latency fast targets.
- Fast Object Tracking Unit (FOTU): Parallelized trajectory monitors for high-velocity targets, with bounding-box updates that adapt to the target's measured speed
- Neural Processing Unit (NPU): Custom 16×16 PE array, zero-skipping MACs, dual support for image patch and trajectory inference
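The frame/event switching policy can be sketched as a threshold rule on region-proposal (RP) area and velocity (the threshold values below are illustrative placeholders, not the chip's calibrated settings):

```python
# Hypothetical thresholds; the system's calibrated values are not public.
AREA_MIN_PX = 64           # below this, the RP is too small for frame mode
SPEED_MAX_PX_PER_MS = 2.0  # above this, frame-rate latency is too high

def select_mode(rp_area_px, rp_speed_px_per_ms):
    """Return 'event' for small or fast targets needing low-latency updates,
    'frame' for periodic broad detection of larger, slower targets."""
    if rp_area_px < AREA_MIN_PX or rp_speed_px_per_ms > SPEED_MAX_PX_PER_MS:
        return "event"
    return "frame"
```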
4. Algorithmic and System Comparisons
| Architecture | Modalities | Tracking Model | AUC / mAcc | Notable Strengths |
|---|---|---|---|---|
| MambaSTS (Zhang et al., 8 Dec 2025) | RGB + language | Transformer + SSM + prompt | AUC=0.437, mAcc=0.443 | Long-sequence modeling, semantic context |
| EDTC (Zhu et al., 2023) | Thermal IR | YOLOv5s + transformer | Acc=0.486 (AntiUAV600) | Uncertainty-based switching, re-detection |
| Hybrid Hardware (Lu et al., 12 Dec 2025) | AER events + grayscale | Frame/event-adaptive tracking | Prec@IoU≥0.5=94.8% | Ultra low-power, high-speed, energy efficiency |
MambaSTS delivers +6.6 percentage point mean accuracy gain on UAV-Anti-UAV over the next best baseline on dual-dynamic tracking, but even the best trackers only reach ≈44% AUC, highlighting the severity of compounded motion and complex scene factors (Zhang et al., 8 Dec 2025). EDTC demonstrates state-of-the-art performance on absence/reappearance sequences by integrating confidence-driven mode switching; ablations reveal the criticality of evidential heads for robust adaptation (Zhu et al., 2023). Hybrid hardware approaches achieve Pareto-optimal energy/tracking tradeoffs necessary for embedded UAV platforms and extremely fast-moving targets (Lu et al., 12 Dec 2025).
5. Semantic and Multimodal Consistency Mechanisms
Recent advances utilize semantic “flows” and multi-level modulation to mitigate appearance shift and distractor confusion:
- Dual-Flow Semantic Consistency (DFSC) (Jiang et al., 2021): Class-level modulation enforces inter-sequence UAV-category consistency (via cross-sequence ROI features), while instance-level modulation sharpens same-sequence discrimination. Composite loss combines RPN and Fast-RCNN objectives.
- Multi-Modal Sensing: IR, RGB, language, and—prospectively—radar/LiDAR features are integrated as independent evidence sources, consistent with Dempster–Shafer fusion (Zhu et al., 2023, Zhang et al., 8 Dec 2025).
- Dynamic Radar Networks (Guerra et al., 2020): Teams of UAV-based active sensors fuse range, bearing, Doppler, and velocity information via decentralized EKFs, optimizing 3D formation to maximize D-optimality of tracking information, and achieving sub-meter accuracy in environments intractable to fixed ground radars.
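The D-optimality criterion used for formation optimization maximizes the determinant (equivalently, the log-determinant) of the fused information matrix. A minimal sketch for 3×3 position-information matrices, with a hypothetical `d_optimal` selector over candidate formations:

```python
import math

def log_det_3x3(F):
    """log det of a 3x3 information matrix (the D-optimality objective)."""
    a, b, c = F[0]
    d, e, f = F[1]
    g, h, i = F[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return math.log(det)

def d_optimal(candidate_infos):
    """Pick the formation whose summed information matrix has maximal log-det.
    `candidate_infos`: list of formations, each a list of per-UAV 3x3 matrices."""
    def fuse(Fs):  # information from independent sensors adds
        return [[sum(F[r][c] for F in Fs) for c in range(3)] for r in range(3)]
    return max(range(len(candidate_infos)),
               key=lambda k: log_det_3x3(fuse(candidate_infos[k])))
```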
6. Performance Evaluations and Limitations
Quantitative performance in dual-dynamic scenarios remains limited:
- MambaSTS (UAV-Anti-UAV test set): AUC=0.437; the average mAcc over 50 modern deep tracking algorithms is 0.272 (Zhang et al., 8 Dec 2025).
- EDTC (AntiUAV600 test set): Acc=0.486; YOLOv5s-only detection yields Acc≈0.392; SOTA trackers without evidential switching achieve Acc between 0.14 and 0.33.
- Hardware Hybrid System: 94.8% average tracking precision at 0.096 nJ/frame/pixel, up to 400 m and 80 px/s speed (Lu et al., 12 Dec 2025).
Limitations are pronounced under full occlusion, extreme illumination variation, or repeated out-of-view transitions; no current method can reliably re-detect under these compounded conditions. Multi-object and multi-modal extensions remain underexplored (Zhang et al., 8 Dec 2025, Zhu et al., 2023).
7. Open Research Directions
Challenges explicitly identified include:
- Multi-sensor and multi-object extensions: True fusion of RGB, IR, RF, and language cues for drone swarm tracking or coordinated defense
- Online adaptation and domain transfer: Rapid model updating to accommodate adversarial maneuvers and unseen environments
- Nonlinear and bidirectional state-space modeling: Richer representations to encode context spanning extended temporal windows and abrupt regime changes
- Hardware-efficiency: End-to-end quantization, energy-matched model design for embedded platforms
- Advanced attention and re-detection: Learned mechanisms to recover from drift, occlusion, and out-of-view via dynamic attention or memory-based detection
- Cooperative active sensing: Autonomous reconfiguration of sensor geometry (e.g., via distributed D-optimality) for maximally informative 3D tracking
The UAV-Anti-UAV framework and its associated datasets, architectural baselines, and metrics now form the empirical foundation for the next generation of robust aerial anti-UAV perception and pursuit systems, both in civilian safety and defense contexts (Zhang et al., 8 Dec 2025, Zhu et al., 2023, Lu et al., 12 Dec 2025, Jiang et al., 2021, Guerra et al., 2020).