YOLOBirDrone: Aerial Drone & Bird Detection
- The paper introduces YOLOBirDrone, a novel architecture extending YOLOv9 with adaptive deformable convolutions and dual attention modules for improved aerial object detection.
- It integrates AELAN, MPDA, and RMPDA modules to adaptively learn multi-scale, context-enhanced features, significantly boosting precision, recall, and mAP especially for small and occluded targets.
- Empirical results on the BirDrone dataset demonstrate YOLOBirDrone’s superior detection speed and accuracy compared to state-of-the-art models for distinguishing small drones from birds.
YOLOBirDrone is a hybrid object detection architecture advancing vision-based aerial surveillance by specifically optimizing for the challenging task of differentiating drones from birds, particularly when the targets are small or subject to adverse imaging conditions. Introduced alongside the BirDrone dataset—a large-scale, annotated benchmark rich in small and indistinct objects—YOLOBirDrone extends the YOLOv9 baseline through the integration of three major components: the Adaptive and Extended Layer Aggregation Network (AELAN) in the backbone, the Multi-Scale Progressive Dual Attention (MPDA) module at shallow neck levels, and the Reverse MPDA (RMPDA) at deeper stages. These innovations collectively enable shape-adaptive, context-enhanced multi-scale representation learning, resulting in improved precision, recall, and mAP across a variety of aerial detection scenarios (Kaur et al., 13 Jan 2026).
1. Architectural Innovations and Baseline Extension
YOLOBirDrone follows the canonical YOLO partitioning: backbone (feature extractor), neck (feature fusion), and detection head (multi-scale prediction). It uses YOLOv9 as the point of departure, which incorporates GELAN—Generalized Efficient Layer Aggregation Network—blending ELAN and CSP blocks for hierarchical feature extraction. In YOLOBirDrone, GELAN is replaced by AELAN, which introduces learnable deformable convolutions, enabling dynamic adaptation of the sampling grid to object boundaries. The neck’s standard pyramid fusion is augmented with MPDA at high-resolution (shallow) levels and RMPDA at low-resolution (deep) levels, culminating in better local-global and spatial-channel balancing. The YOLOv9 head, predicting at three spatial scales, is preserved.
Data flow is:

```
Input (640×640) → Stem → (AELAN Blocks) → Multi-scale Feature Maps {P3, P4, P5}
    ├─ MPDA at P3, P4 (shallow, high resolution)
    └─ RMPDA at P5 (deep, low resolution)
→ Fuse in FPN → YOLOv9 Head → {Objectness, Box, Class}
```
2. Component Modules: AELAN, MPDA, and RMPDA
Adaptive and Extended Layer Aggregation Network (AELAN)
AELAN extends GELAN by embedding Deformable Convolution (DConv) layers in place of every standard convolution in the CSP paths. DConv layers learn pixel-wise offsets for each kernel position, yielding outputs of the form $y(p) = \sum_{k} w_k \, x(p + p_k + \Delta p_k)$, where $p_k$ enumerates the kernel's sampling positions and the offsets $\Delta p_k$ are predicted by a lightweight convolutional subnetwork. This enables adaptive receptive fields, critical for localizing and aligning small, non-rectilinear targets such as distant birds or drones with variable aspect ratios.
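The deformable sampling above can be sketched in NumPy. This is an illustrative single-channel, 3×3, stride-1 version, not the paper's implementation; in practice the `offsets` tensor is predicted by a small convolutional subnetwork, and `bilinear` and `deform_conv2d` are hypothetical helper names:

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample x at fractional location (py, px); out-of-range reads as 0."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for yy, wy in ((y0, 1 - (py - y0)), (y0 + 1, py - y0)):
        for xx, wx in ((x0, 1 - (px - x0)), (x0 + 1, px - x0)):
            if 0 <= yy < H and 0 <= xx < W:
                val += wy * wx * x[yy, xx]
    return val

def deform_conv2d(x, weight, offsets):
    """3x3 deformable convolution, stride 1, no padding.

    x:       (H, W) input feature map
    weight:  (3, 3) kernel
    offsets: (H-2, W-2, 9, 2) learned (dy, dx) per output location and kernel tap
    """
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    taps = [(dy, dx) for dy in range(3) for dx in range(3)]
    for i in range(H - 2):
        for j in range(W - 2):
            acc = 0.0
            for k, (dy, dx) in enumerate(taps):
                oy, ox = offsets[i, j, k]
                # sample at the regular grid position plus the learned offset
                acc += weight[dy, dx] * bilinear(x, i + dy + oy, j + dx + ox)
            out[i, j] = acc
    return out
```

With all offsets zero this reduces to an ordinary convolution; nonzero offsets warp the sampling grid toward object boundaries.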
Multi-Scale Progressive Dual Attention (MPDA)
MPDA is deployed at shallow neck levels handling high-resolution feature maps. It constructs four cascaded feature streams through successive convolutions, applies dual attention (spatial attention with EPSANet on the streams targeting fine detail, and channel attention with EECA on the streams capturing broader context), and concatenates the outputs. A final attention stage is applied to the concatenated tensor. This progressive dual-attention hierarchy provides both spatial precision and channel-wise selectivity, crucial for distinguishing targets in cluttered, heterogeneous backgrounds.
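A minimal NumPy sketch of the dual-attention ordering, using generic SE/CBAM-style gates as stand-ins for the paper's EECA and EPSANet modules (all function names and the two-fine/two-coarse split are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """SE-style stand-in for EECA: gate each channel by its global mean. x: (C, H, W)."""
    gate = sigmoid(x.mean(axis=(1, 2)))           # (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """CBAM-style stand-in for EPSANet's spatial branch: gate each location
    by pooled channel statistics. x: (C, H, W)."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    gate = sigmoid(pooled.mean(axis=0))                   # (H, W)
    return x * gate[None, :, :]

def mpda_block(streams):
    """Shallow-level MPDA ordering: spatial attention on the fine streams,
    channel attention on the coarse ones, then a final attention on the concat."""
    fine = [spatial_attention(s) for s in streams[:2]]
    coarse = [channel_attention(s) for s in streams[2:]]
    fused = np.concatenate(fine + coarse, axis=0)
    return channel_attention(fused)
```

RMPDA would swap the order, applying channel attention before spatial attention on the coarse scales, as described below.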
Reverse MPDA (RMPDA)
RMPDA, placed at deeper neck layers (low spatial resolution), reverses the MPDA attention order: Channel Attention precedes Spatial Attention on coarse scales, while the reverse is performed on fine scales. This orientation preserves the integrity of shape cues at coarse resolutions—by first emphasizing "what" information is important (channels) and then "where" (spatial)—while still attending to global context.
3. Integration and Dataflow through YOLO Hierarchy
AELAN blocks replace the ELAN+CSP stacks in the backbone, generating multi-resolution features (P3, P4, P5). MPDA modules are inserted at the P3 and P4 outputs (higher spatial resolution), while RMPDA is deployed at P5 (lowest spatial resolution). Fused features propagate through conventional Feature Pyramid Network (FPN)/PANet-style connections into the three-scale YOLO head, producing, for each spatial cell and anchor, an objectness score, bounding box parameters $(x, y, w, h)$, and per-class probabilities (here, bird and drone). This arrangement ensures enhanced spatial sensitivity for small objects while maintaining scalability to larger ones.
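The head's per-cell outputs can be decoded along the lines of the common YOLO parameterization. The raw-output layout and the `decode_cell` helper below are illustrative assumptions; YOLOv9's exact decode differs in detail:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_cell(raw, cell_xy, stride):
    """Decode one grid cell's raw head output into an image-space box.

    raw:     assumed layout [tx, ty, tw, th, obj, p_bird, p_drone]
    cell_xy: (col, row) index of the grid cell
    stride:  pixels per cell at this pyramid level (e.g. 8 at P3)
    """
    tx, ty, tw, th, obj, *cls = raw
    cx = (sigmoid(tx) + cell_xy[0]) * stride   # box centre x, pixels
    cy = (sigmoid(ty) + cell_xy[1]) * stride   # box centre y, pixels
    w = np.exp(tw) * stride                    # box width, pixels
    h = np.exp(th) * stride                    # box height, pixels
    scores = sigmoid(obj) * sigmoid(np.array(cls))  # per-class confidence
    return (cx, cy, w, h), scores
```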
4. Loss Function, Training Regime, and Hyperparameters
Training uses the standard YOLO one-stage composite loss

$$\mathcal{L} = \lambda_{box}\,\mathcal{L}_{CIoU} + \lambda_{obj}\,\mathcal{L}_{obj} + \lambda_{cls}\,\mathcal{L}_{cls}$$

with:
- $\mathcal{L}_{CIoU}$: Complete IoU bounding-box regression loss
- $\mathcal{L}_{obj}$: binary cross-entropy on the objectness score
- $\mathcal{L}_{cls}$: binary cross-entropy on per-class probabilities
Typical hyperparameters: input size 640×640, batch size 16, 300 epochs, initial learning rate 0.01 (step or cosine decay schedule), optimizer: SGD with momentum or AdamW as per YOLOv9 standards.
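The regression and classification terms above can be sketched as follows; `ciou_loss` follows the standard Complete-IoU formulation and `bce` the usual binary cross-entropy, with boxes given as center-format `(cx, cy, w, h)` tuples (an assumed convention):

```python
import numpy as np

def ciou_loss(b1, b2, eps=1e-9):
    """Complete IoU loss between two boxes in (cx, cy, w, h) format."""
    x1a, y1a, x2a, y2a = b1[0] - b1[2]/2, b1[1] - b1[3]/2, b1[0] + b1[2]/2, b1[1] + b1[3]/2
    x1b, y1b, x2b, y2b = b2[0] - b2[2]/2, b2[1] - b2[3]/2, b2[0] + b2[2]/2, b2[1] + b2[3]/2
    iw = max(0.0, min(x2a, x2b) - max(x1a, x1b))
    ih = max(0.0, min(y2a, y2b) - max(y1a, y1b))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    iou = inter / (union + eps)
    # squared centre distance over squared enclosing-box diagonal
    rho2 = (b1[0] - b2[0]) ** 2 + (b1[1] - b2[1]) ** 2
    cw = max(x2a, x2b) - min(x1a, x1b)
    ch = max(y2a, y2b) - min(y1a, y1b)
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term
    v = (4 / np.pi ** 2) * (np.arctan(b2[2] / (b2[3] + eps)) - np.arctan(b1[2] / (b1[3] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

def bce(p, y, eps=1e-9):
    """Binary cross-entropy for a predicted probability p and target y."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```

Identical boxes give a CIoU loss near zero, while disjoint boxes are penalized beyond the plain IoU term by the center-distance component.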
5. BirDrone Dataset Properties
The BirDrone dataset comprises 11,495 RGB images (8,428 from BirDrone, 3,067 additional annotated frames), with 13,881 drones and 15,867 birds, annotated over a wide scale spectrum. The label distribution covers extremely small objects (<20×20 px: 1,129), small (20–32 px: 1,576), medium (32–96 px: 12,510), and large (>96 px: 14,553). The smallest annotated drone is 7×5 px. Data is split 70%/20%/10% for train/val/test. Scenes present challenges of low contrast, occlusion, motion blur, and background clutter, making them representative of real-world aerial surveillance.
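A reproducible 70%/20%/10% split can be produced along these lines; the paper does not specify its exact splitting procedure, so `split_indices` is an illustrative sketch:

```python
import random

def split_indices(n, fracs=(0.70, 0.20, 0.10), seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and cut them into
    train/val/test partitions of the given fractions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(fracs[0] * n))
    n_val = int(round(fracs[1] * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```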
6. Empirical Results and Comparative Performance
YOLOBirDrone was evaluated against multiple YOLO baselines and state-of-the-art detectors including YOLOv8–v12 and RT-DETRv2. Ablation studies indicate cumulative accuracy improvements from baseline (YOLOv9, M1: 81.73% detection accuracy) to full YOLOBirDrone (M6: 84.91% detection accuracy), with precision increasing from 0.929 to 0.949, recall from 0.907 to 0.917, mAP@0.5 from 0.940 to 0.948, and mAP@0.5–0.95 from 0.644 to 0.668. Notably, YOLOBirDrone is faster (0.149 s/frame) than many recent models. Precision and recall gains reflect a tangible reduction in both false positives and false negatives, demonstrating the impact of the AELAN, MPDA, and RMPDA modules (Kaur et al., 13 Jan 2026).
| Model | Precision | Recall | mAP@0.5 | mAP@0.5–0.95 | Det. Acc. (%) | Inference (s/frame) |
|---|---|---|---|---|---|---|
| YOLOv9 | 0.929 | 0.907 | 0.940 | 0.644 | 81.73 | 0.211 |
| YOLOBirDrone | 0.949 | 0.917 | 0.948 | 0.668 | 84.91 | 0.149 |
Ablation indicates each module contributes: AELAN (+1.1 pp acc.), MPDA (+0.8 pp), RMPDA (+1.3 pp), full stack (+3.2 pp over YOLOv9). The largest mAP gain is achieved on small and occluded targets.
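The reported precision and recall gains map directly onto reductions in false positives and false negatives via the standard definitions (the counts below are hypothetical, for illustration only):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts: cutting false positives 100 -> 50 at fixed TP/FN
# raises precision while recall is unchanged.
p_before, r_before = precision_recall(tp=900, fp=100, fn=90)
p_after, r_after = precision_recall(tp=900, fp=50, fn=90)
```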
7. Context within Drone and Bird Detection Research
YOLOBirDrone addresses an operational challenge unmet by preceding models: robust differentiation between small drones and birds under complex environmental and imaging conditions. Alternatives compete on related problems: YOLO-Drone targets dense small-object detection with novel backbones and loss functions (Zhu et al., 2023), YOLO-FEDER FusionNet fuses camouflage-segmentation priors (Lenhard et al., 2024), and lightweight YOLO variants address multi-drone scenarios (Sharma et al., 2022). YOLOBirDrone stands out in its explicit integration of deformable convolutions and progressive dual attention across both spatial and channel modalities. A plausible implication is that shape-adaptive convolution and progressive attention at multiple pyramid levels may become standard in future architectures targeting general small-object detection and fine-grained inter-class discrimination in remote sensing and surveillance.
References
- "YOLOBirDrone: Dataset for Bird vs Drone Detection and Classification and a YOLO based enhanced learning architecture" (Kaur et al., 13 Jan 2026)
- "YOLO-Drone: Airborne real-time detection of dense small objects from high-altitude perspective" (Zhu et al., 2023)
- "YOLO-FEDER FusionNet: A Novel Deep Learning Architecture for Drone Detection" (Lenhard et al., 2024)
- "Lightweight Multi-Drone Detection and 3D-Localization via YOLO" (Sharma et al., 2022)