EFSI-DETR: Real-Time UAV Small Object Detector
- EFSI-DETR is a detection framework that employs adaptive frequency-spatial fusion, deep semantic feature extraction, and fine-grained retention to enhance small object detection in UAV images.
- The model maintains real-time performance (≥188 FPS on RTX 4090) and compact size (27.3M parameters) while achieving state-of-the-art accuracy on UAV benchmarks.
- Dynamic modules like DyFusNet and ESFC contribute to robust multi-scale feature fusion and semantic concentration, leading to significant improvements in detection precision.
EFSI-DETR (Efficient Frequency-Semantic Integration DETR) is a detection framework designed for real-time small object detection in UAV (Unmanned Aerial Vehicle) imagery. Addressing the challenges of limited feature representation, ineffective multi-scale fusion, and underutilization of frequency information in existing approaches, EFSI-DETR integrates dynamic frequency-spatial fusion, efficient deep semantic feature extraction, and fine-grained retention mechanisms. The model achieves state-of-the-art accuracy on benchmarks such as VisDrone and CODrone, while maintaining real-time speed and compact model size (Xia et al., 26 Jan 2026).
1. Architectural Overview and Design Objectives
EFSI-DETR builds upon the real-time DETR (RT-DETR) backbone, retaining an encoder–decoder pipeline but incorporating specialized modules to address the unique challenges of small object detection in UAV imagery. The principal design goals are:
- Enhanced small-object representation via adaptive fusion of frequency and spatial cues.
- Deep semantic feature extraction with low computational overhead.
- Preservation of high-resolution, fine-grained details necessary for accurate small object localization.
- Real-time inference speed (≥188 FPS on RTX 4090) with model compactness (27.3 M parameters).
The architectural innovations comprise three main components:
- Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet): Integrates frequency and spatial feature fusion, implemented in the multi-scale fusion stage of the encoder.
- Efficient Semantic Feature Concentrator (ESFC): Extracts concentrated deep semantic features from the coarsest encoder maps just before decoding.
- Fine-grained Feature Retention (FFR): Routes shallow, high-resolution features into the decoder and omits the coarsest map for enhanced edge and texture preservation.
2. Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet)
DyFusNet realizes hardware-friendly, adaptive multi-scale fusion by synthesizing frequency-band decomposition, spatial aggregation, and channel-wise gating using only spatial operations.
2.1 Dynamic Multi-resolution Spectral Decomposition (DMSD)
DMSD is formulated, for an input feature map $X$, as

$$X_{\mathrm{out}} = \alpha_L\, F_L(X) + \alpha_I\, X + \alpha_H\, F_H(X),$$

where:
- $F_L$ is a fixed low-pass filter (low-pass),
- the identity term passes $X$ through unchanged (identity, all-pass),
- $F_H$ is a learnable high-pass filter.

Content-adaptive soft coefficients $(\alpha_L, \alpha_I, \alpha_H)$ are obtained via global average pooling, a two-layer MLP, and softmax:

$$(\alpha_L, \alpha_I, \alpha_H) = \operatorname{softmax}\!\big(\mathrm{MLP}(\mathrm{GAP}(X))\big).$$
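As a concrete illustration, the decomposition and its soft gating can be sketched in NumPy. The 2× average-pool low-pass, the residual high-pass, and the MLP shapes below are illustrative stand-ins, not the paper's exact operators:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dmsd(x, w1, w2):
    """DMSD sketch: blend low-pass, identity, and high-pass bands of a
    (C, H, W) feature map with content-adaptive softmax coefficients."""
    C, H, W = x.shape
    # Low-pass band: 2x average pooling, then nearest-neighbour upsampling.
    low = x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))
    low = low.repeat(2, axis=1).repeat(2, axis=2)
    # High-pass band: residual to the low-pass band (the paper instead
    # uses a learnable high-pass convolution).
    high = x - low
    # Soft coefficients: GAP -> two-layer MLP -> softmax over the 3 bands.
    gap = x.mean(axis=(1, 2))                 # (C,)
    alpha = softmax(w2 @ np.tanh(w1 @ gap))   # (3,)
    return alpha[0] * low + alpha[1] * x + alpha[2] * high
```

With the second MLP layer zeroed the coefficients become uniform (1/3 each), so the output reduces to (F_L(X) + X + (X − F_L(X)))/3 = 2X/3, a convenient sanity check.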
2.2 Spatial–Frequency Cooperative Modulation (SFCM)
SFCM refines the fused features $Y$ by aggregating spatial context across multiple receptive fields and modulating by channel importance:
- Spatial aggregation: $S = \sum_{k} \mathrm{Conv}_{k \times k}(Y)$, pooling context over several kernel sizes $k$.
- Channel gating: $g = \sigma\big(\mathrm{MLP}(\mathrm{GAP}(S))\big)$.
Final output: $Z = g \odot S + Y$, where $\odot$ denotes channel-wise multiplication and the residual term preserves the input signal.
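A minimal NumPy sketch of this two-step modulation, with box-filter aggregation at window sizes 1/3/5 standing in for the multi-receptive-field convolutions (the actual kernel sizes are an assumption here):

```python
import numpy as np

def sfcm(y):
    """SFCM sketch: aggregate spatial context at several receptive fields
    (box filters of size 1/3/5), then gate channels with a sigmoid over
    global average pooling, plus a residual link back to the input."""
    C, H, W = y.shape
    agg = np.zeros_like(y)
    for k in (1, 3, 5):
        pad = k // 2
        padded = np.pad(y, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
        # k x k box blur: average over all k*k shifted views.
        pooled = np.stack([
            padded[:, i:i + H, j:j + W]
            for i in range(k) for j in range(k)
        ]).mean(axis=0)
        agg += pooled / 3.0
    # Channel gating: sigma(GAP(agg)) re-weights each channel.
    gate = 1.0 / (1.0 + np.exp(-agg.mean(axis=(1, 2))))   # (C,)
    return gate[:, None, None] * agg + y                  # residual link
```

For a constant input the blurs are identity, so the output is sigmoid(1)·1 + 1 per element, which makes the behaviour easy to verify by hand.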
2.3 Channel Splitting and Fusion
To balance efficiency with representational power, DyFusNet splits channels: only a fraction $\rho C$ of the $C$ input channels undergoes DMSD→SFCM, while the remaining $(1-\rho)C$ channels bypass this expensive path. The fused output is

$$X_{\mathrm{fused}} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(Z, X_{\mathrm{bypass}})\big).$$
This mechanism yields adaptive weighting and synergistic fusion of frequency and spatial detail.
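The split-and-fuse step can be sketched as follows; the split ratio, the stand-in refinement path, and the identity channel mix are illustrative assumptions:

```python
import numpy as np

def split_fuse(x, rho=0.25, refine=lambda t: 0.5 * t):
    """Channel-split fusion sketch: a fraction rho of the channels takes
    the expensive DMSD->SFCM path (stood in for by `refine`); the rest
    bypass it; a 1x1-conv-style channel mix recombines both groups."""
    C = x.shape[0]
    c = int(round(rho * C))
    refined, bypass = refine(x[:c]), x[c:]
    fused = np.concatenate([refined, bypass], axis=0)
    mix = np.eye(C)  # identity stand-in for the learned 1x1 convolution
    return np.einsum("oc,chw->ohw", mix, fused)
```

With the identity mix, the output keeps the bypass channels unchanged while the refined fraction reflects whatever the expensive path computed.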
3. Efficient Semantic Feature Concentrator (ESFC)
ESFC operates on the coarsest feature map, focusing on maximum semantic extraction with minimal computation via three sub-modules:
3.1 Dynamic Expert Convolution (DEConv)
Parallel lightweight expert convolutions $\{\mathrm{Conv}_i\}_{i=1}^{N}$ are attended by input-adaptive weights $w_i$:

$$Y = \sum_{i=1}^{N} w_i(X)\,\mathrm{Conv}_i(X), \qquad w = \operatorname{softmax}\!\big(\mathrm{MLP}(\mathrm{GAP}(X))\big).$$
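The expert mixture can be sketched in NumPy with 1×1 experts (channel-mixing matrices); the expert kernels and MLP widths here are illustrative, not the paper's configuration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deconv(x, experts, w1, w2):
    """DEConv sketch: N parallel 1x1 'expert' convolutions (one channel-
    mixing matrix each) combined by input-adaptive softmax weights."""
    gap = x.mean(axis=(1, 2))             # (C,)
    w = softmax(w2 @ np.tanh(w1 @ gap))   # one weight per expert
    outs = [np.einsum("oc,chw->ohw", K, x) for K in experts]
    return sum(wi * o for wi, o in zip(w, outs))
```

With the second MLP layer zeroed the weights become uniform, so two experts I and 2I yield 1.5·X, which serves as a quick correctness check.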
3.2 Efficient Ghost Block (EGBlock)
Building on GhostNet, EGBlock generates primary and "ghost" (expanded) features using cheap depthwise convolutions:

$$Y_{\mathrm{p}} = \mathrm{Conv}_{1 \times 1}(X), \qquad Y_{\mathrm{g}} = \mathrm{DWConv}(Y_{\mathrm{p}}), \qquad Y = \mathrm{Concat}(Y_{\mathrm{p}}, Y_{\mathrm{g}}).$$
Multiple EGBlocks (with residual links) constitute the ESFC stack.
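A NumPy sketch of one EGBlock; the halved primary width, the shared 3×3 depthwise kernel, and the residual placement are illustrative assumptions:

```python
import numpy as np

def eg_block(x, primary_mix, ghost_kernel):
    """EGBlock sketch: a cheap pointwise conv makes the primary half of
    the channels; a depthwise 3x3 conv expands them into 'ghost' features;
    both halves are concatenated and a residual link is added."""
    primary = np.einsum("oc,chw->ohw", primary_mix, x)  # 1x1 conv stand-in
    Cp, H, W = primary.shape
    padded = np.pad(primary, ((0, 0), (1, 1), (1, 1)), mode="edge")
    ghost = np.zeros_like(primary)
    for di in range(3):                 # depthwise 3x3 conv; the kernel is
        for dj in range(3):             # shared across channels for brevity
            ghost += ghost_kernel[di, dj] * padded[:, di:di + H, dj:dj + W]
    return np.concatenate([primary, ghost], axis=0) + x  # residual link
```

Choosing the identity depthwise kernel makes ghost == primary, so the block's bookkeeping (concat plus residual) can be checked exactly.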
3.3 Dual-domain Guidance Aggregation (DGA)
DGA further enhances features by sequential channel and spatial attention, utilizing ECA-Net-style channel guidance and a spatial mask based on pooled activations:

$$X' = \sigma\big(\mathrm{Conv1D}(\mathrm{GAP}(X))\big) \odot X, \qquad Y = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(X');\, \mathrm{MaxPool}(X')])\big) \odot X'.$$
Final outputs integrate DEConv, EGBlocks, and DGA within a lightweight dual-branch structure.
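A NumPy sketch of the two-stage guidance; the ECA kernel size and the averaged avg/max spatial mask are illustrative choices:

```python
import numpy as np

def dga(x, eca_kernel):
    """DGA sketch: ECA-style channel guidance (1-D conv over the GAP
    vector) followed by a spatial mask from channel-pooled activations."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    gap = x.mean(axis=(1, 2))                                # (C,)
    ch = sigmoid(np.convolve(gap, eca_kernel, mode="same"))  # channel gate
    x = ch[:, None, None] * x
    # Spatial mask from channel-wise average and max pooling.
    mask = sigmoid(0.5 * (x.mean(axis=0) + x.max(axis=0)))   # (H, W)
    return mask[None] * x
```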
4. Fine-grained Feature Retention and Multi-scale Decoder Strategy
EFSI-DETR uses the Fine-grained Feature Retention (FFR) strategy to address the attenuation of spatial detail in deep layers, which particularly degrades small object accuracy:
- Shallow, high-resolution encoder feature maps are routed into the decoder; the coarsest map is omitted.
- The decoder fuses the retained multi-scale features so that high-resolution features contribute to bounding-box queries.
This retention mechanism preserves fine-scale texture and edge information often critical for detecting small, low-contrast UAV targets.
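The routing rule itself is simple; a sketch over an ordered pyramid of feature maps (the level names here are placeholders, not the paper's notation):

```python
def ffr_select(pyramid):
    """FFR sketch: keep the shallow, high-resolution maps for the decoder
    and drop the coarsest one. `pyramid` maps level name -> feature map,
    ordered shallow (high-res) to deep (coarse)."""
    names = list(pyramid)
    return {n: pyramid[n] for n in names[:-1]}  # omit the coarsest level
```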
5. Performance Evaluation and Ablation Analysis
EFSI-DETR is empirically validated on VisDrone and CODrone datasets using COCO-style metrics. Key implementation parameters include 640×640 input resolution, AdamW optimizer, and inference using TensorRT FP16.
5.1 Benchmark Results
| Model | Params (M) | AP (%) | AP_S (%) | FPS (RTX 4090) |
|---|---|---|---|---|
| EFSI-DETR (640) | 27.3 | 33.1 | 24.8 | 188 |
| RT-DETR-R50 (640) | 42.0 | 26.8 | 18.4 | 196 |
- On VisDrone, EFSI-DETR achieves a 6.3-point gain in AP and a 6.4-point gain in AP_S with 35% fewer parameters relative to RT-DETR-R50.
- On CODrone, EFSI-DETR achieves 20.2% AP (vs. 17.8%) and 4.3% AP_S (vs. 2.9%).
5.2 Ablation Study
Analyzing increments from the baseline (RT-DETR-R18):
| Variant | Added Component(s) | AP (%) | AP_S (%) | Params (M) |
|---|---|---|---|---|
| V_A | Baseline | 26.9 | 18.3 | 25.6 |
| V_B | + FFR | 31.3 | 23.2 | 27.7 |
| V_C | + DyFusNet | 32.7 | 24.6 | 28.8 |
| V_D | + ESFC (full EFSI-DETR) | 33.1 | 24.8 | 27.3 |
FFR demonstrates the largest individual gain, while DyFusNet and ESFC offer significant, additive improvements in both AP and AP_S. A three-expert configuration and placement at the deep encoder stage are empirically optimal for ESFC.
6. Computational Efficiency and System Characteristics
EFSI-DETR maintains high throughput and compactness:
- 27.3 M parameters.
- Approximately 291 GFLOPs for 640×640 input.
- 5.3 ms per image latency (188 FPS) on RTX 4090, tested end-to-end with TensorRT FP16.
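The latency and throughput figures are mutually consistent, as a quick check shows:

```python
# 5.3 ms per image corresponds to roughly the reported 188 FPS.
latency_ms = 5.3
fps = 1000.0 / latency_ms
print(int(fps))  # 188
```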
Compared to traditional static multi-scale fusion (e.g., FPN), DyFusNet yields richer, input-adaptive, context-sensitive representations, which are critical for visually small and ambiguous targets.
7. Significance and Applicability
EFSI-DETR demonstrates that adaptive frequency-spatial fusion, efficient semantic concentration, and fine-grained retention can effectively address the nuanced demands of small object detection in UAV datasets. The integration of these mechanisms enables real-time deployment with state-of-the-art accuracy and resource efficiency, supporting applications requiring precise and rapid perception in aerial and resource-constrained environments (Xia et al., 26 Jan 2026).