
EFSI-DETR: Real-Time UAV Small Object Detector

Updated 2 February 2026
  • EFSI-DETR is a detection framework that employs adaptive frequency-spatial fusion, deep semantic feature extraction, and fine-grained retention to enhance small object detection in UAV images.
  • The model maintains real-time performance (≥188 FPS on RTX 4090) and compact size (27.3M parameters) while achieving state-of-the-art accuracy on UAV benchmarks.
  • Dynamic modules like DyFusNet and ESFC contribute to robust multi-scale feature fusion and semantic concentration, leading to significant improvements in detection precision.

EFSI-DETR (Efficient Frequency-Semantic Integration DETR) is a detection framework designed for real-time small object detection in UAV (Unmanned Aerial Vehicle) imagery. Addressing the challenges of limited feature representation, ineffective multi-scale fusion, and underutilization of frequency information in existing approaches, EFSI-DETR integrates dynamic frequency-spatial fusion, efficient deep semantic feature extraction, and fine-grained retention mechanisms. The model achieves state-of-the-art accuracy on benchmarks such as VisDrone and CODrone, while maintaining real-time speed and compact model size (Xia et al., 26 Jan 2026).

1. Architectural Overview and Design Objectives

EFSI-DETR builds upon the real-time DETR (RT-DETR) backbone, retaining an encoder–decoder pipeline but incorporating specialized modules to address the unique challenges of small object detection in UAV imagery. The principal design goals are:

  • Enhanced small-object representation via adaptive fusion of frequency and spatial cues.
  • Deep semantic feature extraction with low computational overhead.
  • Preservation of high-resolution, fine-grained details necessary for accurate small object localization.
  • Real-time inference speed (≥188 FPS on RTX 4090) with model compactness (27.3 M parameters).

The architectural innovations comprise three main components:

  • Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet): Integrates frequency and spatial feature fusion, implemented in the multi-scale fusion stage of the encoder.
  • Efficient Semantic Feature Concentrator (ESFC): Extracts concentrated deep semantic features from the coarsest encoder maps just before decoding.
  • Fine-grained Feature Retention (FFR): Routes shallow, high-resolution features into the decoder and omits the coarsest map for enhanced edge and texture preservation.

2. Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet)

DyFusNet realizes hardware-friendly, adaptive multi-scale fusion by synthesizing frequency-band decomposition, spatial aggregation, and channel-wise gating using only spatial operations.

2.1 Dynamic Multi-resolution Spectral Decomposition (DMSD)

DMSD is formulated as

$$\mathcal{F}_{\mathrm{DMSD}}(X) = \sum_{i\in\{\mathrm{low},\mathrm{mid},\mathrm{high}\}} \alpha_i(X)\,\mathcal{H}_i(X)$$

where, for an input feature map $X \in \mathbb{R}^{C \times H \times W}$:

  • $\mathcal{H}_{\mathrm{low}}(X) = \mathrm{AvgPool}_{3\times3}(X)$ (low-pass),
  • $\mathcal{H}_{\mathrm{mid}}(X) = X$ (identity, all-pass),
  • $\mathcal{H}_{\mathrm{high}}(X) = \mathrm{Conv}^{\mathrm{dw}}_{3\times3}(X)$ (learnable high-pass).

Content-adaptive soft coefficients $\alpha$ are obtained via global average pooling, a two-layer MLP, and a softmax:

$$[\alpha_{\mathrm{low}},\alpha_{\mathrm{mid}},\alpha_{\mathrm{high}}]^\top = \mathrm{softmax}(W_2\,\mathrm{GELU}(W_1\,\mathrm{GAP}(X)))$$
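The coefficient computation above (GAP → two-layer MLP → softmax over three bands) can be sketched in NumPy. The hidden width and weight shapes below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dmsd_coefficients(x, w1, w2):
    """Content-adaptive band weights: GAP -> MLP -> softmax over 3 bands."""
    gap = x.mean(axis=(1, 2))            # global average pool: (C,)
    return softmax(w2 @ gelu(w1 @ gap))  # (3,) weights for low/mid/high

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((16, C)) * 0.1  # hidden width 16 is illustrative
w2 = rng.standard_normal((3, 16)) * 0.1
alpha = dmsd_coefficients(x, w1, w2)
print(alpha, alpha.sum())  # three non-negative weights summing to 1
```

The softmax guarantees the three band weights form a convex combination, so the fused output stays on the same scale as the input regardless of content.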

2.2 Spatial–Frequency Cooperative Modulation (SFCM)

SFCM refines the fused features by aggregating spatial context across multiple receptive fields and modulating by channel importance:

  • Spatial aggregation:

$$Z(X) = W_{1\times1}*X + \sum_{k\in\{3,5\}} W^{\mathrm{dw}_k}*X$$

  • Channel gating:

$$s = \mathrm{GAP}(Z(X)), \quad \beta(X) = \sigma(W_2\,\mathrm{ReLU}(W_1\,s))$$

Final output:

$$\mathcal{F}_{\mathrm{SFCM}}(X) = Z(X) \odot \beta(X)$$
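The channel-gating step can be sketched directly from the formulas: a squeeze (GAP), a two-layer bottleneck with sigmoid, then per-channel modulation. Weight shapes and the reduction width are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sfcm_gate(z, w1, w2):
    """Channel gate: s = GAP(Z), beta = sigmoid(W2 ReLU(W1 s)); output Z * beta."""
    s = z.mean(axis=(1, 2))                        # squeeze: (C,)
    beta = sigmoid(w2 @ np.maximum(w1 @ s, 0.0))   # per-channel gate in (0, 1)
    return z * beta[:, None, None]                 # broadcast over H, W

rng = np.random.default_rng(1)
C, H, W = 6, 5, 5
z = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((3, C))  # bottleneck width 3 is illustrative
w2 = rng.standard_normal((C, 3))
out = sfcm_gate(z, w1, w2)
print(out.shape)  # (6, 5, 5): same shape, channels rescaled
```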

2.3 Channel Splitting and Fusion

To balance efficiency with representational power, DyFusNet splits channels: only a fraction of the channels ($X_1$) undergoes DMSD→SFCM, while the remainder ($X_2$) bypasses this expensive path. The fused output is:

$$F_{\mathrm{freq}} = \mathrm{SFCM}(\mathrm{DMSD}(X_1)), \quad \mathcal{F}_{\mathrm{DyFusNet}}(X) = \mathrm{Conv}_{1\times1}(\mathrm{Concat}(F_{\mathrm{freq}},\,X_2))$$

This mechanism yields adaptive weighting and synergistic fusion of frequency and spatial detail.
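The split-process-concat-fuse pattern can be sketched with a toy stand-in for the DMSD→SFCM path (the `heavy_fn` lambda and split ratio below are illustrative, not the paper's values); the 1×1 fusion convolution reduces to per-pixel channel mixing:

```python
import numpy as np

def dyfusnet_split(x, heavy_fn, w_fuse, ratio=0.5):
    """Split channels: only the first `ratio` fraction goes through the
    expensive DMSD->SFCM path (heavy_fn); the rest bypass; a 1x1 conv fuses."""
    c1 = int(x.shape[0] * ratio)
    x1, x2 = x[:c1], x[c1:]
    fused = np.concatenate([heavy_fn(x1), x2], axis=0)  # (C, H, W)
    # 1x1 convolution == channel mixing at every spatial position
    return np.einsum('oc,chw->ohw', w_fuse, fused)

rng = np.random.default_rng(2)
C, H, W = 8, 4, 4
x = rng.standard_normal((C, H, W))
w = rng.standard_normal((C, C)) * 0.1
y = dyfusnet_split(x, heavy_fn=lambda t: t * 2.0, w_fuse=w)  # toy heavy path
print(y.shape)  # (8, 4, 4)
```

Processing only half the channels roughly halves the cost of the frequency path while the 1×1 fusion lets the bypassed channels still interact with the refined ones.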

3. Efficient Semantic Feature Concentrator (ESFC)

ESFC operates on the coarsest feature map, focusing on maximum semantic extraction with minimal computation via three sub-modules:

3.1 Dynamic Expert Convolution (DEConv)

Parallel lightweight expert convolutions $W_k$ are combined with input-adaptive weights $\delta_k$:

$$\mathcal{F}_{\mathrm{DEConv}}(X) = \sum_{k=1}^K \delta_k\,(W_k*X)$$
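Because convolution is linear, the weighted sum of expert outputs equals a single convolution with the $\delta$-blended kernel, which is why expert mixtures of this kind add little inference cost. A 1-D NumPy sketch (toy kernels and weights, not the paper's experts):

```python
import numpy as np

def conv1d(x, w):
    # 'valid' cross-correlation as a stand-in for spatial convolution
    return np.convolve(x, w[::-1], mode='valid')

rng = np.random.default_rng(3)
x = rng.standard_normal(32)
experts = [rng.standard_normal(3) for _ in range(3)]  # K=3 expert kernels
logits = rng.standard_normal(3)
delta = np.exp(logits) / np.exp(logits).sum()         # input-adaptive weights

# Sum of weighted expert outputs ...
out_sum = sum(d * conv1d(x, w) for d, w in zip(delta, experts))
# ... equals one conv with the delta-blended kernel (linearity of convolution)
blended = sum(d * w for d, w in zip(delta, experts))
out_blend = conv1d(x, blended)
assert np.allclose(out_sum, out_blend)
```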

3.2 Efficient Ghost Block (EGBlock)

Building on GhostNet, EGBlock generates primary and "ghost" (expanded) features using cheap depthwise convolutions:

$$F_{\mathrm{primary}} = \Phi(W_{\mathrm{primary}}*X)$$

$$F_{\mathrm{ghost}} = \Phi(W_{\mathrm{cheap}}*F_{\mathrm{primary}})$$

$$\mathcal{F}_{\mathrm{EGBlock}}(X) = \mathrm{Concat}(F_{\mathrm{primary}}, F_{\mathrm{ghost}})$$

Multiple EGBlocks (with residual links) constitute the ESFC stack.
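The parameter saving of the ghost scheme is easy to quantify: the primary convolution produces only half the output channels, and the cheap depthwise convolution expands them to full width. The channel counts and kernel sizes below are illustrative (GhostNet-style, bias-free), not the paper's configuration:

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard dense k x k convolution (no bias)."""
    return c_in * c_out * k * k

def egblock_params(c_in, c_out, k=3, d=3):
    """Primary conv produces half the channels; a cheap d x d depthwise
    'ghost' conv expands them to the full width (GhostNet-style, bias-free)."""
    primary = conv_params(c_in, c_out // 2, k)
    ghost = (c_out // 2) * d * d  # depthwise: one d x d filter per channel
    return primary + ghost

c_in, c_out = 256, 256
print(conv_params(c_in, c_out, 3))      # 589824 for a standard 3x3 conv
print(egblock_params(c_in, c_out))      # 296064: roughly half the parameters
```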

3.3 Dual-domain Guidance Aggregation (DGA)

DGA further enhances features by sequential channel and spatial attention, utilizing ECA-Net style channel guidance and a spatial mask based on pooled activations:

k=log2(C)+bγodd,Gc(X)=WkAvgPool(X)k = |\tfrac{\log_2(C)+b}{\gamma}|_{\mathrm{odd}}, \qquad G_c(X) = W_k*\mathrm{AvgPool}(X)

Gs(X)=σ(WsConcat(AvgPool(X),MaxPool(X)))G_s(X) = \sigma(W_s*\mathrm{Concat}(\mathrm{AvgPool}(X),\,\mathrm{MaxPool}(X)))

FDGA(X)=Gs(Gc(X))\mathcal{F}_\mathrm{DGA}(X) = G_s(G_c(X))

Final outputs integrate DEConv, EGBlocks, and DGA within a lightweight dual-branch structure.
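The adaptive kernel size $k$ makes the 1-D channel-guidance convolution scale with the channel count. A minimal sketch of the formula, using the common ECA-Net defaults $\gamma=2$, $b=1$ (the paper's exact constants are not stated here):

```python
import math

def eca_kernel_size(C, gamma=2, b=1):
    """Nearest odd integer to (log2(C) + b) / gamma, as in ECA-Net-style
    channel guidance; gamma=2, b=1 are the common ECA defaults (assumed)."""
    t = int(abs((math.log2(C) + b) / gamma))
    return t if t % 2 == 1 else t + 1

print(eca_kernel_size(64))   # 3
print(eca_kernel_size(256))  # 5
print(eca_kernel_size(512))  # 5
```

Wider feature maps get a larger 1-D kernel, so cross-channel interaction grows logarithmically with $C$ at negligible parameter cost.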

4. Fine-grained Feature Retention and Multi-scale Decoder Strategy

EFSI-DETR uses the Fine-grained Feature Retention (FFR) strategy to address the attenuation of spatial detail in deep layers, which particularly degrades small object accuracy:

  • Shallow, high-resolution encoder feature maps ($S_1, S_2$) are routed into the decoder; the coarsest map ($F_5$) is omitted.
  • The decoder fuses $F_2, F_3, F_4$ so that high-resolution features contribute to bounding-box queries.

This retention mechanism preserves fine-scale texture and edge information often critical for detecting small, low-contrast UAV targets.
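The arithmetic behind FFR is simple: at a 640×640 input, a small target spans very few cells on coarse maps. The strides below (4, 8, 16, 32) are typical of DETR-style backbones and assumed for illustration, not confirmed by the source:

```python
def cells_covered(obj_px, stride):
    """Feature-map cells spanned by an object of obj_px pixels at a given stride."""
    return obj_px / stride

# Assumed typical backbone strides; the paper's exact strides may differ.
for stride in (4, 8, 16, 32):
    size = 640 // stride
    print(f"stride {stride:2d}: {size}x{size} map, 16-px object spans "
          f"{cells_covered(16, stride):.2f} cells")
```

At stride 32 a 16-pixel object occupies half a cell, so discarding the coarsest map and retaining shallow features is what keeps such targets resolvable.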

5. Performance Evaluation and Ablation Analysis

EFSI-DETR is empirically validated on VisDrone and CODrone datasets using COCO-style metrics. Key implementation parameters include 640×640 input resolution, AdamW optimizer, and inference using TensorRT FP16.

5.1 Benchmark Results

Model | Params (M) | AP (%) | AP$_s$ (%) | FPS (RTX 4090)
EFSI-DETR (640) | 27.3 | 33.1 | 24.8 | 188
RT-DETR-R50 (640) | 42.0 | 26.8 | 18.4 | 196
  • On VisDrone, EFSI-DETR achieves a 6.3-point gain in AP and a 6.4-point gain in AP$_s$ with 35% fewer parameters than RT-DETR-R50.
  • On CODrone, EFSI-DETR achieves 20.2% AP (vs. 17.8%) and 4.3% AP$_s$ (vs. 2.9%).
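The stated gains follow directly from the table; a quick consistency check of the reported deltas:

```python
# Figures taken from the VisDrone benchmark table above
params_efsi, params_r50 = 27.3, 42.0
ap_efsi, ap_r50 = 33.1, 26.8
aps_efsi, aps_r50 = 24.8, 18.4

reduction = (params_r50 - params_efsi) / params_r50
print(f"AP gain: {ap_efsi - ap_r50:.1f}")        # 6.3 points
print(f"AP_s gain: {aps_efsi - aps_r50:.1f}")    # 6.4 points
print(f"param reduction: {reduction:.0%}")       # 35%
```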

5.2 Ablation Study

Analyzing increments from the baseline (RT-DETR-R18):

Variant | Added Component(s) | AP (%) | AP$_s$ (%) | Params (M)
V_A | Baseline | 26.9 | 18.3 | 25.6
V_B | + FFR | 31.3 | 23.2 | 27.7
V_C | + DyFusNet | 32.7 | 24.6 | 28.8
V_D | + ESFC (full EFSI-DETR) | 33.1 | 24.8 | 27.3

FFR provides the largest individual gain, while DyFusNet and ESFC deliver significant, additive improvements in both AP and AP$_s$. A three-expert ESFC configuration placed at the deep encoder stage is empirically optimal.

6. Computational Efficiency and System Characteristics

EFSI-DETR maintains high throughput and compactness:

  • 27.3 M parameters.
  • Approximately 291 GFLOPs for 640×640 input.
  • 5.3 ms per image latency (188 FPS) on RTX 4090, tested end-to-end with TensorRT FP16.
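The latency and throughput figures above are mutually consistent, as a one-line check shows:

```python
latency_ms = 5.3                 # reported end-to-end latency per image
fps = 1000.0 / latency_ms        # ~188.7 images per second
print(f"{fps:.1f} FPS")          # consistent with the reported >=188 FPS
```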

Compared to traditional static multi-scale fusion (e.g., FPN), DyFusNet yields richer, input-adaptive, context-sensitive representations, which is critical for visually small and ambiguous targets.

7. Significance and Applicability

EFSI-DETR demonstrates that adaptive frequency-spatial fusion, efficient semantic concentration, and fine-grained retention can effectively address the nuanced demands of small object detection in UAV datasets. The integration of these mechanisms enables real-time deployment with state-of-the-art accuracy and resource efficiency, supporting applications requiring precise and rapid perception in aerial and resource-constrained environments (Xia et al., 26 Jan 2026).
