
DRMNet: Dense Region Mining Network

Updated 4 January 2026
  • DRMNet is a detector that integrates explicit per-pixel density maps to prioritize densely clustered tiny objects in remote sensing scenes.
  • It employs a Density Generation Branch, Dense Area Focusing Module, and Dual Filter Fusion Module to enhance feature precision and boundary sharpness.
  • The architecture, built on a YOLOv8-style framework, delivers state-of-the-art performance on AI-TOD and DTOD benchmarks by optimizing computational focus on high-density regions.

Dense Region Mining Network (DRMNet) is a single-stage detector designed for high-density tiny-object detection in remote sensing imagery. It integrates explicit per-pixel density maps as spatial priors via a dedicated Density Generation Branch, and introduces two density-driven feature modules—Dense Area Focusing Module and Dual Filter Fusion Module. DRMNet systematically directs computational resources towards densely populated object regions, employs localized global attention to balance efficiency with contextual modeling, and leverages frequency-domain decomposition to enhance discrimination and boundary sharpness for tiny objects in heavily occluded, cluttered scenes. Its architecture and training are anchored in a YOLOv8-style framework, and it demonstrates state-of-the-art effectiveness on benchmarks such as AI-TOD and DTOD (Zhao et al., 28 Dec 2025).

1. Architecture and Pipeline

DRMNet augments the YOLOv8 backbone with additional density-centered components for targeted feature enhancement:

  • Backbone and Neck: Employs a YOLOv8-style feature pyramid with layers P2–P5, preserving high-resolution feature maps.
  • Density Generation Branch (DGB): A parallel branch regressing per-pixel density maps at the C2/C3 stages to explicitly model object spatial distribution.
  • Dense Area Focusing Module (DAFM): Operates on shallow features (C2, C3) and the upsampled density map to create compact surrogate representations for densely clustered object regions, followed by lightweight agent-style attention.
  • Dual Filter Fusion Module (DFFM): Processes neck outputs (P2–P4), decomposing features into high- and low-frequency components via discrete cosine transform (DCT) and fusing these via density-guided cross-attention across scales.

The DGB operates in parallel with the backbone during both training and inference. DAFM and DFFM leverage the learned density prior to concentrate modeling capacity where it is most beneficial for dense, small object detection (Zhao et al., 28 Dec 2025).
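The dataflow above can be sketched with placeholder tensors. Every module name, channel count, and stride below is an assumption for illustration (standard YOLOv8-style strides of 4–32), not the authors' implementation:

```python
import numpy as np

# Illustrative dataflow for a DRMNet-style pipeline (shapes only; channel
# counts and strides are assumptions, not taken from the paper's code).
H, W = 640, 640  # input resolution

# YOLOv8-style pyramid levels P2-P5 at strides 4, 8, 16, 32.
pyramid = {f"P{i}": np.zeros((1, 64 * 2 ** (i - 2), H // 2 ** i, W // 2 ** i))
           for i in range(2, 6)}

# Density Generation Branch: a single-channel map at the C2 stride (4).
density = np.zeros((1, 1, H // 4, W // 4))

# DAFM consumes shallow features plus the density prior; DFFM refines P2-P4.
dafm_inputs = (pyramid["P2"], density)
dffm_inputs = [pyramid[k] for k in ("P2", "P3", "P4")]

for name, feat in pyramid.items():
    print(name, feat.shape)
print("density", density.shape)
```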

2. Density Generation Branch (DGB)

The DGB is responsible for predicting a density map $D_{\mathrm{pred}} \in \mathbb{R}^{H\times W}$ that identifies regions harboring dense clusters of tiny objects.

  • Structure:
    • Encoder: Lightweight ResNet-18 extracts features from C2/C3 outputs.
    • Decoder: Three upsampling BasicBlocks reconstruct spatial resolution.
    • Regressor: A $1\times1$ convolution followed by ReLU outputs a single-channel density map.
  • Ground-Truth Construction:
    • Each object center $(\mu_x, \mu_y)$ is assigned an adaptive Gaussian kernel with standard deviation $\sigma_i = \frac{1}{2}\sqrt{H_i^2 + W_i^2}$:

    $$D_i(x,y) = \frac{1}{2\pi\sigma_i^2} \exp\left(-\frac{(x-\mu_x)^2 + (y-\mu_y)^2}{2\sigma_i^2}\right)$$

  • Loss:

    • Per-pixel mean squared error (MSE):

    $$L_{\mathrm{dense}} = \frac{1}{MN} \sum_{x,y} \left(D_{\mathrm{gt}}(x,y) - D_{\mathrm{pred}}(x,y)\right)^2$$

    • Overall training combines classification, localization, and density supervision: $L = L_{\mathrm{cls}} + L_{\mathrm{reg}} + \lambda L_{\mathrm{dense}}$ (default $\lambda = 1$).

This explicit density modeling introduces quantifiable spatial priors, enabling the subsequent modules to adaptively allocate resources to high-density regions (Zhao et al., 28 Dec 2025).
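The ground-truth construction and density loss above transcribe directly into code; boundary clipping and any map normalization are details the source does not specify, so they are omitted here:

```python
import numpy as np

def density_ground_truth(boxes, h, w):
    """Build the ground-truth density map by placing one adaptive Gaussian
    per object center, with sigma_i = 0.5 * sqrt(H_i^2 + W_i^2) as in the
    paper's formula (image-border clipping is ignored in this sketch)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.zeros((h, w))
    for (cx, cy, bw, bh) in boxes:  # center x/y, box width/height
        sigma = 0.5 * np.hypot(bw, bh)
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        d += g / (2 * np.pi * sigma ** 2)
    return d

def density_loss(d_gt, d_pred):
    """Per-pixel MSE: L_dense = mean((D_gt - D_pred)^2)."""
    return np.mean((d_gt - d_pred) ** 2)

# Two tiny objects; the map peaks at their centers.
d = density_ground_truth([(10, 10, 6, 6), (40, 30, 4, 8)], 48, 64)
print(density_loss(d, np.zeros_like(d)))
```

Larger boxes get wider, flatter Gaussians, so the summed map is largest where many small objects crowd together.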

3. Dense Area Focusing Module (DAFM)

The DAFM exploits the density map to precisely localize and efficiently process object-rich subregions:

  • Region Selection: Identifies the top-$\tau$ percentile of pixels in the density map, clusters these with K-Means ($k=2$), and merges the resulting bounding rectangles to define a refined binary mask $M' \in \{0,1\}^{h\times w}$.

  • Surrogate Representation: Applies $M' \odot X_{C2}$ (elementwise product) and $7\times7$ average pooling, followed by $1\times1$ convolution to yield surrogate feature $N$:

$$N = \mathrm{Conv}^{1\times1}\big(\mathrm{Pool}^{7\times7}(M' \odot X)\big)$$

  • Agent-Style Attention: Performs a two-stage interaction between global features $X$ and the compact representation $N$:

    • $O_A = \sigma\big(\frac{N K_X^\top}{\sqrt{d}} + b\big)$
    • $Y = \sigma\big(\frac{Q_X O_A^\top}{\sqrt{d}} + b\big) \odot O_A$
  • Output Update: Integrates local details via depthwise separable convolution: $X' = Y + \mathrm{DWConv}(X)$.

By constraining attention to the surrogate NN, DAFM achieves near-global context modeling with substantially reduced computational cost, effectively suppressing background while capturing salient detail in crowded regions (Zhao et al., 28 Dec 2025).
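The region-selection and surrogate steps can be sketched as follows. The percentile value, the inlined 2-means routine (standing in for K-Means), and the omission of the $1\times1$ channel-mixing convolution (the toy features here are 2-D) are simplifications, not the paper's implementation:

```python
import numpy as np

def dense_region_mask(density, tau=95.0, iters=10):
    """Threshold at the top-tau percentile, split the hot pixels into k=2
    coordinate clusters, then fill each cluster's bounding rectangle to
    form the binary mask M'."""
    h, w = density.shape
    ys, xs = np.nonzero(density >= np.percentile(density, tau))
    pts = np.stack([ys, xs], axis=1).astype(float)
    centers = pts[[0, -1]]           # crude 2-means initialization
    for _ in range(iters):
        lbl = np.argmin(((pts[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if np.any(lbl == k):
                centers[k] = pts[lbl == k].mean(0)
    mask = np.zeros((h, w), dtype=np.uint8)
    for k in range(2):
        c = pts[lbl == k].astype(int)
        if len(c):
            mask[c[:, 0].min():c[:, 0].max() + 1,
                 c[:, 1].min():c[:, 1].max() + 1] = 1
    return mask

def surrogate(x, mask, p=7):
    """Mask the features (M' * X), then 7x7 average pooling; the 1x1
    channel-mixing convolution is omitted in this 2-D toy."""
    h, w = x.shape
    m = (x * mask)[:h // p * p, :w // p * p]
    return m.reshape(h // p, p, w // p, p).mean(axis=(1, 3))

# Toy density map with two dense "islands" of tiny objects.
density = np.zeros((28, 28))
density[2:6, 2:6] = 1.0
density[20:25, 20:25] = 1.0
m = dense_region_mask(density)
s = surrogate(np.ones((28, 28)), m)
print(m.sum(), s.shape)  # mask covers the 4x4 and 5x5 islands
```

The surrogate $N$ is much smaller than the full feature map, which is what makes attending against it cheap relative to global self-attention.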

4. Dual Filter Fusion Module (DFFM)

DFFM enhances feature discrimination and boundary sensitivity by explicitly disentangling multi-scale cues:

  • Branching: Each feature map $P_i$ is split into four branches (average pooling with kernels $\{3, 6, 9\}$ and an unpooled branch), each passed through an EDH block.
  • Frequency Decomposition (EDH Block):
    • Transforms $F \in \mathbb{R}^{C\times H \times W}$ via 2D DCT to obtain $\hat{F}$.
    • Generates a low-frequency mask $M_{\mathrm{low}} = \sigma(W_{\mathrm{low}} \cdot (P \odot D'))$; $M_{\mathrm{high}} = 1 - M_{\mathrm{low}}$.
    • Separates bands: $F_{\mathrm{low}} = \hat{F} \odot M_{\mathrm{low}}$, $F_{\mathrm{high}} = \hat{F} \odot M_{\mathrm{high}}$.
    • Inverse DCT reconstructs the spatial domain; the low-frequency branch receives channel attention (CA) and the high-frequency branch spatial attention (SA), yielding $F_l$ and $F_h$.
  • Density-Guided Cross-Attention: Fusion weights $A$ are computed via:

$$A = \mathrm{Softmax}\left((W_h F_h \odot D')^\top \cdot (W_l F_l \odot (1-D'))\right)$$

  • Output Fusion: Each branch output is $H_i = A \cdot P_i \oplus D'$, with branch summation for the final output.

This mechanism adaptively emphasizes fine edge information (high-frequency) in dense regions while integrating broader semantic context, leading to improved localization and reduced background interference (Zhao et al., 28 Dec 2025).
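The band-splitting idea can be illustrated with an explicit orthonormal DCT-II matrix and a hard low-frequency mask standing in for the learned sigmoid mask $M_{\mathrm{low}}$ (the learned mask, channel/spatial attention, and cross-attention fusion are omitted):

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that C @ C.T == I."""
    k = np.arange(n)[:, None]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (np.arange(n) + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

def split_bands(f, keep=4):
    """2D DCT of f, keep the top-left `keep` x `keep` coefficients as the
    low-frequency band (a hard stand-in for the learned M_low), then
    inverse-transform both bands back to the spatial domain."""
    ch, cw = dct_matrix(f.shape[0]), dct_matrix(f.shape[1])
    fhat = ch @ f @ cw.T                       # forward 2D DCT
    m_low = np.zeros_like(fhat)
    m_low[:keep, :keep] = 1.0
    f_low = ch.T @ (fhat * m_low) @ cw         # inverse DCT, low band
    f_high = ch.T @ (fhat * (1 - m_low)) @ cw  # inverse DCT, high band
    return f_low, f_high

x = np.random.default_rng(0).normal(size=(16, 16))
lo, hi = split_bands(x)
print(np.allclose(lo + hi, x))
```

Because the masks are complementary and the transform is orthonormal, the two reconstructed bands sum exactly back to the input, so the split loses no information while letting each band receive its own attention pathway.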

5. Training Paradigm

The overall loss function is $L = L_{\mathrm{cls}} + L_{\mathrm{reg}} + \lambda L_{\mathrm{dense}}$:

  • Classification Loss: Focal loss or cross-entropy, as customary in object detection.
  • Regression Loss: CIoU or $L_1$ loss.
  • Density Loss: Pixel-wise MSE as described above.

Supervising the DGB in tandem with standard detection objectives was observed to yield progressively refined density priors as training proceeds, enhancing end-to-end feature learning through a positive feedback effect (Zhao et al., 28 Dec 2025).
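A minimal sketch of the combined objective, with scalar stand-ins for the classification and regression terms:

```python
import numpy as np

def total_loss(l_cls, l_reg, d_gt, d_pred, lam=1.0):
    """L = L_cls + L_reg + lambda * L_dense, where L_dense is the
    per-pixel MSE between density maps (default lambda = 1)."""
    l_dense = np.mean((d_gt - d_pred) ** 2)
    return l_cls + l_reg + lam * l_dense

d_gt = np.full((8, 8), 0.5)
print(total_loss(0.7, 0.3, d_gt, np.zeros_like(d_gt)))  # 0.7 + 0.3 + 0.25
```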

6. Experimental Evaluation

DRMNet's effectiveness is validated on AI-TOD and DTOD, two benchmarks with extremely dense, small objects:

| Dataset | Backbone | AP | AP_50 | AP_75 | AP_vt | Highlighted Gains |
|---------|----------|----|-------|-------|-------|-------------------|
| AI-TOD | DarkNet-53 | 31.9 | 65.0 | 26.4 | 13.3 | +1.8 AP, +2.9 AP_50 vs. YOLOv8 |
| DTOD | CSPDarkNet-s | — | 26.5 | — | 4.0 | +2.3 AP_50 vs. SCDNet |

Key findings include:

  • On AI-TOD, DRMNet attains AP = 31.9, AP_50 = 65.0, AP_vt = 13.3, surpassing state-of-the-art detectors (e.g., FSANet, MENet, BAFNet).
  • On DTOD, DRMNet yields AP_50 = 26.5 (+2.3 over SCDNet), outperforming YOLOv8/v10/v11 by up to 16.5%.
  • Ablations on AI-TOD confirm the additive contribution of each component: DGB (+0.8 AP_50), DAFM (+2.1 AP_50, +0.9 AP_vt), DFFM (+0.8 AP_50). The full model achieves +2.9 AP_50, +2.7 AP_vt over baseline YOLOv8.
  • DAFM achieves comparable AP to global multi-head self-attention (MSA) while reducing FLOPs by ~75% (Zhao et al., 28 Dec 2025).

7. Functional Insights and Significance

The introduction of explicit density maps as guidance focuses modeling capability on object-rich “islands” amidst large, sparse scenes, reducing redundant computation and background confusion. DAFM’s surrogate extraction yields informative representations of clusters, making high-context attention tractable in real-time. DFFM’s frequency-domain separation and density-guided cross-attention enable context-sensitive fusion, sharpening object boundaries and improving recall of tiny instances. Joint density and detection supervision fosters a synergistic dynamic, repeatedly refining the alignment between density priors and feature representations as training advances.

Collectively, DRMNet’s architecture—grounded in density modeling, adaptive regional attention, and frequency-based regularization—establishes state-of-the-art performance for dense, tiny-object detection in remote sensing scenarios (Zhao et al., 28 Dec 2025).
