Dual Global Pooling Feature Fusion
- Dual Global Pooling Feature Fusion is a technique that employs parallel max and average pooling to preserve multi-scale features during spatial down-sampling in CNNs.
- The DPDF block integrates partial convolution with sequential spatial and channel attention modules, enabling efficient edge and context fusion through a learnable coefficient.
- Empirical results in YOLO-FireAD demonstrate that DPDF improves mAP and reduces FLOPs, highlighting its practical benefits for small-object and fire detection scenarios.
Dual Global Pooling Feature Fusion, specifically the Dual Pool Downscale Fusion (DPDF) block, is a feature-processing mechanism designed to minimize information loss during spatial down-sampling in convolutional neural networks. Introduced in the YOLO-FireAD model for efficient and accurate fire detection, DPDF employs parallel max and average pooling pathways with subsequent lightweight attention and learnable fusion to preserve critical multi-scale feature patterns, particularly mitigating failure in small-object detection scenarios (Pan et al., 27 May 2025).
1. Formal Definition and Block Structure
The DPDF block receives an input feature tensor $X \in \mathbb{R}^{C \times H \times W}$ and outputs a downsampled tensor $Y \in \mathbb{R}^{C' \times (H/2) \times (W/2)}$, typically with $C' = C$ unless a channel-reduction policy is invoked. The dual global pooling fusion pipeline proceeds through two parallel paths: spatial max pooling and average pooling, each with kernel size and stride 2:
- Max pooling: $U_{\max} = \mathrm{MaxPool}_{2}(X)$
- Average pooling: $U_{\mathrm{avg}} = \mathrm{AvgPool}_{2}(X)$
Each pooled tensor passes through a partial convolution (PConv), which applies a depthwise convolution to a fixed fraction of the channels and leaves the remainder unaltered, followed by sequential spatial and channel attention modules:
$\tilde{X}_{\max} = \mathrm{CA}(\mathrm{SA}(U_{\max})), \quad \tilde{X}_{\mathrm{avg}} = \mathrm{CA}(\mathrm{SA}(U_{\mathrm{avg}}))$
Learnable fusion is performed as $F_{\mathrm{fused}} = \alpha \tilde{X}_{\max} + (1-\alpha)\tilde{X}_{\mathrm{avg}}$, where $\alpha$ is a learned scalar or channel-wise vector.
A final convolution (optionally with batch normalization and ReLU) produces the output $Y$, which then replaces standard stride-2 downsampled features in the network backbone or neck.
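The dual-pool fusion arithmetic above can be sketched in NumPy. This is a minimal illustration of the pooling and $\alpha$-weighted fusion steps only; the PConv, attention, and final-convolution stages are omitted, and the function name `dpdf_downscale` is illustrative rather than taken from the paper.

```python
import numpy as np

def pool2x2(x, mode):
    # x: (C, H, W) feature map; non-overlapping 2x2 windows, stride 2
    c, h, w = x.shape
    blocks = x[:, : h - h % 2, : w - w % 2].reshape(c, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4)) if mode == "max" else blocks.mean(axis=(2, 4))

def dpdf_downscale(x, alpha=0.5):
    """Sketch of the dual-pool fusion: F = alpha * max-path + (1 - alpha) * avg-path.
    PConv, spatial/channel attention, and the final convolution are omitted."""
    u_max = pool2x2(x, "max")
    u_avg = pool2x2(x, "avg")
    return alpha * u_max + (1.0 - alpha) * u_avg
```

With `alpha=1.0` the block reduces to pure max pooling, with `alpha=0.0` to pure average pooling; intermediate learned values interpolate between the two pathways per channel.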
2. Dual-Pooling Rationale and Feature Preservation
Max pooling and average pooling offer complementary feature-preserving properties. Max pooling tends to retain strong local activations and preserves sharp edge information, which is crucial for detecting salient, high-contrast features such as flame edges in fire detection. Conversely, average pooling captures broader contextual distributions, preserving low-contrast and smoothly varying structures—such as diffuse smoke or faint flame gradients.
The fusion process leverages the learnable coefficient $\alpha$, enabling the block to dynamically prioritize "edge emphasis" from max pooling or "context smoothing" from average pooling, according to the discriminative needs of each channel. This adaptivity is especially relevant for preserving visual signatures of small or faint fires, which frequently disappear under aggressive down-sampling.
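The complementarity argued above can be seen on a toy activation map. The 4×4 map and its values are contrived for illustration: one bright pixel stands in for a sharp flame edge, a faint uniform patch for diffuse smoke.

```python
import numpy as np

# Toy 4x4 activation map: one bright "flame edge" pixel plus a faint,
# diffuse region (stand-ins for the two feature types discussed above).
x = np.zeros((4, 4))
x[0, 0] = 9.0       # sharp, high-contrast activation
x[2:, 2:] = 0.2     # low-contrast, smoothly varying region

def pool2x2(a, mode):
    blocks = a.reshape(2, 2, 2, 2)  # (row-block, in-row, col-block, in-col)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

mx, av = pool2x2(x, "max"), pool2x2(x, "avg")
# Max pooling keeps the bright spike at full strength (9.0), while averaging
# dilutes it to 2.25; both represent the diffuse 0.2 region faithfully, but
# averaging does so without amplifying noise. A learned alpha blends the two.
```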
3. Lightweight Feature Transformation Components
To maximize computational efficiency, each pooling branch incorporates partial convolution before the attention modules. Partial convolution (PConv) restricts depthwise convolution to a subset of the channels, leaving the remaining channels unaltered. This design cuts parameter count and FLOPs while maintaining key transformation capacity.
Following PConv, each branch applies spatial attention (SA) using dilated (atrous) convolutions followed by sigmoid gating to sharpen spatial saliency. Channel attention (CA) deploys a micro-MLP atop global average pooled statistics to reweight the channel dimension, allowing selective amplification or suppression of pooled features before fusion. Notably, DPDF does not use explicit residual connections from input to output, relying instead on the compositional effect of the attention-calibrated fusion and convolution for information flow and compatibility.
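The PConv idea can be sketched as follows. The 3×3 averaging kernel and the 1/4 channel ratio are illustrative assumptions, not values confirmed for YOLO-FireAD; the point is only that filtered and pass-through channels coexist in one output.

```python
import numpy as np

def partial_dwconv(x, ratio=0.25):
    """Partial-convolution sketch: depthwise 3x3 filtering on the first
    ratio * C channels; the remaining channels pass through untouched.
    (Kernel and ratio here are illustrative, not from the paper.)"""
    c, h, w = x.shape
    cp = max(1, int(c * ratio))               # number of filtered channels
    kernel = np.full((3, 3), 1.0 / 9.0)       # stand-in depthwise filter
    out = x.copy()
    padded = np.pad(x[:cp], ((0, 0), (1, 1), (1, 1)))  # zero-pad H, W by 1
    for ch in range(cp):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = (padded[ch, i:i + 3, j:j + 3] * kernel).sum()
    return out
```

Because only `cp` of the `C` channels are convolved, the multiply-accumulate cost scales with `ratio` rather than with the full channel count, which is the source of PConv's FLOPs savings.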
4. Network Integration and Downscale Hierarchy
DPDF is architecturally positioned to supplant standard stride-2 pooling or convolution in both backbone and neck stages within the detection network. In YOLO-FireAD, the DPDF output $Y$ is subsequently aligned with lateral features using upsampling and concatenation operations as part of a bidirectional FPN-style feature aggregation scheme.
The absence of a direct skip or residual path within the DPDF block is offset by its attention-based calibration and fusion, while its final convolution ensures channel compatibility for downstream concatenation and fusion steps.
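The upsample-and-concatenate alignment described above amounts to simple shape bookkeeping, sketched here with hypothetical feature sizes (64×20×20 and 64×40×40 are assumptions, not shapes from the paper):

```python
import numpy as np

# Hypothetical shapes: a DPDF-downscaled feature (coarse) re-aligned with a
# finer lateral feature for FPN-style channel concatenation.
rng = np.random.default_rng(0)
coarse = rng.random((64, 20, 20))     # C x H/2 x W/2, after DPDF downscale
lateral = rng.random((64, 40, 40))    # finer-resolution lateral feature

up = coarse.repeat(2, axis=1).repeat(2, axis=2)      # nearest-neighbor 2x upsample
fused_input = np.concatenate([up, lateral], axis=0)  # channel-wise concat
```

After concatenation the channel count doubles (here 64 → 128), which is why a channel-aligning convolution typically follows before the next fusion stage.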
5. Empirical Performance and Ablation Evidence
Ablation studies reported in YOLO-FireAD (Pan et al., 27 May 2025) isolate the contribution of DPDF in the context of fire detection. Table 3 from the cited work demonstrates the following quantitative impact:
| Configuration | mAP50-95 (%) | Params (M) | FLOPs (G) |
|---|---|---|---|
| Baseline (YOLOv8n) | 32.8 | 3.01 | 8.1 |
| +DPDF only | 34.5 | 2.52 | 6.9 |
DPDF alone improves mAP50-95 by 1.7 percentage points (32.8% → 34.5%) and reduces FLOPs by approximately 15% (8.1 → 6.9 G) relative to the baseline, without AIR blocks. When used in conjunction with attention-guided inverted residual (AIR) blocks, YOLO-FireAD reaches 34.6% mAP50-95 with only 1.45M parameters. This evidence underlines DPDF's central role in small-object detection and efficient model scaling.
6. Specialization for Small Object (Small-fire) Detection
The dual-fusion mechanism proves especially effective in preventing the disappearance of small or low-intensity fire features during down-sampling. Max pooling secures high activation for bright, tiny flame regions, while average pooling maintains visibility for low-contrast smoke and blurred patterns. The learned optimal weighting enables per-channel adaptation, ensuring faint patterns remain discernible—addressing a key failure mode of standard pooling schemes.
A plausible implication is that similar dual-pool feature fusion blocks could generalize beyond fire detection, wherever small or weakly contrasted objects are subject to feature loss during down-sampling in convolutional architectures.
7. Relation to Broader Object Detection Architectures
DPDF builds upon and extends the design space of pooling-based down-sampling and feature fusion in object detection. By explicitly learning to fuse edge-focused and context-focused pooling outputs, DPDF introduces a flexible, efficient mechanism for controlling feature attenuation and amplification in deep detection backbones and necks. Its integration in a YOLO-style FPN topology demonstrates compatibility with contemporary real-time detection frameworks, achieving state-of-the-art mAP at significant reductions in parameter count and computational footprint compared to YOLOv8n and its successors (Pan et al., 27 May 2025).