Dual-Stream Saliency Enhancement Block
- The paper demonstrates that integrating local convolutional processing with direction-aware global context significantly improves detection precision and reduces false alarms.
- The DSE block employs a dual-stream architecture to capture fine spatial details and long-range contextual information effectively.
- Empirical results on the IRSTD-1K dataset confirm enhancements in IoU, detection probability, and reduced false alarm rates.
The Dual-stream Saliency Enhancement (DSE) block is a neural module introduced in the DCCS-Det detector for infrared small target detection (IRSTD), designed to address challenges in modeling both local fine details and long-range directional context in complex scenes. DSE achieves this by integrating two parallel computational streams—Local Perception and Direction-Aware Context—and fusing their outputs with channel-wise attention and residual connections. Its empirical efficacy is validated via ablation studies, which show measurable improvements in both precision and target-background discrimination (Li et al., 23 Jan 2026).
1. Architectural Description
The DSE block first splits its input tensor along the channel dimension into left and right sub-tensors of equal channel width. The left branch executes a local convolutional sequence, while the right branch synthesizes direction-aware global context using a specialized scanning and modeling procedure (SS2D). Both branches apply channel-wise attention, and their outputs are concatenated, channel-shuffled, and merged with the input via a residual connection.
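The split → two streams → concat → shuffle → residual pipeline can be sketched structurally. The sketch below uses NumPy with a channels-first layout and identity-like placeholder streams (all assumptions, not the paper's implementation); it shows why the block is shape-preserving and therefore drop-in for a backbone.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups (channels-first layout: C, H, W)."""
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def dse_skeleton(x, local_stream, context_stream):
    """Structural sketch of the DSE block: channel split -> two streams
    -> concat -> channel shuffle -> residual add. Shapes are preserved."""
    c = x.shape[0]
    x_left, x_right = x[: c // 2], x[c // 2:]
    f_local = local_stream(x_left)       # fine spatial detail
    f_context = context_stream(x_right)  # direction-aware global context
    fused = np.concatenate([f_local, f_context], axis=0)
    return x + channel_shuffle(fused, groups=2)

# Placeholder streams stand in for the real branches.
x = np.random.rand(8, 16, 16).astype(np.float32)
y = dse_skeleton(x, local_stream=lambda t: t * 2.0, context_stream=lambda t: t + 1.0)
assert y.shape == x.shape
```

With `groups=2`, the shuffle interleaves channels from the two branches so the residual sum mixes local and contextual features across the full channel range.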
2. Local Perception Stream
The Local Perception stream receives the left sub-tensor and applies consecutive convolutions with ReLU activations, followed by a point-wise (1×1) convolution.
- The output preserves spatial resolution and channel dimension. This stream emphasizes local spatial details and fine target structure, serving as the local saliency map.
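A minimal NumPy sketch of this stream follows. The 3×3 kernel size and the two-stage depth are illustrative assumptions (the paper's exact sizes are not reproduced here); the point is that 'same' padding plus a final point-wise projection keeps spatial resolution and channel count unchanged.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2D convolution (cross-correlation).
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd), dtype=x.dtype)
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + h, dj:dj + wd]
    return out

def local_stream(x, w1, w2, w3):
    """Local Perception sketch: two conv+ReLU stages, then a
    point-wise (1x1) conv. Kernel sizes here are assumptions."""
    h = np.maximum(conv2d_same(x, w1), 0.0)  # conv + ReLU
    h = np.maximum(conv2d_same(h, w2), 0.0)  # conv + ReLU
    return conv2d_same(h, w3)                # 1x1 point-wise projection

rng = np.random.default_rng(0)
c = 4
x = rng.standard_normal((c, 8, 8)).astype(np.float32)
w1 = rng.standard_normal((c, c, 3, 3)).astype(np.float32) * 0.1
w2 = rng.standard_normal((c, c, 3, 3)).astype(np.float32) * 0.1
w3 = rng.standard_normal((c, c, 1, 1)).astype(np.float32) * 0.1
y = local_stream(x, w1, w2, w3)
assert y.shape == x.shape  # spatial resolution and channels preserved
```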
3. Direction-Aware Context Stream
The Direction-Aware Context stream operates on the right sub-tensor as follows:
- LayerNorm is applied to the incoming features.
- A linear projection (channel expansion) precedes a depthwise convolution with SiLU activation.
- SS2D scans the feature map in four primary directions (vertical-forward, horizontal-forward, vertical-backward, horizontal-backward), unfolding the map along each direction into a 1D sequence.
- Each sequence is processed by an SSM block (S6), refolded to 2D, and the four refolded maps are aggregated into a single direction-context map.
- A second LayerNorm normalizes the aggregated map.
- A parallel gating path runs alongside the main branch.
- The gating signal modulates the main-path features element-wise, followed by a final linear projection.
This stream synthesizes global, direction-sensitive context and captures long-range spatial dependencies; its output is the aggregated direction-context map.
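The four-directional unfold/refold at the heart of SS2D can be sketched as follows. The `toy_ssm` recurrence is a deliberately simplified stand-in for the real S6 selective state-space block (a scalar decay recurrence, assumed here for illustration); summation is used as the aggregation, a common choice in SS2D-style modules.

```python
import numpy as np

def scan_paths(h, w):
    """Index orders for four scan directions: horizontal forward/backward
    (row-major) and vertical forward/backward (column-major)."""
    idx = np.arange(h * w).reshape(h, w)
    hf = idx.reshape(-1)      # horizontal forward
    vf = idx.T.reshape(-1)    # vertical forward
    return [hf, hf[::-1], vf, vf[::-1]]

def toy_ssm(seq, decay=0.9):
    """Stand-in for the S6 block: per-channel linear recurrence
    h_t = decay * h_{t-1} + x_t. Illustrative only, not the real S6."""
    out = np.zeros_like(seq)
    state = np.zeros(seq.shape[0], dtype=seq.dtype)
    for t in range(seq.shape[1]):
        state = decay * state + seq[:, t]
        out[:, t] = state
    return out

def ss2d(x):
    """Unfold (C, H, W) along four directions, run the toy SSM on each
    1D sequence, refold to 2D, and sum the four direction maps."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    agg = np.zeros_like(flat)
    for order in scan_paths(h, w):
        seq = flat[:, order]              # gather in scan order
        y = toy_ssm(seq)
        refolded = np.zeros_like(flat)
        refolded[:, order] = y            # scatter back to spatial positions
        agg += refolded
    return agg.reshape(c, h, w)

x = np.random.rand(2, 4, 4).astype(np.float32)
y = ss2d(x)
assert y.shape == x.shape
```

Because each 1D scan carries state across the whole sequence, every output position can be influenced by distant pixels along that scan direction, which is what gives the stream its long-range, direction-aware receptive field.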
4. Attention Mechanisms and Fusion
Each branch computes channel-wise attention weights from global average pooling and global max pooling of its feature map, followed by a sequence of convolutions and a sigmoid activation. Branch outputs are modulated by their attention vectors and concatenated channel-wise. A channel shuffle then mixes feature information across the two halves, and a residual sum with the original input produces the block output.
The final output matches the input tensor’s shape, enabling direct integration into backbone architectures without spatial or channel misalignment.
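The attention-and-fusion stage can be sketched compactly. The shared projection matrix `w` stands in for the convolution sequence on the pooled descriptors, and additive combination of the two pooled statistics is an assumption; the point is that each branch is gated per channel in (0, 1) before the shuffle and residual add.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w):
    """Channel attention sketch: global average + max pooling over space,
    a shared projection (matrix w, standing in for 1x1 convs), sigmoid gate."""
    avg = f.mean(axis=(1, 2))           # (C,)
    mx = f.max(axis=(1, 2))             # (C,)
    return sigmoid(w @ avg + w @ mx)    # (C,) weights in (0, 1)

def channel_shuffle(x, groups):
    c, h, w = x.shape
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

def fuse(x, f_local, f_context, w_l, w_c):
    """Attention-modulate each branch, concat, shuffle, residual add."""
    a_l = channel_attention(f_local, w_l)[:, None, None]
    a_c = channel_attention(f_context, w_c)[:, None, None]
    fused = np.concatenate([a_l * f_local, a_c * f_context], axis=0)
    return x + channel_shuffle(fused, groups=2)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 6, 6)).astype(np.float32)
fl, fc = x[:4] * 1.5, x[4:] * 0.5      # placeholder branch outputs
wl = rng.standard_normal((4, 4)).astype(np.float32) * 0.1
wc = rng.standard_normal((4, 4)).astype(np.float32) * 0.1
y = fuse(x, fl, fc, wl, wc)
assert y.shape == x.shape
```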
5. Mathematical Formulation
Summarizing the main transformations (notation introduced here for exposition; write $X \in \mathbb{R}^{H \times W \times C}$ for the block input and $X_L, X_R$ for its channel-split halves):
- Local stream: $F_{\mathrm{loc}} = \mathrm{Conv}_{1\times 1}\big(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(X_L))))\big)$
- Directional context: for each scan direction $d \in \{1,\dots,4\}$, $S_d = \mathrm{S6}\big(\mathrm{scan}_d(\mathrm{SiLU}(\mathrm{DWConv}(\mathrm{Linear}(\mathrm{LN}(X_R)))))\big)$, and $F_{\mathrm{ctx}} = \mathrm{Linear}\big(\mathrm{LN}\big(\textstyle\sum_{d=1}^{4}\mathrm{fold}(S_d)\big)\odot G\big)$, where $G$ is the output of the parallel gating path and $\odot$ denotes element-wise multiplication.
- Final fusion: $Y = X + \mathrm{Shuffle}\big(\mathrm{Concat}(A_{\mathrm{loc}}\odot F_{\mathrm{loc}},\, A_{\mathrm{ctx}}\odot F_{\mathrm{ctx}})\big)$, with $A_{\mathrm{loc}}, A_{\mathrm{ctx}}$ the channel-attention vectors of each branch.
All convolutions in the block, unless specified otherwise, are followed by normalization (LayerNorm/BatchNorm) and SiLU or ReLU activations.
6. Hyperparameters and Computational Properties
The DSE block splits channels evenly between its two branches, uses standard and depthwise convolutions in the processing streams, and point-wise convolutions for projections and attention computations. SS2D applies four directional scans with stride 1 and dilation 1; the S6 block state dimension is typically 64 (per the DCCS-Det code). Each DSE block introduces approximately 80,000 parameters and 0.12 GFLOPs at 256×256 resolution; in DCCS-Det, four such blocks collectively account for 0.32M parameters and 0.48 GFLOPs, a moderate incremental cost relative to base network complexity.
| Component | Parameters (K) | GFLOPs (256×256) |
|---|---|---|
| DSE Block (single) | ~80 | ~0.12 |
| DSE Block ×4 (full) | ~320 | ~0.48 |
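Parameter and FLOP figures like those in the table come from standard per-layer accounting, sketched below. The 64-channel, 3×3, 256×256 example values are illustrative, not the paper's exact layer configuration.

```python
def conv_params(c_in, c_out, k, depthwise=False, bias=True):
    """Parameter count for a 2D conv layer."""
    if depthwise:
        assert c_in == c_out, "depthwise conv keeps channel count"
        weights = c_in * k * k          # one k x k filter per channel
    else:
        weights = c_in * c_out * k * k
    return weights + (c_out if bias else 0)

def conv_flops(c_in, c_out, k, h, w, depthwise=False):
    """Multiply-accumulate count for a stride-1, 'same'-padded conv."""
    per_pixel = (k * k) if depthwise else (c_in * k * k)
    return c_out * per_pixel * h * w

# Illustrative example: a 3x3 conv mapping 64 -> 64 channels on a 256x256 map.
p = conv_params(64, 64, 3)
f = conv_flops(64, 64, 3, 256, 256)
assert p == 64 * 64 * 9 + 64
```

Note how depthwise convolutions cut both counts by a factor of `c_in`, which is why the context stream pairs them with cheap point-wise projections.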
7. Empirical Performance and Evidence
Ablation studies on the IRSTD-1K dataset quantitatively establish the enhancement brought by the DSE block. Integrating DSE (without LaSEA) into the DCCS-Det backbone delivers:
- IoU: 68.05% (vs 67.16% baseline; +0.89 pp)
- Detection probability: 95.24% (vs 93.88%)
- False alarm rate: reduced relative to the baseline
This improvement is attributed to joint modeling of local features (via the convolutional stream) and long-range, directionally contextual features (via SS2D) (Li et al., 23 Jan 2026). The design thereby addresses spatial discriminability and semantic retention, key challenges in IR small target detection. A plausible implication is applicability to other domains requiring fine-grained local/global feature synthesis under weak signal conditions.
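For reference, the three metrics above can be computed as follows. The pixel-level IoU is standard; the detection-probability and false-alarm conventions shown are one common IRSTD definition and may differ in detail from the paper's evaluation protocol.

```python
import numpy as np

def iou(pred, gt):
    """Pixel-level IoU between binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def detection_metrics(detected_targets, total_targets, false_pixels, total_pixels):
    """P_d = fraction of ground-truth targets detected; F_a = false-alarm
    pixels per image pixel (one common convention; others exist)."""
    p_d = detected_targets / total_targets
    f_a = false_pixels / total_pixels
    return p_d, f_a

# Toy masks: predicted 2x2 region vs a 2x3 ground-truth region.
pred = np.zeros((8, 8), bool); pred[2:4, 2:4] = True
gt = np.zeros((8, 8), bool); gt[2:4, 2:5] = True
assert abs(iou(pred, gt) - 4 / 6) < 1e-9  # intersection 4 px, union 6 px
```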
8. Context, Limitations, and Related Work
The DSE block is specialized for infrared small target detection where local-background discrimination and context-aware aggregation are crucial. Its dual-stream construction and SS2D directional mechanism distinguish it from purely convolutional or transformer-based attention modules, mitigating semantic dilution and redundancy typical in deep CNN backbones. While DSE’s incremental FLOPs and parameter overhead are moderate, scalability and generalization to larger networks or different input modalities remain for further empirical assessment. The modular nature of the block facilitates integration into alternative architectures, supporting reproducibility via official code release (Li et al., 23 Jan 2026).
9. Summary of Contributions and Research Significance
The Dual-stream Saliency Enhancement block provides an explicit, computationally efficient dual-branch fusion model, validated by measurable improvements in IRSTD benchmarks. Its structured combination of convolutional local perception and direction-aware scanning advances joint feature modeling for robust target-background discrimination, representing a substantive contribution to architecture design for low-contrast object detection tasks.