
Dual-Stream Saliency Enhancement Block

Updated 30 January 2026
  • The paper demonstrates that integrating local convolutional processing with direction-aware global context significantly improves detection precision and reduces false alarms.
  • The DSE block employs a dual-stream architecture to capture fine spatial details and long-range contextual information effectively.
  • Empirical results on the IRSTD-1K dataset confirm enhancements in IoU, detection probability, and reduced false alarm rates.

The Dual-stream Saliency Enhancement (DSE) block is a neural module introduced in the DCCS-Det detector for infrared small target detection (IRSTD), designed to address challenges in modeling both local fine details and long-range directional context in complex scenes. DSE achieves this by integrating two parallel computational streams—Local Perception and Direction-Aware Context—and fusing their outputs with channel-wise attention and residual connections. Its empirical efficacy is validated via ablation studies, which show measurable improvements in both precision and target-background discrimination (Li et al., 23 Jan 2026).

1. Architectural Description

The DSE block processes an input tensor $X \in \mathbb{R}^{H \times W \times C}$ by first splitting it along the channel dimension into left and right sub-tensors, $X^{(L)}$ and $X^{(R)}$, each of shape $\mathbb{R}^{H \times W \times (C/2)}$. The left branch executes a local convolutional sequence, while the right branch synthesizes direction-aware global context using a specialized scanning and modeling procedure (SS2D). Both branches apply channel-wise attention mechanisms, and their outputs are concatenated, shuffled, and merged with the input via a residual connection.
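The data flow above can be sketched at a high level, with the two streams abstracted as callables. This is a shape-level illustration of the described split/fuse structure, not the official DCCS-Det implementation; all names are ours.

```python
# Shape-level sketch of the DSE block's forward pass: split channels, run two
# streams, gate each with channel attention, concatenate, shuffle, add residual.
import torch

def dse_block(x, local_stream, context_stream, att_l, att_r, shuffle):
    xl, xr = x.chunk(2, dim=1)          # split channels into left/right halves
    yl = local_stream(xl)               # Local Perception stream
    yr = context_stream(xr)             # Direction-Aware Context stream
    yc = torch.cat([yl * att_l(yl), yr * att_r(yr)], dim=1)
    return shuffle(yc) + x              # residual connection preserves shape

# Stubs: identity streams/shuffle, all-ones attention gate (illustrative only).
identity = lambda t: t
ones_gate = lambda t: torch.ones_like(t[..., :1, :1])

x = torch.randn(1, 8, 4, 4)
out = dse_block(x, identity, identity, ones_gate, ones_gate, identity)
assert out.shape == x.shape             # output matches the input shape
```

With identity stubs the block reduces to $X + X$, which makes the residual structure easy to verify before substituting real stream modules.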

2. Local Perception Stream

The Local Perception stream receives $X^{(L)}$ and applies two consecutive $3 \times 3$ convolutions with ReLU activation, followed by a $1 \times 1$ point-wise convolution:

  • $Y_0^{(L)} = \mathrm{ReLU}(\mathrm{Conv}_{3\times3}(X^{(L)}))$
  • $Y_1^{(L)} = \mathrm{ReLU}(\mathrm{Conv}_{3\times3}(Y_0^{(L)}))$
  • $Y^{(L)} = \mathrm{PWConv}(Y_1^{(L)})$

The output $Y^{(L)}$ preserves spatial resolution and channel dimension $(H \times W \times C/2)$. This stream emphasizes local spatial details and fine target structure, serving as the "local saliency map" $S_\text{loc}$.
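The three equations above map directly onto a small module. A minimal sketch in PyTorch, assuming $C/2 = 32$ channels and same-padding so spatial size is preserved (layer names are ours, not from the paper's code):

```python
# Illustrative Local Perception stream: two 3x3 convs with ReLU, then a 1x1
# point-wise conv, operating on the left half of the channel-split input.
import torch
import torch.nn as nn

class LocalPerception(nn.Module):
    def __init__(self, half_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(half_channels, half_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(half_channels, half_channels, 3, padding=1)
        self.pwconv = nn.Conv2d(half_channels, half_channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x_left: torch.Tensor) -> torch.Tensor:
        y = self.relu(self.conv1(x_left))   # Y0 = ReLU(Conv3x3(X^L))
        y = self.relu(self.conv2(y))        # Y1 = ReLU(Conv3x3(Y0))
        return self.pwconv(y)               # S_loc = PWConv(Y1)

x_left = torch.randn(1, 32, 64, 64)         # X^(L) with C/2 = 32 channels
s_loc = LocalPerception(32)(x_left)
assert s_loc.shape == x_left.shape          # resolution and channels preserved
```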

3. Direction-Aware Context Stream

The Direction-Aware Context stream operates on $X^{(R)}$ as follows:

  • LayerNorm normalization: $Y_0^{(R)} = \mathrm{LN}(X^{(R)})$
  • Linear projection (channel expansion) precedes a depthwise $3 \times 3$ convolution with SiLU:

$Z = \mathrm{SiLU}(\mathrm{DWConv}_{3\times3}(\mathrm{Linear}(Y_0^{(R)})))$

  • SS2D scans $Z$ in four primary directions (vertical-forward, horizontal-forward, vertical-backward, horizontal-backward), unfolding $Z$ along each direction into a 1D sequence $S_d$ for $d \in \{\text{V-FWD}, \text{H-FWD}, \text{V-BWD}, \text{H-BWD}\}$.
  • Each sequence $S_d$ is processed by an SSM block (S6), refolded to 2D as $S'_d$, and aggregated: $Y_1^{(R)} = \sum_{d=1}^{4} S'_d$.
  • Second LayerNorm: $Y_2^{(R)} = \mathrm{LN}(Y_1^{(R)})$
  • Parallel gating path: $Y_3^{(R)} = \mathrm{SiLU}(\mathrm{Linear}(Y_0^{(R)}))$
  • Element-wise modulation and final linear projection: $Y^{(R)} = \mathrm{Linear}(Y_2^{(R)} \odot Y_3^{(R)})$

This stream synthesizes global, direction-sensitive context and handles long-range spatial dependencies. The aggregated direction-context map is denoted $C_\text{agg}$.
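The four-direction unfold/refold/sum step of SS2D can be sketched as follows. The real block applies an S6 state-space model to each sequence; here a stand-in sequence operator (identity by default) keeps the example self-contained, and the function name is ours:

```python
# Hedged sketch of SS2D's directional scanning: unfold the feature map into 1D
# sequences along V-FWD / H-FWD / V-BWD / H-BWD, apply a sequence operator
# (identity stand-in for the S6 SSM), refold each result to 2D, and sum.
import torch

def ss2d_scan_aggregate(z: torch.Tensor, seq_op=lambda s: s) -> torch.Tensor:
    """z: (B, C, H, W) -> aggregated direction-context map, same shape."""
    B, C, H, W = z.shape
    out = torch.zeros_like(z)
    # H-FWD / H-BWD: row-major sequence and its reverse
    s = z.flatten(2)                                   # (B, C, H*W)
    out += seq_op(s).view(B, C, H, W)
    out += seq_op(s.flip(-1)).flip(-1).view(B, C, H, W)
    # V-FWD / V-BWD: column-major sequence and its reverse
    st = z.transpose(2, 3).flatten(2)                  # scan down columns
    out += seq_op(st).view(B, C, W, H).transpose(2, 3)
    out += seq_op(st.flip(-1)).flip(-1).view(B, C, W, H).transpose(2, 3)
    return out

z = torch.randn(2, 16, 8, 8)
y1 = ss2d_scan_aggregate(z)        # with the identity stand-in, Y1 = 4 * Z
assert torch.allclose(y1, 4 * z)
```

With the identity stand-in the aggregate is exactly $4Z$, a useful sanity check that each unfold/refold pair is inverse before plugging in a real S6 operator.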

4. Attention Mechanisms and Fusion

Each branch computes channel-wise attention weights $A^{(i)}$ ($i \in \{L, R\}$) via global average pooling and global max pooling ($G_\text{avg}$, $G_\text{max}$), followed by two $1 \times 1$ convolutions with an intermediate ReLU and a final sigmoid activation:

  • $A^{(i)} = \sigma(\mathrm{Conv}_{1\times1}^{(2)}(\mathrm{ReLU}(\mathrm{Conv}_{1\times1}^{(1)}(G_\text{avg} + G_\text{max}))))$

Branch outputs are modulated by their attention vectors and concatenated (channel-wise):

  • $Y_c = \mathrm{Concat}(Y^{(L)} \odot A^{(L)},\; Y^{(R)} \odot A^{(R)})$

Channel shuffle is applied for mixing feature information, followed by a residual sum with the original input:

  • $Y_\text{out} = \mathrm{Shuffle}(Y_c) + X$

The final output $Y_\text{out}$ matches the input tensor's shape, enabling direct integration into backbone architectures without spatial or channel misalignment.
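The attention-and-fusion stage above can be reconstructed as a short module. This is our reading of the Section 4 equations, not the official code; the reduction ratio of 4 in the attention bottleneck is an assumption:

```python
# Sketch of DSE fusion: channel attention per branch from pooled statistics,
# gated concatenation, channel shuffle across the two groups, residual add.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):  # reduction assumed
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, 1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        g = F.adaptive_avg_pool2d(y, 1) + F.adaptive_max_pool2d(y, 1)
        return torch.sigmoid(self.fc2(F.relu(self.fc1(g))))  # A = sigma(...)

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    B, C, H, W = x.shape
    return x.view(B, groups, C // groups, H, W).transpose(1, 2).reshape(B, C, H, W)

def fuse(x, y_l, y_r, att_l, att_r):
    y_c = torch.cat([y_l * att_l(y_l), y_r * att_r(y_r)], dim=1)
    return channel_shuffle(y_c) + x          # Y_out = Shuffle(Y_c) + X

x = torch.randn(1, 64, 32, 32)
y_l, y_r = torch.randn(1, 32, 32, 32), torch.randn(1, 32, 32, 32)
out = fuse(x, y_l, y_r, ChannelAttention(32), ChannelAttention(32))
assert out.shape == x.shape                  # shape-preserving, as stated
```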

5. Mathematical Formulation

Summarizing the main transformations in mathematical notation for the DSE block:

  • Local stream: $S_\text{loc} = \mathrm{PWConv}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(\mathrm{ReLU}(\mathrm{Conv}_{3\times3}(X^{(L)})))))$
  • Directional context: for $d \in D = \{\text{V-FWD}, \text{H-FWD}, \text{V-BWD}, \text{H-BWD}\}$,
    • $S_d = \mathrm{expand}_d(Z)$
    • $C_d = \mathrm{S6}(S_d)$
    • $C_\text{agg} = \sum_{d=1}^{4} C_d$
  • Final fusion: $S_\text{out} = \mathrm{Shuffle}([S_\text{loc} \odot A^{(L)},\, C_\text{agg} \odot A^{(R)}]) + X$

All convolutions in the block, unless specified otherwise, are followed by normalization (LayerNorm/BatchNorm) and SiLU or ReLU activations.

6. Hyperparameters and Computational Properties

The DSE block allocates $C/2$ channels per branch, uses $3 \times 3$ kernels for both standard and depthwise convolutions, and $1 \times 1$ kernels for projections and attention computations. SS2D applies four directional scans with stride 1 and dilation 1; the S6 block state dimension is typically 64 (per the DCCS-Det code). Each DSE block introduces approximately 80,000 parameters and 0.12 GFLOPs at $256 \times 256$ resolution; in DCCS-Det, four such blocks collectively account for $\sim$0.32M parameters and 0.48 GFLOPs, a moderate incremental cost relative to the base network's complexity.

Component             Parameters (K)   GFLOPs (256×256)
DSE Block (single)    ~80              ~0.12
DSE Block ×4 (full)   ~320             ~0.48

7. Empirical Performance and Evidence

Ablation studies on the IRSTD-1K dataset quantitatively establish the enhancement brought by the DSE block. Integrating DSE (without LaSEA) into the DCCS-Det backbone delivers:

  • IoU: 68.05% (vs. 67.16% baseline; $\Delta = +0.89$)
  • Detection probability $P_d$: 95.24% (vs. 93.88%)
  • False alarm rate $F_a$: $13.21 \times 10^{-6}$ (improved from $15.03 \times 10^{-6}$)

This improvement is attributed to joint modeling of local features (via the convolutional stream) and long-range, directionally contextual features (via SS2D) (Li et al., 23 Jan 2026). The design thereby addresses spatial discriminability and semantic retention, key challenges in IR small target detection. A plausible implication is applicability to other domains requiring fine-grained local/global feature synthesis under weak signal conditions.

8. Context, Scope, and Limitations

The DSE block is specialized for infrared small target detection, where local-background discrimination and context-aware aggregation are crucial. Its dual-stream construction and SS2D directional mechanism distinguish it from purely convolutional or transformer-based attention modules, mitigating the semantic dilution and redundancy typical of deep CNN backbones. While DSE's incremental FLOPs and parameter overhead are moderate, scalability and generalization to larger networks or different input modalities remain open to further empirical assessment. The modular nature of the block facilitates integration into alternative architectures, and the official code release supports reproducibility (Li et al., 23 Jan 2026).

9. Summary of Contributions and Research Significance

The Dual-stream Saliency Enhancement block provides an explicit, computationally efficient dual-branch fusion model, validated by measurable improvements in IRSTD benchmarks. Its structured combination of convolutional local perception and direction-aware scanning advances joint feature modeling for robust target-background discrimination, representing a substantive contribution to architecture design for low-contrast object detection tasks.
