Two-Way Guided Fusion Module (TGFM)
- The paper demonstrates that TGFM effectively fuses coarse high-level semantics with precise low-level details through bi-directional attention, improving detection metrics.
- It employs spatial and channel attention modules to generate context-aware masks that guide the fusion process for enhanced feature representation.
- Empirical results show significant gains in IoU and mAP for infrared small-target and multi-modal 3D object detection, outperforming baseline methods.
A Two-Way Guided Fusion Module (TGFM) is a neural network component designed for the selective and reciprocal fusion of two feature representations. It imposes bi-directional guidance between streams—typically bridging the semantic-rich but spatially coarse high-level features with the spatially precise but semantically weak low-level features—or, in the multi-modal case, reciprocally merges heterogeneous modalities such as LiDAR and camera representations. TGFM addresses deficiencies in naïve aggregation by computing attention-based masks or gates for both streams, enabling dynamic, context-aware fusion. This approach has demonstrated efficacy in tasks such as infrared small-target detection and multi-modal 3D object detection by reinforcing critical spatial and semantic cues (Zhao et al., 10 Dec 2025, Jia et al., 13 Nov 2025).
1. Motivation and Conceptual Principles
TGFM was introduced to overcome the limitations of uniform fusion (such as direct addition or concatenation), which ignores the distinct strengths of low- and high-level representations. In spatial domains, low-level features retain granular edge and texture information but are deficient in semantic abstraction; conversely, high-level features excel in semantics but lack localization precision. Similarly, in multi-modal 3D object detection, a gap in information density exists between camera and LiDAR modalities for different instance classes (e.g., distant/small vs. occluded objects).
TGFM leverages a two-way attention framework:
- Spatial guidance (low → high): Low-level detail maps generate a spatial-attention mask to guide high-level features, emphasizing spatially salient locations.
- Channel guidance (high → low): High-level (semantic) features produce a channel-attention mask to gate low-level representations, reinforcing relevant semantic channels.
In multi-modal contexts, this paradigm generalizes to bidirectional, instance-aware enhancement: features from one modality or branch selectively update the other, contingent on task-relevant cues such as instance difficulty (Zhao et al., 10 Dec 2025, Jia et al., 13 Nov 2025).
2. Architectural Formulation
The canonical TGFM comprises two parallel attention modules operating between pairs of feature maps or modality streams, followed by additive or concatenative fusion. The architecture adapts to both spatial (image-domain) and instance-level (multi-modal) scenarios.
Spatial Attention Module (SAM)
- Input: Low-level feature map $X \in \mathbb{R}^{C \times H \times W}$
- Procedure: Channel-wise average and max pooling yield $X_{\mathrm{avg}}, X_{\mathrm{max}} \in \mathbb{R}^{1 \times H \times W}$. Concatenation, a $k \times k$ convolution, and a sigmoid produce the spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$.
- Guidance: $M_s$ is broadcast-multiplied into the high-level feature map $Y$.
Channel Attention Module (CAM)
- Input: High-level feature map $Y \in \mathbb{R}^{C \times H \times W}$
- Procedure: Spatial global average and max pooling yield $Y_{\mathrm{avg}}, Y_{\mathrm{max}} \in \mathbb{R}^{C}$. A shared two-layer MLP followed by a sigmoid activation produces the channel attention vector $M_c \in \mathbb{R}^{C}$.
- Guidance: $M_c$ (reshaped to $C \times 1 \times 1$) is broadcast-multiplied into $X$.
Multi-Modal/Instance-level Fusion (e.g., DGFusion)
- Instance Feature Generation: Extracts object proposal features for both modalities.
- Difficulty-aware Instance Pair Matcher (DIPM): Matches instances into easy (overlapping) and hard (modality-specific) pairs using IoU and intra-modal similarity.
- Guided Update: Each modality’s BEV feature map is enhanced using attention-weighted sums of the matched instance features projected into the appropriate channel space.
- Final Fusion: Enhanced features are concatenated and passed to downstream heads.
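As a toy illustration of the difficulty-aware pairing step, the sketch below matches proposals from two modalities by axis-aligned BEV-box IoU: matched cross-modal pairs are treated as "easy", unmatched LiDAR-side instances as "hard". The box format, the 0.5 threshold, and the restriction to LiDAR-side hard instances are simplifying assumptions, not details from the paper.

```python
def box_iou(a, b):
    """IoU between axis-aligned boxes a, b = [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda bx: (bx[2] - bx[0]) * (bx[3] - bx[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_pairs(lidar_boxes, camera_boxes, thr=0.5):
    """Split instances into easy (cross-modal) pairs and hard (LiDAR-only) ones."""
    easy, hard = [], set(range(len(lidar_boxes)))
    for i, lb in enumerate(lidar_boxes):
        for j, cb in enumerate(camera_boxes):
            if box_iou(lb, cb) >= thr:
                easy.append((i, j))
                hard.discard(i)
    return easy, sorted(hard)

lidar = [[0, 0, 2, 2], [5, 5, 7, 7]]
camera = [[0.1, 0.1, 2.1, 2.1]]   # overlaps only the first LiDAR box
easy, hard = match_pairs(lidar, camera)
print(easy, hard)  # → [(0, 0)] [1]
```

The actual DIPM additionally uses intra-modal similarity for matching; the IoU test here stands in for that full criterion.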
3. Mathematical Formalism
Single-Modality TGFM (Image Domain)
Let $\odot$ denote element-wise multiplication (with broadcasting as needed):
- Fusion equation: $Z = X \odot M_c(Y) + Y \odot M_s(X)$
- Channel attention: $M_c(Y) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(Y)) + \mathrm{MLP}(\mathrm{MaxPool}(Y))\right)$
- Spatial attention: $M_s(X) = \sigma\left(\mathrm{Conv}_{k \times k}([\mathrm{AvgPool}_c(X); \mathrm{MaxPool}_c(X)])\right)$
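The fusion equation can be sketched in a few lines of NumPy. The weights `W1`, `W2`, and the spatial mixing coefficients `w` are random stand-ins for learned parameters, and the $k \times k$ convolution is replaced by a per-pixel (1×1-style) mix of the two pooled maps for brevity, so this only checks shapes and the composition of the two masks, not trained behavior.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
X = rng.standard_normal((C, H, W))  # low-level features
Y = rng.standard_normal((C, H, W))  # high-level features

# Channel attention M_c(Y): global avg/max pool -> shared two-layer MLP -> sigmoid
W1 = rng.standard_normal((C // r, C))   # C -> C/r (bottleneck)
W2 = rng.standard_normal((C, C // r))   # C/r -> C
def mlp(v):
    return W2 @ np.maximum(W1 @ v, 0.0)  # ReLU hidden layer
M_c = sigmoid(mlp(Y.mean(axis=(1, 2))) + mlp(Y.max(axis=(1, 2))))  # [C]

# Spatial attention M_s(X): channel-wise avg/max -> mix -> sigmoid
a_avg, a_max = X.mean(axis=0), X.max(axis=0)   # [H, W] each
w = rng.standard_normal(2)
M_s = sigmoid(w[0] * a_avg + w[1] * a_max)     # [H, W], conv stand-in

# Fusion: Z = X ⊙ M_c(Y) + Y ⊙ M_s(X), with broadcasting
Z = X * M_c[:, None, None] + Y * M_s[None, :, :]
assert Z.shape == (C, H, W)
```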
Dual-Modal TGFM (Instance-level Fusion)
Given BEV features $F_L, F_C \in \mathbb{R}^{C \times H \times W}$ and instance feature sets $\{I_i^{L}\}$, $\{I_j^{C}\}$,
- Attention-weighted update (for modality $m$, enhanced by $m'$): $\tilde{F}_m = F_m + \sum_i \alpha_i \, W I_i^{m'}$,
where $W$ is a learned projection and the attention weights $\alpha_i$ are softmaxed dot products between projected instance features and local context.
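The instance-level update above can be sketched as a small cross-attention in NumPy. The per-location query/key/value projections (`W_q`, `W_k`, `W_v`) are assumptions standing in for the paper's learned projection; the sketch only demonstrates the attention-weighted sum over matched instance features.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W_, N = 6, 3, 3, 4
F_m = rng.standard_normal((C, H, W_))  # BEV feature map of modality m
I = rng.standard_normal((N, C))        # matched instance features from modality m'

W_q = rng.standard_normal((C, C))      # query projection (learned in practice)
W_k = rng.standard_normal((C, C))      # key projection
W_v = rng.standard_normal((C, C))      # value projection

ctx = F_m.reshape(C, -1)               # [C, H*W] local context vectors
scores = (I @ W_k) @ (W_q @ ctx)       # [N, H*W] dot products per location
scores -= scores.max(axis=0, keepdims=True)          # numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over instances

V = I @ W_v                            # projected instance features, [N, C]
F_tilde = (ctx + V.T @ alpha).reshape(C, H, W_)      # attention-weighted update
assert F_tilde.shape == F_m.shape
```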
4. Implementation and Pseudocode
A typical TGFM pipeline is structured as follows, assuming PyTorch-like pseudocode:
```python
# Inputs: X = low-level features [B, C2, H, W], Y = high-level features [B, C2, H, W]

# Channel attention from the high-level stream Y (gates X)
c_avg = GlobalAvgPool2d(Y)                           # [B, C2]
c_max = GlobalMaxPool2d(Y)                           # [B, C2]
c_attn = Sigmoid(FC2(ReLU(FC1(c_avg))) +
                 FC2(ReLU(FC1(c_max))))              # [B, C2], shared two-layer MLP
X_guided = X * c_attn.view(B, C2, 1, 1)              # [B, C2, H, W]

# Spatial attention from the low-level stream X (gates Y)
a_avg = X.mean(dim=1, keepdim=True)                  # [B, 1, H, W]
a_max = X.max(dim=1, keepdim=True).values            # [B, 1, H, W]
a_cat = torch.cat([a_avg, a_max], dim=1)             # [B, 2, H, W]
s_attn = Sigmoid(Conv2d(a_cat, out_channels=1,
                        kernel_size=kernel_s,
                        padding=kernel_s // 2))      # [B, 1, H, W]
Y_guided = Y * s_attn                                # [B, C2, H, W]

# Additive fusion of the two guided streams
Z = X_guided + Y_guided
```
5. Empirical Results and Ablation Studies
TGFM integration yields substantial improvements over baseline fusion methods.
| Fusion Scheme | IoU | nIoU |
|---|---|---|
| Add (baseline) | 0.8062 | 0.7798 |
| CAM-only | 0.8114 | 0.7810 |
| SAM-only | 0.8119 | 0.7812 |
| TGFM (full) | 0.8142 | 0.7858 |
Combined spatial and channel attention delivers a +0.99% relative IoU and +0.77% relative nIoU improvement over direct addition on the NUAA-SIRST dataset (Zhao et al., 10 Dec 2025).
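The reported gains follow directly from the ablation table; a quick arithmetic check:

```python
# Sanity-check the gains of TGFM (full) over the Add baseline from the table.
iou_base, iou_full = 0.8062, 0.8142
niou_base, niou_full = 0.7798, 0.7858

iou_rel = (iou_full - iou_base) / iou_base * 100      # relative IoU gain, %
niou_rel = (niou_full - niou_base) / niou_base * 100  # relative nIoU gain, %
print(f"IoU:  +{iou_full - iou_base:.4f} absolute, +{iou_rel:.2f}% relative")
print(f"nIoU: +{niou_full - niou_base:.4f} absolute, +{niou_rel:.2f}% relative")
```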
For multi-modal detection, the dual-guided approach achieves:
- +1.0 mAP (70.2→71.2) and +0.8 NDS on nuScenes test
- Gains are most pronounced for hard instances:
- Traffic Cone: +2.1 AP
- Bike: +1.4 AP
- Objects >40 m: mAP +1.2
- Low-visibility: +0.8 mAP
- Small objects: +0.5 mAP
Small-data regimes also benefit, with up to +7.5 mAP improvement using 1.3% labeled data (Jia et al., 13 Nov 2025).
6. Integration Guidance and Practical Considerations
TGFM is agnostic to the specific backbone, requiring only that input feature maps be co-registered spatially (or at the instance level for multi-modal applications). Key implementation parameters include:
- Reduction ratio $r$: sets the hidden width of the channel-attention MLP to $C/r$; larger $r$ increases model compactness with minimal performance trade-off.
- SAM convolution: a single $k \times k$ kernel without batch normalization; sigmoid activation.
- CAM MLP: two layers with hidden dimension $C/r$ and ReLU activation.
- Efficient integration: pooling operations and MLPs introduce negligible overhead; the added parameter count is small relative to representative backbone architectures.
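To see why the overhead is negligible, the added parameters can be estimated directly. The concrete values below ($C = 256$, $r = 16$, $k = 7$) are illustrative assumptions, not figures from the paper:

```python
def tgfm_extra_params(C, r, k):
    """Rough parameter count added by one TGFM block (biases ignored in the MLP)."""
    cam_mlp = C * (C // r) + (C // r) * C   # shared two-layer MLP: C -> C/r -> C
    sam_conv = 2 * k * k + 1                # 2-channel input -> 1-channel map, with bias
    return cam_mlp + sam_conv

# Example: 256-channel features, reduction 16, 7x7 SAM kernel (assumed values)
print(tgfm_extra_params(256, 16, 7))  # → 8291, i.e. a few thousand parameters
```

Even at larger channel widths this remains orders of magnitude below typical backbone sizes.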
When feature map sizes differ, spatial alignment is achieved via bilinear upsampling or convolution with stride-2 downsampling. For hierarchical architectures, TGFM can be cascaded to construct a pyramid fusion structure.
In instance-level scenarios, TGFM is implemented after instance-level feature extraction and matching, consuming pairs generated by DIPM. The dominant runtime is attributed to instance feature generation, while TGFM itself adds on the order of milliseconds per batch (Jia et al., 13 Nov 2025).
7. Impact, Limitations, and Extensions
TGFM has demonstrated improvements in small-target detection, boundary refinement, and robust multi-modal object detection, especially in conditions characterized by small, distant, or occluded targets. Its bi-directional, task-adaptive guidance effectively leverages the complementary strengths of the fused streams.
Current limitations include sensitivity to the quality of prior instance feature extraction and matching. In multi-modal fusion, proposal head inaccuracies can attenuate TGFM’s efficacy. Further, extension to n-way fusion (more than two modalities or processing scales) is plausible by pairwise chaining or constructing explicit fusion trees.
Potential research directions include:
- Learnable scalar gating for blending directional contributions.
- Extension to additional modalities (e.g., radar, thermal).
- Dynamic adjustment of matching thresholds or attention scaling, conditioned on input characteristics.
- Optimized feature/proposal selection to reduce upstream computational cost (Zhao et al., 10 Dec 2025, Jia et al., 13 Nov 2025).