Two-Way Guided Fusion Module (TGFM)
- The paper demonstrates that TGFM effectively fuses coarse high-level semantics with precise low-level details through bi-directional attention, improving detection metrics.
- It employs spatial and channel attention modules to generate context-aware masks that guide the fusion process for enhanced feature representation.
- Empirical results show significant gains in IoU and mAP for infrared small-target and multi-modal 3D object detection, outperforming baseline methods.
A Two-Way Guided Fusion Module (TGFM) is a neural network component designed for the selective and reciprocal fusion of two feature representations. It imposes bi-directional guidance between streams—typically bridging the semantic-rich but spatially coarse high-level features with the spatially precise but semantically weak low-level features—or, in the multi-modal case, reciprocally merges heterogeneous modalities such as LiDAR and camera representations. TGFM addresses deficiencies in naïve aggregation by computing attention-based masks or gates for both streams, enabling dynamic, context-aware fusion. This approach has demonstrated efficacy in tasks such as infrared small-target detection and multi-modal 3D object detection by reinforcing critical spatial and semantic cues (Zhao et al., 10 Dec 2025, Jia et al., 13 Nov 2025).
1. Motivation and Conceptual Principles
TGFM was introduced to overcome the limitations of uniform fusion (such as direct addition or concatenation), which ignores the distinct strengths of low- and high-level representations. In spatial domains, low-level features retain granular edge and texture information but are deficient in semantic abstraction; conversely, high-level features excel in semantics but lack localization precision. Similarly, in multi-modal 3D object detection, a gap in information density exists between camera and LiDAR modalities for different instance classes (e.g., distant/small vs. occluded objects).
TGFM leverages a two-way attention framework:
- Spatial guidance (low → high): Low-level detail maps generate a spatial-attention mask to guide high-level features, emphasizing spatially salient locations.
- Channel guidance (high → low): High-level (semantic) features produce a channel-attention mask to gate low-level representations, reinforcing relevant semantic channels.
In multi-modal contexts, this paradigm generalizes to bidirectional, instance-aware enhancement: features from one modality or branch selectively update the other, contingent on task-relevant cues such as instance difficulty (Zhao et al., 10 Dec 2025, Jia et al., 13 Nov 2025).
2. Architectural Formulation
The canonical TGFM comprises two parallel attention modules operating between pairs of feature maps or modality streams, followed by additive or concatenative fusion. The architecture adapts to both spatial (image-domain) and instance-level (multi-modal) scenarios.
Spatial Attention Module (SAM)
- Input: Low-level feature map $X \in \mathbb{R}^{C \times H \times W}$
- Procedure: Channel-wise average and max pooling yield $X_{\mathrm{avg}}, X_{\mathrm{max}} \in \mathbb{R}^{1 \times H \times W}$. Concatenation, a $k \times k$ convolution, and a sigmoid produce the spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$.
- Guidance: $M_s$ is broadcast-multiplied into the high-level feature map $Y$.
Channel Attention Module (CAM)
- Input: High-level feature map $Y \in \mathbb{R}^{C \times H \times W}$
- Procedure: Spatial global average and max pooling yield $Y_{\mathrm{avg}}, Y_{\mathrm{max}} \in \mathbb{R}^{C}$. A shared two-layer MLP followed by a sigmoid activation produces the channel attention vector $M_c \in \mathbb{R}^{C}$.
- Guidance: $M_c$ (reshaped to $C \times 1 \times 1$) is broadcast-multiplied into $X$.
Multi-Modal/Instance-level Fusion (e.g., DGFusion)
- Instance Feature Generation: Extracts object proposal features for both modalities.
- Difficulty-aware Instance Pair Matcher (DIPM): Matches instances into easy (overlapping) and hard (modality-specific) pairs using IoU and intra-modal similarity.
- Guided Update: Each modality’s BEV feature map is enhanced using attention-weighted sums of the matched instance features projected into the appropriate channel space.
- Final Fusion: Enhanced features are concatenated and passed to downstream heads.
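As a toy illustration of the difficulty-aware pairing step, the sketch below matches proposals from two modalities by axis-aligned BEV-box IoU: matched cross-modal pairs are treated as "easy", unmatched LiDAR-side instances as "hard". The box format, the 0.5 threshold, and the restriction to LiDAR-side hard instances are simplifying assumptions, not details from the paper.

```python
def box_iou(a, b):
    """IoU between axis-aligned boxes a, b = [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda bx: (bx[2] - bx[0]) * (bx[3] - bx[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_pairs(lidar_boxes, camera_boxes, thr=0.5):
    """Split instances into easy (cross-modal) pairs and hard (LiDAR-only) ones."""
    easy, hard = [], set(range(len(lidar_boxes)))
    for i, lb in enumerate(lidar_boxes):
        for j, cb in enumerate(camera_boxes):
            if box_iou(lb, cb) >= thr:
                easy.append((i, j))
                hard.discard(i)
    return easy, sorted(hard)

lidar = [[0, 0, 2, 2], [5, 5, 7, 7]]
camera = [[0.1, 0.1, 2.1, 2.1]]   # overlaps only the first LiDAR box
easy, hard = match_pairs(lidar, camera)
print(easy, hard)  # → [(0, 0)] [1]
```

The actual DIPM additionally uses intra-modal similarity for matching; the IoU test here stands in for that full criterion.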
3. Mathematical Formalism
Single-Modality TGFM (Image Domain)
Let $\odot$ denote element-wise multiplication (with broadcasting as needed):
- Fusion equation: $Z = X \odot M_c(Y) + Y \odot M_s(X)$
- Channel attention: $M_c(Y) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(Y)) + \mathrm{MLP}(\mathrm{MaxPool}(Y))\right)$
- Spatial attention: $M_s(X) = \sigma\left(\mathrm{Conv}_{k \times k}([\mathrm{AvgPool}_c(X); \mathrm{MaxPool}_c(X)])\right)$
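The fusion equation can be sketched in a few lines of NumPy. The weights `W1`, `W2`, and the spatial mixing coefficients `w` are random stand-ins for learned parameters, and the $k \times k$ convolution is replaced by a per-pixel (1×1-style) mix of the two pooled maps for brevity, so this only checks shapes and the composition of the two masks, not trained behavior.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
X = rng.standard_normal((C, H, W))  # low-level features
Y = rng.standard_normal((C, H, W))  # high-level features

# Channel attention M_c(Y): global avg/max pool -> shared two-layer MLP -> sigmoid
W1 = rng.standard_normal((C // r, C))   # C -> C/r (bottleneck)
W2 = rng.standard_normal((C, C // r))   # C/r -> C
def mlp(v):
    return W2 @ np.maximum(W1 @ v, 0.0)  # ReLU hidden layer
M_c = sigmoid(mlp(Y.mean(axis=(1, 2))) + mlp(Y.max(axis=(1, 2))))  # [C]

# Spatial attention M_s(X): channel-wise avg/max -> mix -> sigmoid
a_avg, a_max = X.mean(axis=0), X.max(axis=0)   # [H, W] each
w = rng.standard_normal(2)
M_s = sigmoid(w[0] * a_avg + w[1] * a_max)     # [H, W], conv stand-in

# Fusion: Z = X ⊙ M_c(Y) + Y ⊙ M_s(X), with broadcasting
Z = X * M_c[:, None, None] + Y * M_s[None, :, :]
assert Z.shape == (C, H, W)
```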
Dual-Modal TGFM (Instance-level Fusion)
Given BEV features $F_L, F_C \in \mathbb{R}^{C \times H \times W}$ and instance feature sets $\{I_i^{L}\}$, $\{I_j^{C}\}$,
- Attention-weighted update (for modality $m$, enhanced by $m'$): $\tilde{F}_m = F_m + \sum_i \alpha_i \, W I_i^{m'}$,
where $W$ is a learned projection and the attention weights $\alpha_i$ are softmaxed dot products between projected instance features and local context.
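The instance-level update above can be sketched as a small cross-attention in NumPy. The per-location query/key/value projections (`W_q`, `W_k`, `W_v`) are assumptions standing in for the paper's learned projection; the sketch only demonstrates the attention-weighted sum over matched instance features.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W_, N = 6, 3, 3, 4
F_m = rng.standard_normal((C, H, W_))  # BEV feature map of modality m
I = rng.standard_normal((N, C))        # matched instance features from modality m'

W_q = rng.standard_normal((C, C))      # query projection (learned in practice)
W_k = rng.standard_normal((C, C))      # key projection
W_v = rng.standard_normal((C, C))      # value projection

ctx = F_m.reshape(C, -1)               # [C, H*W] local context vectors
scores = (I @ W_k) @ (W_q @ ctx)       # [N, H*W] dot products per location
scores -= scores.max(axis=0, keepdims=True)          # numerical stability
alpha = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over instances

V = I @ W_v                            # projected instance features, [N, C]
F_tilde = (ctx + V.T @ alpha).reshape(C, H, W_)      # attention-weighted update
assert F_tilde.shape == F_m.shape
```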
4. Implementation and Pseudocode
A typical TGFM pipeline is structured as follows, assuming PyTorch-like pseudocode:
```python
# Inputs: X = low-level features [B, C2, H, W], Y = high-level features [B, C2, H, W]

# Channel attention from the high-level stream Y (gates X)
c_avg = GlobalAvgPool2d(Y)                           # [B, C2]
c_max = GlobalMaxPool2d(Y)                           # [B, C2]
c_attn = Sigmoid(FC2(ReLU(FC1(c_avg))) +
                 FC2(ReLU(FC1(c_max))))              # [B, C2], shared two-layer MLP
X_guided = X * c_attn.view(B, C2, 1, 1)              # [B, C2, H, W]

# Spatial attention from the low-level stream X (gates Y)
a_avg = X.mean(dim=1, keepdim=True)                  # [B, 1, H, W]
a_max = X.max(dim=1, keepdim=True).values            # [B, 1, H, W]
a_cat = torch.cat([a_avg, a_max], dim=1)             # [B, 2, H, W]
s_attn = Sigmoid(Conv2d(a_cat, out_channels=1,
                        kernel_size=kernel_s,
                        padding=kernel_s // 2))      # [B, 1, H, W]
Y_guided = Y * s_attn                                # [B, C2, H, W]

# Additive fusion of the two guided streams
Z = X_guided + Y_guided
```
5. Empirical Results and Ablation Studies
TGFM integration yields substantial improvements over baseline fusion methods.
| Fusion Scheme | IoU | nIoU |
|---|---|---|
| Add (baseline) | 0.8062 | 0.7798 |
| CAM-only | 0.8114 | 0.7810 |
| SAM-only | 0.8119 | 0.7812 |
| TGFM (full) | 0.8142 | 0.7858 |
Combined spatial and channel attention delivers a +0.99% relative IoU and +0.77% relative nIoU improvement over direct addition on the NUAA-SIRST dataset (Zhao et al., 10 Dec 2025).
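The reported gains follow directly from the ablation table; a quick arithmetic check:

```python
# Sanity-check the gains of TGFM (full) over the Add baseline from the table.
iou_base, iou_full = 0.8062, 0.8142
niou_base, niou_full = 0.7798, 0.7858

iou_rel = (iou_full - iou_base) / iou_base * 100      # relative IoU gain, %
niou_rel = (niou_full - niou_base) / niou_base * 100  # relative nIoU gain, %
print(f"IoU:  +{iou_full - iou_base:.4f} absolute, +{iou_rel:.2f}% relative")
print(f"nIoU: +{niou_full - niou_base:.4f} absolute, +{niou_rel:.2f}% relative")
```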
For multi-modal detection, the dual-guided approach achieves:
- +1.0 mAP (70.2→71.2) and +0.8 NDS on nuScenes test
- Gains are most pronounced for hard instances:
- Traffic Cone: +2.1 AP
- Bike: +1.4 AP
- Objects >40 m: mAP +1.2
- Low-visibility: +0.8 mAP
- Small objects: +0.5 mAP
Small-data regimes also benefit, with up to +7.5 mAP improvement using 1.3% labeled data (Jia et al., 13 Nov 2025).
6. Integration Guidance and Practical Considerations
TGFM is agnostic to the specific backbone, requiring only that input feature maps be co-registered spatially (or at the instance level for multi-modal applications). Key implementation parameters include:
- Reduction ratio $r$: sets the hidden width of the channel-attention MLP to $C/r$; larger $r$ increases model compactness with minimal performance trade-off.
- SAM convolution: a single $k \times k$ kernel without batch normalization; sigmoid activation.
- CAM MLP: two layers with hidden dimension $C/r$ and ReLU activation.
- Efficient integration: pooling operations and MLPs introduce negligible overhead; the added parameter count is small relative to representative backbone architectures.
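To see why the overhead is negligible, the added parameters can be estimated directly. The concrete values below ($C = 256$, $r = 16$, $k = 7$) are illustrative assumptions, not figures from the paper:

```python
def tgfm_extra_params(C, r, k):
    """Rough parameter count added by one TGFM block (biases ignored in the MLP)."""
    cam_mlp = C * (C // r) + (C // r) * C   # shared two-layer MLP: C -> C/r -> C
    sam_conv = 2 * k * k + 1                # 2-channel input -> 1-channel map, with bias
    return cam_mlp + sam_conv

# Example: 256-channel features, reduction 16, 7x7 SAM kernel (assumed values)
print(tgfm_extra_params(256, 16, 7))  # → 8291, i.e. a few thousand parameters
```

Even at larger channel widths this remains orders of magnitude below typical backbone sizes.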
When feature map sizes differ, spatial alignment is achieved via bilinear upsampling or convolution with stride-2 downsampling. For hierarchical architectures, TGFM can be cascaded to construct a pyramid fusion structure.
In instance-level scenarios, TGFM is implemented after instance-level feature extraction and matching, consuming pairs generated by DIPM. The dominant runtime is attributed to instance feature generation, while TGFM itself adds on the order of milliseconds per batch (Jia et al., 13 Nov 2025).
7. Impact, Limitations, and Extensions
TGFM has demonstrated improvements in small-target detection, boundary refinement, and robust multi-modal object detection, especially in conditions characterized by small, distant, or occluded targets. Its bi-directional, task-adaptive guidance effectively leverages the complementary strengths of the fused streams.
Current limitations include sensitivity to the quality of prior instance feature extraction and matching. In multi-modal fusion, proposal head inaccuracies can attenuate TGFM’s efficacy. Further, extension to n-way fusion (more than two modalities or processing scales) is plausible by pairwise chaining or constructing explicit fusion trees.
Potential research directions include:
- Learnable scalar gating for blending directional contributions.
- Extension to additional modalities (e.g., radar, thermal).
- Dynamic adjustment of matching thresholds or attention scaling, conditioned on input characteristics.
- Optimized feature/proposal selection to reduce upstream computational cost (Zhao et al., 10 Dec 2025, Jia et al., 13 Nov 2025).