
RGAM: Reflection Guided Attention Module

Updated 28 November 2025
  • RGAM is a reflection-aware neural attention module that dynamically fuses features for robust reflection removal and glass detection.
  • It employs learned spatial- and channel-wise masks to guide feature selection, adapting U-Net skip connections for improved reconstruction.
  • Empirical results show significant boosts in performance, with notable improvements in PSNR for reflection removal and IoU for glass detection.

The Reflection Guided Attention Module (RGAM) refers to a class of neural attention modules that leverage explicit reflection-aware signals to guide feature selection for spatial reconstruction and semantic segmentation in vision tasks involving reflections, notably single image reflection removal (SIRR) and glass surface detection. RGAM achieves effective feature fusion and dynamic gating by learning to selectively emphasize or suppress particular features based on local reflection properties, improving robustness in challenging domains where linear superposition or global cues are often insufficient.

1. Core Motivation and Conceptual Foundations

RGAM addresses the inherent challenges in distinguishing between transmission (clean scene) and reflection layers in images acquired through glass or other reflective media, as well as accurately localizing transparent surfaces in unconstrained environments. In reflection removal, the observed image $I$ is typically modeled as the sum $I = T + R$ of a transmission layer $T$ and a reflection layer $R$. However, strict additivity breaks down in areas of strong reflection, requiring context-driven inpainting rather than mere subtraction. In glass surface detection, differentiating glass from surrounding materials is complicated by transparency and context dependence, with reflections acting as implicit evidence for glass presence.

Key to RGAM’s motivation is the observation that:

  • Reflection-heavy regions necessitate a mode switch from difference-based recovery to context encoding or inpainting.
  • Weak/no-reflection regions permit direct reliance on reflection-suppressed difference features.
  • For glass surface detection, regions exhibiting both high reflection and glass-like morphology are the most discriminative.

By learning spatial- and channel-wise attention masks or joint attention maps, RGAM can dynamically reconcile local validity of linear models or context-driven semantic cues across varying reflection intensities (Li et al., 2020, Yan et al., 21 Nov 2025).

2. Detailed Architectures in Reflection Removal and Glass Detection

In RAGNet (Li et al., 2020), RGAM (there termed the "RAG module") is embedded at each decoder stage within a two-stage U-Net cascade:

  • Stage 1: Predicts the reflection map $\hat{R}$ via a single-input U-Net ($G_R$).
  • Stage 2: A dual-encoder U-Net ($G_T$) receives both the observation and the previously estimated reflection.
  • At decoder level $i$, the module:

    1. Computes the difference feature $F_\mathrm{diff}^i = F_I^i - F_R^i$ from encoder outputs.
    2. Concatenates $[F_I^i; F_R^i; F_\mathrm{dec}^i]$ and passes the result through two consecutive $1 \times 1$ convolutions and a sigmoid to predict per-channel spatial masks $M^i = [M_\mathrm{diff}^i, M_\mathrm{dec}^i]$.
    3. Stacks $[F_\mathrm{diff}^i; F_\mathrm{dec}^i]$ and reweights/combines via partial convolution with $M^i$.
    4. Outputs are processed by a $3 \times 3$ convolution + ReLU stack.

This “gating” adapts skip connections and decoding based on local reflection confidence.
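The gating steps above can be sketched per pixel in plain Python. This is a minimal illustration with hypothetical weights `w_mask`, `b_mask` standing in for the learned $1 \times 1$ mask-prediction convolution; the actual module operates on full feature maps with learned kernels.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1x1(features, weights, bias):
    # A 1x1 convolution at one spatial location is a linear map over channels.
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def rag_gate(f_img, f_ref, f_dec, w_mask, b_mask):
    """Toy per-pixel sketch of the RAG gating (hypothetical parameters).

    f_img, f_ref, f_dec: C-dim feature vectors at one decoder location.
    Returns the two masked feature stacks that feed the partial convolution.
    """
    c = len(f_img)
    f_diff = [a - b for a, b in zip(f_img, f_ref)]   # difference feature
    x = f_img + f_ref + f_dec                        # concat -> 3C channels
    logits = conv1x1(x, w_mask, b_mask)              # 2C mask logits
    m = [sigmoid(v) for v in logits]                 # per-channel masks
    m_diff, m_dec = m[:c], m[c:]
    gated_diff = [mi * fi for mi, fi in zip(m_diff, f_diff)]
    gated_dec = [mi * fi for mi, fi in zip(m_dec, f_dec)]
    return gated_diff, gated_dec
```

With zero weights the sigmoid yields 0.5 everywhere, so both stacks are simply halved; learned weights instead steer the gate toward the difference feature in weak-reflection regions and toward decoder context in strong-reflection regions.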

In glass detection, RGAM fuses per-scale features from two backbone streams (flash, no-flash) and a reflection feature from the Reflection Contrast Mining Module (RCMM):

  • For each scale $i$, RGAM receives:

    • $F_\mathrm{glass}^i$: a $1 \times 1$ convolution projection of concatenated no-flash and flash features.
    • $F_\mathrm{ref}^i$: the reflection feature from RCMM.
  • Both $F_\mathrm{glass}^i$ and $F_\mathrm{ref}^i$ are reshaped and normalized per head in a multi-head architecture.
  • Two parallel cross-attention branches are computed:
    • The "top" branch uses $F_\mathrm{ref}$ as query and $F_\mathrm{glass}$ as key/value.
    • The "bottom" branch reverses the roles.
  • Attention maps $M_\mathrm{ref}$ and $M_\mathrm{glass}$ are shifted (minimum-zeroed), multiplicatively fused, and softmaxed to yield a shared attention $M_\mathrm{shared}$.
  • Final features from both sources are reweighted and summed to produce $F_\mathrm{RGAM}$, which the decoder uses to generate glass masks.

This approach emphasizes those regions expressing both glass-consistent structure and flash-induced reflections.

3. Mathematical Formalization

  • Difference Feature:

$F_\mathrm{diff}^i = F_I^i - F_R^i$

  • Learned Masking:

$X^i = \mathrm{concat}(F_I^i, F_R^i, F_\mathrm{dec}^i) \in \mathbb{R}^{3C \times H \times W}$

$M^i = \begin{bmatrix} M_\mathrm{diff}^i \\ M_\mathrm{dec}^i \end{bmatrix} = \sigma(\mathrm{Conv}_{1\times1}(X^i)) \in \mathbb{R}^{2C \times H \times W}$

  • Partial Convolution:

$F'(\mathbf{p}) = \begin{cases} \frac{1}{\sum_{q \in \mathcal{N}(p)} M(q)} \sum_{q \in \mathcal{N}(p)} W(q)\,[F \circ M](q) + b, & \sum_{q \in \mathcal{N}(p)} M(q) > 0 \\ 0, & \text{otherwise} \end{cases}$
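A minimal 1-D sketch of this renormalized masked convolution (toy version with valid padding; the actual module uses 2-D partial convolutions with learned kernels):

```python
def partial_conv1d(f, mask, w, b):
    """1-D partial convolution sketch: each output is renormalized by the
    number of valid (mask=1) inputs under the kernel; windows with no valid
    inputs output 0, matching the piecewise definition above."""
    k = len(w)
    out = []
    for p in range(len(f) - k + 1):
        msum = sum(mask[p + q] for q in range(k))
        if msum > 0:
            val = sum(w[q] * f[p + q] * mask[p + q] for q in range(k))
            out.append(val / msum + b)
        else:
            out.append(0.0)
    return out
```

For example, a box kernel `[1, 1, 1]` over `f = [1, 2, 3, 4]` with `mask = [1, 1, 0, 1]` averages only the unmasked samples in each window, so masked-out positions neither contribute signal nor dilute the normalization.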

  • Mask Loss:

$\mathcal{L}_\mathrm{mask}^\mathrm{diff} = \sum_{i=1}^4 \| M_\mathrm{diff}^i [R_{gt} > \varphi] \|_1$

$\mathcal{L}_\mathrm{mask}^\mathrm{reg} = \sum_{i=1}^4 \| M^i [R_{gt} < \xi] - 1 \|_1$

$\mathcal{L}_\mathrm{mask} = \mathcal{L}_\mathrm{mask}^\mathrm{diff} + \mathcal{L}_\mathrm{mask}^\mathrm{reg}$
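The two mask penalties can be illustrated on flattened per-pixel values. This is a toy single-scale version: the first term pushes the difference mask toward 0 where ground-truth reflection exceeds $\varphi$, the second pushes masks toward 1 where it falls below $\xi$.

```python
def mask_losses(m_diff, m_full, r_gt, phi, xi):
    """Toy per-pixel mask losses at one scale (thresholds phi, xi assumed).

    m_diff: difference-mask values; m_full: full-mask values;
    r_gt: ground-truth reflection intensity per pixel.
    """
    # L1 penalty on the difference mask in strong-reflection pixels.
    l_diff = sum(abs(m) for m, r in zip(m_diff, r_gt) if r > phi)
    # L1 penalty pulling masks toward 1 in weak-reflection pixels.
    l_reg = sum(abs(m - 1.0) for m, r in zip(m_full, r_gt) if r < xi)
    return l_diff, l_reg
```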

  • Feature Construction:

$F_\mathrm{glass} = \mathrm{Conv}_{1\times1}\bigl[ F_\mathrm{no}, F_\mathrm{fl} \bigr]$

  • Cross-Attention (two branches):

$\begin{aligned} Q_\mathrm{ref} &= W_Q^t(\mathrm{LN}(\mathrm{reshape}(F_\mathrm{ref}))) \\ K_\mathrm{glass} &= W_K^t(\mathrm{LN}(\mathrm{reshape}(F_\mathrm{glass}))) \\ M_\mathrm{ref} &= Q_\mathrm{ref} (K_\mathrm{glass})^T \end{aligned}$

Symmetrically, $Q_\mathrm{glass}$, $K_\mathrm{ref}$, and $M_\mathrm{glass}$ are formed in the alternate branch.

  • Fusing:

$M_\mathrm{shared} = \mathrm{softmax} \Big( (M_\mathrm{ref} - \min M_\mathrm{ref}) \odot (M_\mathrm{glass} - \min M_\mathrm{glass}) \Big)$

$F_\mathrm{RGAM} = \mathrm{reshape}(M_\mathrm{shared} \otimes V_\mathrm{glass}) + \mathrm{reshape}(M_\mathrm{shared} \otimes V_\mathrm{ref})$
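The shift-fuse-softmax construction can be sketched with small matrices. This is a pure-Python toy on dense 2-D maps; the actual RGAM applies it per attention head over reshaped feature tokens.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def shared_attention(m_ref, m_glass, v_glass, v_ref):
    """Shift both attention maps to be non-negative, fuse them
    multiplicatively, softmax row-wise, then reweight and sum both
    value streams."""
    lo_r = min(min(r) for r in m_ref)
    lo_g = min(min(r) for r in m_glass)
    fused = [[(a - lo_r) * (b - lo_g) for a, b in zip(ra, rb)]
             for ra, rb in zip(m_ref, m_glass)]
    m_shared = [softmax(row) for row in fused]
    out_g = matmul(m_shared, v_glass)
    out_r = matmul(m_shared, v_ref)
    return [[x + y for x, y in zip(rg, rr)]
            for rg, rr in zip(out_g, out_r)]
```

The multiplicative fusion means a position receives high shared attention only if both branches agree, which is exactly the "reflection and glass indicators co-occur" criterion described above.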

4. Implementation Specifics and Parameterization

| Model | Feature Preparation | Attention/Gating | Downstream Usage |
|---|---|---|---|
| RAGNet (Li et al., 2020) | Dual encoders; encoder difference | Channel/spatial masks + PConv | Decoder block gating |
| NFGlassNet (Yan et al., 21 Nov 2025) | Backbone streams + RCMM | Dual-head cross-attention fusion | Scale-wise multi-stream fusion |
  • Reflection removal RGAM uses 1×1 convs for mask prediction, 3×3 partial convs, and per-channel mask splitting; no batch norm is present in the attention submodule.
  • Glass detection RGAM uses Kaiming/Xavier init for projections, multi-head channels, LayerNorm, no explicit dropout in the module, and fuses feature maps at multiple scales.

Both designs prioritize local adaptive gating to modulate information flow depending on reflection presence or glass context.

5. Training Objectives and Losses

In reflection removal (Li et al., 2020), RGAM has direct loss supervision on its mask outputs with dedicated terms ($\mathcal{L}_\mathrm{mask}$), penalizing incorrect mask activations in strong or weak reflection regions. The full objective combines reconstruction, perceptual, exclusion, adversarial, and mask-specific losses:

$\mathcal{L} = \lambda_1 \mathcal{L}_\mathrm{rec} + \lambda_2 \mathcal{L}_\mathrm{percep} + \lambda_3 \mathcal{L}_\mathrm{excl} + \lambda_4 \mathcal{L}_\mathrm{adv} + \lambda_5 \mathcal{L}_\mathrm{mask}$

with weights $\lambda_1 = \lambda_2 = \lambda_5 = 1$, $\lambda_3 = 0.2$, $\lambda_4 = 0.01$.
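With the reported weights, the overall objective is a straightforward weighted sum, e.g.:

```python
def total_loss(l_rec, l_percep, l_excl, l_adv, l_mask,
               weights=(1.0, 1.0, 0.2, 0.01, 1.0)):
    """Weighted RAGNet-style objective using the reported lambda values
    (lambda_1 = lambda_2 = lambda_5 = 1, lambda_3 = 0.2, lambda_4 = 0.01)."""
    terms = (l_rec, l_percep, l_excl, l_adv, l_mask)
    return sum(w * t for w, t in zip(weights, terms))
```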

In glass detection (Yan et al., 21 Nov 2025), RGAM does not receive direct supervision; it is optimized solely via the end-to-end task losses, which include IoU and binary cross-entropy for glass mask prediction and $(1 - \mathrm{SSIM}) + L_1$ for reflection estimation.

6. Empirical Ablations and Observed Effects

Ablation studies in both domains confirm RGAM’s critical role:

  • In SIRR (Li et al., 2020), removing the difference feature or using naïve skip connections induces marked PSNR drops (e.g., $-1.96$ dB on the Real20 set, $-0.95$ dB on SIR$^2$ Wild). Disabling learned masks or using single-channel masks reduces restoration quality by $\sim 1$ dB.
  • For glass detection (Yan et al., 21 Nov 2025), ablations show RGAM boosts IoU by $+1.32$ points; omitting shared attention reduces IoU by $0.69$, and replacing the cross-stream querying ("alternate-Q") loses $0.9$–$1.3$ IoU points. Removing the attention shift step degrades IoU by $\sim 0.8$.

Qualitative outputs display sharper, artifact-free transmission predictions and precise glass mask localization in regions where both reflection and glass indicators co-occur.

7. Interpretation and Context within the Field

RGAM modules generalize the use of explicit physical reasoning—reflection detection or suppression—via learnable dynamic attention mechanisms at both feature and spatial scales. The studied variants demonstrate that channel-wise and spatially varying gating, informed by learned or mined reflection signals, outperforms global or hard-coded approaches. These results substantiate that dynamically adaptive fusion, as enabled by RGAM, respects nonuniform physical priors (e.g., breakdown of linear superposition) and enhances performance in scenarios with ambiguous or spatially varying cues.

While the specific instantiations in reflection removal and glass detection differ architecturally, RGAM consistently outperforms naïve fusion or attentionless baselines, supporting its generality as a reflection-aware feature fusion paradigm (Li et al., 2020, Yan et al., 21 Nov 2025).
