RGAM: Reflection Guided Attention Module
- RGAM is a reflection-aware neural attention module that dynamically fuses features for robust reflection removal and glass detection.
- It employs learned spatial- and channel-wise masks to guide feature selection, adapting U-Net skip connections for improved reconstruction.
- Empirical results show significant boosts in performance, with notable improvements in PSNR for reflection removal and IoU for glass detection.
The Reflection Guided Attention Module (RGAM) is a distinct class of neural attention modules that leverage explicit reflection-aware signals to guide feature selection for spatial reconstruction and semantic segmentation in vision tasks involving reflections, notably single image reflection removal (SIRR) and glass surface detection. RGAM achieves effective feature fusion and dynamic gating by learning to selectively emphasize or suppress particular features based on local reflection properties, thus improving robustness in challenging domains where linear superposition or global cues often become insufficient.
1. Core Motivation and Conceptual Foundations
RGAM addresses the inherent challenges in distinguishing between transmission (clean scene) and reflection layers in images acquired through glass or other reflective media, as well as accurately localizing transparent surfaces in unconstrained environments. In reflection removal, the observed image is typically modeled as a sum $I = T + R$ of transmission $T$ and reflection $R$. However, strict additivity breaks down in areas of strong reflection, requiring context-driven inpainting rather than mere subtraction. In glass surface detection, differentiating glass from surrounding materials is complicated by transparency and context dependence, with reflections acting as implicit evidence for glass presence.
Key to RGAM’s motivation is the observation that:
- Reflection-heavy regions necessitate a mode switch from difference-based recovery to context encoding or inpainting.
- Weak/no-reflection regions permit direct reliance on reflection-suppressed difference features.
- For glass surface detection, regions exhibiting both high reflection and glass-like morphology are the most discriminative.
By learning spatial- and channel-wise attention masks or joint attention maps, RGAM dynamically arbitrates between locally valid linear models and context-driven semantic cues across varying reflection intensities (Li et al., 2020, Yan et al., 21 Nov 2025).
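A toy numeric sketch makes the breakdown of additivity concrete (the pixel values below are hypothetical, chosen only to expose the saturation effect): once the sensor clips, even a perfect reflection estimate cannot be subtracted away in strong-reflection regions.

```python
# I = clip(T + R) is no longer T + R where reflection is strong, so
# difference-based recovery fails exactly where context is needed.
import numpy as np

T = np.array([0.30, 0.60, 0.80])   # transmission intensities
R = np.array([0.05, 0.30, 0.70])   # reflection intensities
I = np.clip(T + R, 0.0, 1.0)       # sensor saturates at 1.0

print(I - R)                       # [0.30, 0.60, 0.30] -- last pixel should be 0.80
```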
2. Detailed Architectures in Reflection Removal and Glass Detection
2.1. Two-Stage SIRR with Reflection-Aware Guidance (Li et al., 2020)
RGAM (termed “RAG module”) is embedded at each decoder stage within a two-stage U-Net cascade:
- Stage 1: A single-input U-Net $G_R$ predicts the reflection map $\hat{R}$ from the observation $I$.
- Stage 2: A dual-encoder U-Net $G_T$ receives both the observation $I$ and the previously estimated reflection $\hat{R}$.
- At decoder level $i$, the module:
- Computes the "difference feature" $\Phi_d^i = \Phi_I^i - \Phi_{\hat{R}}^i$ from the two encoder outputs.
- Concatenates $\Phi_I^i$ and $\Phi_{\hat{R}}^i$ and passes the result through two consecutive $1\times1$ convolutions and a sigmoid to predict per-channel spatial masks $M^i$.
- Stacks the masks and reweights/combines the difference feature via partial convolution with $M^i$.
- Processes the outputs with a convolution + ReLU stack.
This “gating” adapts skip connections and decoding based on local reflection confidence.
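A minimal PyTorch sketch of such a gated skip connection follows. The class name `RAGBlock` and all tensor names are illustrative rather than the authors' identifiers, and the partial convolution is simplified here to elementwise gating plus a 3×3 convolution (Section 3.1 gives the renormalized form).

```python
import torch
import torch.nn as nn

class RAGBlock(nn.Module):
    """Gated skip connection: masks gate the encoder difference feature."""
    def __init__(self, channels: int):
        super().__init__()
        # Two consecutive 1x1 convs + sigmoid predict per-channel spatial masks.
        self.mask_pred = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Simplified stand-in for the partial convolution over the masked
        # difference feature (no mask renormalization here).
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_obs: torch.Tensor, feat_ref: torch.Tensor) -> torch.Tensor:
        diff = feat_obs - feat_ref                        # "difference feature"
        mask = self.mask_pred(torch.cat([feat_obs, feat_ref], dim=1))
        return self.fuse(mask * diff)                     # gated skip feature

x, r = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(RAGBlock(64)(x, r).shape)                           # torch.Size([1, 64, 32, 32])
```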
2.2. Multi-Scale Fusion in NFGlassNet (Yan et al., 21 Nov 2025)
In glass detection, RGAM fuses per-scale features from two backbone streams (flash, no-flash) and a reflection feature from the Reflection Contrast Mining Module (RCMM):
For each scale $s$, RGAM receives:
- $F_g^s$: a convolutional projection of the concatenated no-flash and flash features.
- $F_r^s$: the reflection feature from RCMM.
- Both $F_g^s$ and $F_r^s$ are reshaped and normalized per head in a multi-head architecture.
- Two parallel cross-attention branches are computed:
- The "top" branch uses $F_r^s$ as query and $F_g^s$ as key/value.
- The "bottom" branch reverses the roles.
- The attention maps $A_{\mathrm{top}}$ and $A_{\mathrm{bot}}$ are shifted (minimum-zeroed), multiplicatively fused, and softmaxed to yield a shared attention map $A$.
- Final features from both sources are reweighted by $A$ and summed to produce $F_{\mathrm{out}}^s$, which is used by the decoder to generate glass masks.
This approach emphasizes those regions expressing both glass-consistent structure and flash-induced reflections.
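A single-head PyTorch sketch of this fusion follows; the function name and flattened $(B, N, C)$ layout are assumptions, the query/key/value projections are folded into the inputs for brevity, and the paper's multi-head variant additionally splits channels per head and applies LayerNorm.

```python
import torch
import torch.nn.functional as F

def shared_cross_attention(f_g: torch.Tensor, f_r: torch.Tensor) -> torch.Tensor:
    """f_g: glass-stream feature (B, N, C); f_r: reflection feature (B, N, C)."""
    d = f_g.shape[-1]
    # Top branch: reflection queries attend to glass keys; bottom branch reverses roles.
    a_top = f_r @ f_g.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    a_bot = f_g @ f_r.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    # Shift each map so its minimum is zero, fuse multiplicatively, softmax.
    a_top = a_top - a_top.amin(dim=(-2, -1), keepdim=True)
    a_bot = a_bot - a_bot.amin(dim=(-2, -1), keepdim=True)
    a = F.softmax(a_top * a_bot, dim=-1)                  # shared attention A
    # Reweight both sources with the shared map and sum them.
    return a @ f_g + a @ f_r

f_g = torch.randn(2, 256, 64)   # e.g. 16x16 positions flattened, 64 channels
f_r = torch.randn(2, 256, 64)
print(shared_cross_attention(f_g, f_r).shape)             # torch.Size([2, 256, 64])
```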
3. Mathematical Formalization
3.1. Reflection Removal (Li et al., 2020)
- Difference Feature: $\Phi_d^i = \Phi_I^i - \Phi_{\hat{R}}^i$, where $\Phi_I^i$ and $\Phi_{\hat{R}}^i$ denote the level-$i$ encoder features of the observation and the estimated reflection.
- Learned Masking: $M^i = \sigma\!\left(\mathrm{conv}_{1\times1}\!\left(\mathrm{conv}_{1\times1}\!\left([\Phi_I^i,\, \Phi_{\hat{R}}^i]\right)\right)\right)$, with $\sigma$ the sigmoid and $[\cdot\,,\cdot]$ channel-wise concatenation.
- Partial Convolution: $\Phi_o^i = \mathrm{PConv}(\Phi_d^i, M^i)$, where $\mathrm{PConv}(x, M) = W^{\top}\!\left(x \odot M\right)\frac{\mathrm{sum}(\mathbf{1})}{\mathrm{sum}(M)} + b$, so each output is renormalized by the mask coverage of its receptive field (a minimal implementation follows this list).
- Mask Loss: a dedicated term $\mathcal{L}_{\mathrm{mask}}$ supervises $M^i$, penalizing high activations in strong-reflection regions and low activations in weak-reflection regions (see Section 5).
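A minimal implementation of the renormalized partial convolution above, assuming a soft mask in $[0, 1]$; the helper name `partial_conv` and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def partial_conv(x, mask, weight, bias, eps: float = 1e-8):
    """x, mask: (B, C_in, H, W); weight: (C_out, C_in, k, k); bias: (C_out,)."""
    pad = weight.shape[-1] // 2
    out = F.conv2d(x * mask, weight, bias=None, padding=pad)
    # Mask coverage of each receptive field, via an all-ones kernel.
    coverage = F.conv2d(mask, torch.ones_like(weight), bias=None, padding=pad)
    scale = weight[0].numel() / (coverage + eps)          # sum(1) / sum(M)
    return out * scale + bias.view(1, -1, 1, 1)

x = torch.randn(1, 8, 16, 16)
m = torch.rand(1, 8, 16, 16)       # soft per-channel mask M
w = torch.randn(4, 8, 3, 3)
b = torch.zeros(4)
print(partial_conv(x, m, w, b).shape)                     # torch.Size([1, 4, 16, 16])
```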
3.2. Glass Detection (Yan et al., 21 Nov 2025)
- Feature Construction: $F_g^s = \mathrm{conv}\!\left([F_{\mathrm{nf}}^s,\, F_{\mathrm{f}}^s]\right)$ from the no-flash and flash backbone features; $F_r^s$ is produced by RCMM.
- Cross-Attention (two branches): $A_{\mathrm{top}} = Q_r K_g^{\top}/\sqrt{d}$, with query $Q_r$ projected from $F_r^s$ and key/value $K_g, V_g$ projected from $F_g^s$; symmetrically, $A_{\mathrm{bot}} = Q_g K_r^{\top}/\sqrt{d}$ in the alternate branch.
- Fusing: $A = \mathrm{softmax}\!\left((A_{\mathrm{top}} - \min A_{\mathrm{top}}) \odot (A_{\mathrm{bot}} - \min A_{\mathrm{bot}})\right)$, and $F_{\mathrm{out}}^s = A V_g + A V_r$.
4. Implementation Specifics and Parameterization
| Model | Feature Preparation | Attention/Gating | Downstream Usage |
|---|---|---|---|
| RAGNet (Li et al., 2020) | Dual encoders; encoder difference | Channel/spatial masks + PConv | Decoder block gating |
| NFGlassNet (Yan et al., 21 Nov 2025) | Backbone streams + RCMM | Dual-head cross-attention fusion | Scale-wise multi-stream fusion |
- The reflection removal RGAM uses 1×1 convolutions for mask prediction, 3×3 partial convolutions, and per-channel mask splitting; no batch normalization appears in the attention submodule.
- The glass detection RGAM uses Kaiming/Xavier initialization for its projections, splits channels across multiple heads, applies LayerNorm, uses no explicit dropout in the module, and fuses feature maps at multiple scales (an initialization sketch follows below).
Both designs prioritize local adaptive gating to modulate information flow depending on reflection presence or glass context.
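A brief sketch of the initialization scheme referenced above; assigning Kaiming to convolutions and Xavier to linear projections is an assumption, since the section names both schemes without mapping them to specific layers.

```python
import torch.nn as nn

def init_projections(module: nn.Module) -> None:
    """Apply Kaiming init to convs and Xavier init to linear projections."""
    for m in module.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```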
5. Training Objectives and Losses
In reflection removal (Li et al., 2020), RGAM receives direct loss supervision on its mask outputs through a dedicated term $\mathcal{L}_{\mathrm{mask}}$, penalizing incorrect mask activations in strong- or weak-reflection regions. The full objective combines reconstruction, perceptual, exclusion, adversarial, and mask-specific losses: $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{perc}}\mathcal{L}_{\mathrm{perc}} + \lambda_{\mathrm{excl}}\mathcal{L}_{\mathrm{excl}} + \lambda_{\mathrm{adv}}\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{mask}}\mathcal{L}_{\mathrm{mask}}$, with scalar weights $\lambda_{(\cdot)}$ balancing the terms.
In glass detection (Yan et al., 21 Nov 2025), RGAM does not receive direct supervision; it is optimized solely via the end-to-end task losses, which include IoU and binary cross-entropy terms for glass mask prediction and a supervision term for reflection estimation.
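One plausible form of the mask supervision, assuming the masks gate the difference feature down in strong-reflection regions (consistent with the mode switch in Section 1); the region indicator and threshold below are placeholders, not the paper's definition.

```python
import torch

def mask_loss(pred_mask: torch.Tensor, strong_region: torch.Tensor) -> torch.Tensor:
    """pred_mask: predicted gates in [0, 1]; strong_region: hypothetical binary
    map of strong-reflection pixels (however those regions are labeled)."""
    # Penalize high gates where reflection is strong (context/inpainting mode)
    # and low gates where it is weak (difference mode).
    return (strong_region * pred_mask
            + (1.0 - strong_region) * (1.0 - pred_mask)).mean()

m = torch.rand(1, 64, 32, 32)                       # per-channel masks M^i
s = (torch.rand(1, 1, 32, 32) > 0.7).float()        # placeholder region map
print(mask_loss(m, s))
```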
6. Empirical Ablations and Observed Effects
Ablation studies in both domains confirm RGAM’s critical role:
- In SIRR (Li et al., 2020), removing the difference feature or using naïve skip connections induces marked PSNR drops on both the Real20 set and the SIR$^2$ Wild set; disabling the learned masks or collapsing them to single-channel masks likewise measurably reduces restoration quality.
- For glass detection (Yan et al., 21 Nov 2025), ablations show that RGAM yields a clear IoU gain; omitting the shared attention reduces IoU, replacing the cross-stream querying ("alternate-Q") loses $0.9$–$1.3$ IoU points, and removing the attention-shift step likewise degrades IoU.
Qualitative outputs display sharper, artifact-free transmission predictions and precise glass mask localization in regions where both reflection and glass indicators co-occur.
7. Interpretation and Context within the Field
RGAM modules generalize the use of explicit physical reasoning—reflection detection or suppression—via learnable dynamic attention mechanisms at both feature and spatial scales. The studied variants demonstrate that channel-wise and spatially varying gating, informed by learned or mined reflection signals, outperforms global or hard-coded approaches. These results substantiate that dynamically adaptive fusion, as enabled by RGAM, respects nonuniform physical priors (e.g., breakdown of linear superposition) and enhances performance in scenarios with ambiguous or spatially varying cues.
While the specific instantiations in reflection removal and glass detection differ architecturally, RGAM consistently outperforms naïve fusion or attentionless baselines, supporting its generality as a reflection-aware feature fusion paradigm (Li et al., 2020, Yan et al., 21 Nov 2025).