Single-Image Reflection Removal (SIRR)
- Single-image reflection removal is a computer vision problem that separates the clean scene (transmission) from unwanted reflections in a single image.
- Recent advances leverage deep learning architectures—from encoder-decoder CNNs to transformers and diffusion models—with frequency analysis and physically-inspired priors to enhance performance.
- Ongoing challenges include handling strong, spatially-varying reflections, scaling to ultra-high-definition inputs, and achieving robust domain generalization.
Single-Image Reflection Removal (SIRR) is the fundamental computer vision problem of recovering the underlying transmission image from a single photograph acquired through a reflective medium, most often glass, without any auxiliary supervision such as additional images or sensor modalities. This task is ill-posed due to the non-unique and scene-dependent combination of reflection and transmission components, as well as the variety of real-world degradations introduced by optical phenomena, scene geometry, and non-ideal capture conditions. Research in SIRR has accelerated in recent years, driven by advances in deep learning, the development of more realistic datasets and benchmarks, and the introduction of physically-inspired priors and transformer-based architectures. This article surveys the technical landscape of SIRR, spanning problem modeling, algorithmic paradigms, network architectures, loss functions, benchmarks, and performance trends as evident in contemporary research.
1. Problem Definition and Physical Image Formation
Single-image reflection removal is governed by the assumption that a measured image is a mixture of two latent layers: the transmission (the target, or clean scene) and the reflection (unwanted contribution), with the canonical linear model

$$I(x) = T(x) + R(x),$$

where $x$ indexes pixel coordinates (Yang et al., 12 Feb 2025). This simplified model fails to capture attenuation, ghosting, non-uniform blending, and other non-linear effects arising from glass optics (e.g., wavelength-dependent transmission, double reflection due to glass thickness, or spatially varying coefficients). For greater realism, several works adopt a generalized form such as

$$I = g(T) + f(R),$$

where $g$ and $f$ are nonlinear degradations (blur, attenuation, ghosting, color shift, etc.) (Han et al., 2023). Some models introduce per-pixel mixing coefficients, $I(x) = \alpha(x)\,T(x) + (1 - \alpha(x))\,R(x)$ for $\alpha(x) \in [0, 1]$, or non-linear physical renderings via path-tracing (Guo et al., 12 Jan 2026).
The ill-posedness lies in the non-uniqueness of the decomposition: for a given $I$ and unknown $T$, $R$, and model parameters, multiple solutions exist. Modern methods thus rely critically on trained, data-driven priors, often grounded in massive supervised, semi-supervised, or self-supervised learning (Yang et al., 12 Feb 2025, Lu et al., 2024).
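The composition models above can be sketched as a small synthesis routine of the kind commonly used to generate paired training data; the box blur standing in for defocus and the mixing coefficient value are illustrative assumptions, not any cited paper's settings:

```python
import numpy as np

def box_blur(img, radius=1):
    """Crude box blur via shifted averaging (stand-in for a defocus kernel)."""
    shifts = range(-radius, radius + 1)
    acc = sum(np.roll(np.roll(img, dy, axis=0), dx, axis=1)
              for dy in shifts for dx in shifts)
    return acc / float((2 * radius + 1) ** 2)

def compose_reflection(T, R, alpha=0.8):
    """Synthesize a mixture image I from transmission T and reflection R.

    Linear model:       I = T + R
    Generalized model:  I = alpha * T + (1 - alpha) * blur(R),
    since reflected layers are often defocused; alpha may be a scalar
    or a per-pixel map in [0, 1].
    """
    I = alpha * T + (1.0 - alpha) * box_blur(R)
    return np.clip(I, 0.0, 1.0)

# Toy usage with random layers in [0, 1]
rng = np.random.default_rng(0)
T = rng.random((64, 64, 3))
R = rng.random((64, 64, 3))
I = compose_reflection(T, R, alpha=0.8)
```

Supervised pipelines render many such $(I, T)$ pairs and train a network to invert the composition.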
2. Algorithmic Paradigms and Network Architectures
SIRR methodologies have evolved through several dominant paradigms:
- Single-Stage Encoder–Decoder CNNs: Early and standard methods employ U-Net–like or ResNet-style architectures that directly map $I$ to $T$ and sometimes $R$; examples include ERRNet, Zhang et al. (CVPR 2018), and various fully-convolutional baselines (Yang et al., 12 Feb 2025).
- Two-Stage or Cascaded Networks: These frameworks estimate either $R$ or $T$ (or supporting features such as edge maps or alpha masks) in a preliminary stage, then refine the prediction via a second network. RAGNet (reflection-first, then mask-guided transmission) and CoRRN (edge-aware cascade) are representative (Li et al., 2020).
- Multi-Stage and Recurrent Networks: Iterative approaches such as IBCLN leverage cascaded or LSTM-powered refinement sweeps, boosting the $T$ and $R$ estimates alternately and propagating information across iterations (Li et al., 2019). LANet and IGEN use recurrent Laplacian or gradient-based encoding (Dong et al., 2020, Bera et al., 2021).
- Transformer and Attention-Driven Models: Recent advances have seen the introduction of U-shaped transformer architectures, often interleaved with convolutional modules and frequency-domain processing. F2T2-HiT combines FFT-based transformer blocks with hierarchical window attention (Cai et al., 5 Jun 2025). PromptRR injects frequency (LF/HF) prompts into a transformer-based backbone via adaptive prompt blocks (Wang et al., 2024).
- Diffusion and Generative Models: Diffusion models, including denoising diffusion probabilistic models (DDPMs) and diffusion transformers (DiT/FLUX.1), have been adapted to SIRR, either in a self-supervised, cycle-consistent setting (Lu et al., 2024), as prompt generators (Wang et al., 2024), or as foundation restoration models adapted via LoRA (Zakarin et al., 4 Dec 2025, Guo et al., 12 Jan 2026).
- Plug-in Priors and Interpretability: Explicit priors such as region-adaptive intensity maps (RPEN) (Han et al., 2023), sparsity and exclusion constraints (DExNet) (Huang et al., 3 Mar 2025), and ranged depth guidance (Elnenaey et al., 2024) encode both physical and learned cues, augmenting or constraining the main restoration pathway.
The spectrum of model sizes and representation power is wide: compact, interpretable unfolded-optimization networks like DExNet (9.66M parameters) rival or exceed large, black-box transformer models in performance, owing to explicit modeling via exclusion priors and sparsity (Huang et al., 3 Mar 2025).
3. Frequency Analysis, Prompting, and Specialized Modules
A major insight is that transmission and reflection differ systematically in frequency content: low-frequency (LF) and high-frequency (HF) features respond distinctively to $T$ and $R$. PromptRR operationalizes this by (i) pre-training a frequency prompt encoder to extract LF/HF signatures from the ground-truth $T$, (ii) training diffusion models to synthesize such prompts from the input $I$, and (iii) injecting them into a transformer backbone at each scale (Wang et al., 2024). Ablative studies demonstrate that diffusion-generated prompts outperform CNN-based alternatives by >4 dB in PSNR.
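The LF/HF decomposition underlying such frequency prompting can be illustrated with an ideal Fourier-domain split; the hard circular mask and cutoff fraction here are illustrative choices, not PromptRR's learned encoder:

```python
import numpy as np

def frequency_split(img, cutoff=0.1):
    """Split a grayscale image into low- and high-frequency components
    using an ideal (hard) circular mask in the Fourier domain.

    cutoff is the mask radius as a fraction of the image diagonal
    (illustrative; real methods learn or tune this decomposition).
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    cy, cx = h / 2.0, w / 2.0
    radius = cutoff * np.hypot(h, w)
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    low = np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
    high = np.real(np.fft.ifft2(np.fft.ifftshift(F * ~mask)))
    return low, high

img = np.random.default_rng(1).random((32, 32))
low, high = frequency_split(img)
# Since the two masks partition the spectrum, low + high reconstructs img
# up to floating-point error.
```

The LF component captures smooth shading shared with the transmission, while the HF component concentrates edges, where the two layers interfere most.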
The F2T2-HiT framework targets the persistence of long-range, high-span reflections via dual-branch spectral-spatial attention: FFT domain processing via transformer-attention in frequency space enables the model to suppress globally spread reflection cues, which are otherwise difficult for local convolutions to separate (Cai et al., 5 Jun 2025).
Reflection location awareness is another advanced module: Maximum Reflection Filters (MaxRF) compute local spatial maps by comparing gradient magnitudes in $I$ and $T$, generating region masks that indicate probable reflection dominance (Zhu et al., 2023). These masks, predicted by a dedicated reflection detection network (RDNet), guide a subsequent removal network (RRNet), substantially improving artifact suppression and detail preservation in real-world scenes.
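The gradient-comparison idea behind such reflection-dominance masks can be sketched as follows; the ratio threshold and plain finite-difference gradients are illustrative assumptions, not the exact MaxRF formulation:

```python
import numpy as np

def grad_mag(img):
    """Finite-difference gradient magnitude of a 2-D image."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def reflection_dominance_mask(I, T, eps=1e-6, thresh=1.5):
    """Flag pixels where the mixture's gradients clearly exceed the
    transmission's, suggesting reflection-dominated regions.
    thresh is an illustrative choice, not a published value.
    """
    ratio = grad_mag(I) / (grad_mag(T) + eps)
    return ratio > thresh

# Toy usage: a flat transmission with one bright reflection patch
T = np.zeros((16, 16))
I = T.copy()
I[4:8, 4:8] = 1.0
mask = reflection_dominance_mask(I, T)
```

In a full pipeline such a mask would be predicted by a network rather than thresholded by hand, and then fed to the removal stage as spatial guidance.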
4. Loss Functions and Training Objectives
Single-image reflection removal architectures are typically optimized under a combination of complementary loss functions:
- Pixel-wise Fidelity (L1/L2): $\|\hat{T} - T\|_1$ or $\|\hat{T} - T\|_2^2$, critical for high-PSNR restoration.
- Gradient/Edge Losses: Enforce edge alignment by matching spatial derivatives: $\|\nabla \hat{T} - \nabla T\|_1$ (Yang et al., 12 Feb 2025).
- Perceptual Losses: Compare deep VGG features between $\hat{T}$ and $T$, typically focusing on conv2_2, conv3_2, conv5_2, preserving higher-order features and style (Li et al., 2020, Huang et al., 3 Mar 2025).
- Exclusion Losses: Penalize overlap between $\hat{T}$ and $\hat{R}$ in edge/gradient space, ensuring sparse, non-interfering representations (Li et al., 2020, Huang et al., 3 Mar 2025).
- Adversarial Losses: PatchGAN or relativistic GAN losses (used in ReflectNet, RRFormer) enforce naturalness and suppress subtle artifacts (Birhala et al., 2021, Zhang et al., 2023).
- Prompt/Diffusion/Prompted Perceptual Losses: Specialized objectives train frequency prompt encoders (PromptRR), joint diffusion step targets (PromptRR, WindowSeat), or reinforce cycle consistency in self-supervised settings (Wang et al., 2024, Lu et al., 2024).
- Auxiliary/Projection Losses: Auxiliary losses in DExNet directly constrain exclusion variables, and multi-step losses in iterative/refinement networks encourage progressive improvement, penalizing static or unchanged outputs (Huang et al., 3 Mar 2025, Elnenaey et al., 2024).
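A minimal sketch of how the fidelity, gradient, and exclusion terms combine, in NumPy rather than a deep-learning framework for brevity; the weights and the tanh-normalized exclusion form are illustrative assumptions:

```python
import numpy as np

def grads(img):
    """Finite-difference gradients (d/dy, d/dx) of a 2-D image."""
    gy, gx = np.gradient(img)
    return gx, gy

def pixel_loss(T_hat, T):
    """L1 fidelity term."""
    return np.mean(np.abs(T_hat - T))

def gradient_loss(T_hat, T):
    """Edge-alignment term: L1 distance between spatial derivatives."""
    gx1, gy1 = grads(T_hat)
    gx2, gy2 = grads(T)
    return np.mean(np.abs(gx1 - gx2)) + np.mean(np.abs(gy1 - gy2))

def exclusion_loss(T_hat, R_hat):
    """Penalize correlated edges between the two predicted layers
    (one common formulation multiplies normalized gradient magnitudes)."""
    gxT, gyT = grads(T_hat)
    gxR, gyR = grads(R_hat)
    return np.mean(np.tanh(np.hypot(gxT, gyT)) * np.tanh(np.hypot(gxR, gyR)))

def total_loss(T_hat, R_hat, T, w_pix=1.0, w_grad=0.5, w_excl=0.1):
    # Weights are illustrative, not taken from any cited paper.
    return (w_pix * pixel_loss(T_hat, T)
            + w_grad * gradient_loss(T_hat, T)
            + w_excl * exclusion_loss(T_hat, R_hat))
```

A perfect prediction ($\hat{T} = T$, edge-free $\hat{R}$) drives every term to zero, while correlated edges between the predicted layers raise the exclusion term even when pixel fidelity is good.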
Loss ablation studies consistently reveal that perceptual and exclusion losses boost PSNR by ~0.5–2 dB (with corresponding SSIM gains), and that omitting prompt interaction or region guidance leads to marked performance degradation (Wang et al., 2024, Zhu et al., 2023, Huang et al., 3 Mar 2025).
5. Datasets, Evaluation Metrics, and Benchmarking
High-quality, aligned benchmarks are a prerequisite for quantitative SIRR studies:
- Synthetic Datasets: CEILNet (7,643 pairs), RID, CDR, SIR, synthetic LoRA training sets for foundation models (WindowSeat, SIRR-LMM) (Li et al., 2019, Guo et al., 12 Jan 2026).
- Real-World and Large-Scale Benchmarks: Nature (220 aligned pairs), Real (Zhang et al. 2018, 89 train/20 test), RRW (14,952 pairs, pixel-aligned) (Zhu et al., 2023), OpenRR-5k (5,300 pairs, smartphone-captured, hand-refined) (Cai et al., 5 Jun 2025), UHDRR4K/UHDRR8K (UHD, up to 7680×4320 px) (Zhang et al., 2023).
- Task-Specific Datasets: Museum Reflection Removal (MRR, 2.3k images), CDR (categorized, sharp/blurred/ghosting), SIR Postcard/Solid/Wild splits for stratified evaluation (Lu et al., 2024, Han et al., 2023).
Metrics:
| Metric | Formula / Intent |
|---|---|
| PSNR | $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, in dB |
| SSIM | Structural similarity index over local windows |
| LPIPS | Perceptual distance, learned from VGG/ResNet features |
| DISTS | Deep structure+texture similarity |
| NIQE | No-reference, “naturalness” score |
| RAM | Fourier-based reflection artifact measure (Lu et al., 2024) |
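The PSNR entry in the table, written out as a minimal sketch:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE), in dB."""
    mse = np.mean((x - y) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)  # uniform error of 0.1 -> MSE = 0.01
# psnr(a, b) == 10 * log10(1 / 0.01) == 20 dB
```

SSIM, LPIPS, and DISTS require windowed statistics or pretrained feature extractors and are typically taken from library implementations rather than re-derived.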
Recent models consistently report PSNR ≥ 25 dB and SSIM ≥ 0.90 on SIR, with the state of the art (for instance, SIRR-LMM) reaching PSNR 26.2 dB / SSIM 0.913 (Guo et al., 12 Jan 2026).
6. Empirical Advances and Model Comparisons
The field has seen rapid performance improvement, with the following trends:
- Transformers surpass CNNs for complex, spatially distributed reflections and UHD images: RRFormer achieves 24.71 dB/0.971 SSIM (4K), outperforming all CNN-based competitors (Zhang et al., 2023).
- Frequency and prompt-guided modules (PromptRR, F2T2-HiT) yield higher fidelity, particularly in scenes with diffuse or global reflection spread (Wang et al., 2024, Cai et al., 5 Jun 2025).
- DExNet demonstrates that model-based, lightweight (9.66M parameters) unfolded architectures can match or exceed PSNR/SSIM of designs 10–12× larger, owing to explicit exclusion priors (Huang et al., 3 Mar 2025).
- Diffusion/adaptive diffusion transformers (WindowSeat, SIRR-LMM) achieve state-of-the-art with high photorealism, generalization to in-the-wild cases, and efficient LoRA fine-tuning (Zakarin et al., 4 Dec 2025, Guo et al., 12 Jan 2026).
- Domain generalization via expert ensemble and reflection-type-aware weighting (RTAW + AdaNEC) substantially raises robustness across datasets (Liu et al., 2022).
- Cycle-consistent and self-supervised approaches (diffusion-based) now reduce or eliminate the dependence on paired supervision, closing the synthetic–real performance gap (Lu et al., 2024).
A representative results table (averaged over SIR, Real, Nature, etc.):
| Method | PSNR (dB) | SSIM | Notes |
|---|---|---|---|
| DExNet (Huang et al., 3 Mar 2025) | 25.96 | 0.912 | 9.66M params, lightweight |
| F2T2-HiT (Cai et al., 5 Jun 2025) | 25.57 | 0.894 | FFT+HiT transformer backbone |
| PromptRR (Wang et al., 2024) | 24.04* | — | Diffusion prompt, transformer |
| RRFormer UHD (Zhang et al., 2023) | 24.71 | 0.971 | 4K images, transformer backbone |
| SIRR-LMM (Guo et al., 12 Jan 2026) | 26.2 | 0.913 | Large-Multimodal, LoRA-adapted |
*Note: for PromptRR, PSNR refers to best ablation; main results in text.
Region-adaptive, intensity-guided, and mask-aware networks (RPEN+PRRN (Han et al., 2023), MaxRF (Zhu et al., 2023)) extend SIRR to scenes with spatially varying reflection strength and complex spatial structures, setting new state-of-the-art on CDR, RRW, and SIR.
7. Limitations, Open Challenges, and Future Directions
Open challenges persist in SIRR:
- Strong, large-area, or occluding reflections remain fundamentally ambiguous; the optimal trade-off between reflection suppression and background fidelity is under debate.
- UHD and extremely-high-resolution scenes present computational and memory bottlenecks for existing architectures; sparse/dilated attention and efficient transformer kernels are active research areas (Zhang et al., 2023).
- Domain adaptation and out-of-distribution generalization: explicit domain-expert ensembles and attention-weighted fusion outperform joint training but at a computational cost (Liu et al., 2022).
- Data and evaluation: High-fidelity, pixel-aligned, large-scale real datasets (OpenRR-5k, RRW) are essential but expensive to curate; artifact-free collection and standardized evaluation suites are critical for fair comparison.
- Multimodal cues (depth, polarization, flash/no-flash, text guidance) and physics-based priors (path tracing, spectral rendering) are promising but require more dataset support (Guo et al., 12 Jan 2026, Elnenaey et al., 2024).
- Real-time and on-device deployment is not universally achieved, as larger foundation and diffusion models often entail high computational cost at inference (Zakarin et al., 4 Dec 2025).
Future research will likely focus on scalable foundation models, unsupervised and data-efficient learning, plug-in physical priors, multimodal integration, and open benchmarking (Yang et al., 12 Feb 2025). Standardized, modular evaluation platforms are called for to synthesize progress across algorithm designs and datasets.
In summary, single-image reflection removal has evolved from heuristic, hand-crafted pipelines toward data-driven, physically inspired, and foundation-scale learning. Recent innovations—frequency prompts, diffusion-generated guidance, transformer-driven context fusion, unfolded optimization with exclusion priors, and region-intensity priors—have enabled robust, high-fidelity reflection removal in real-world, cross-domain, and ultra-high-definition contexts, with continued advances expected in scalability, generalization, and interpretability (Yang et al., 12 Feb 2025, Guo et al., 12 Jan 2026, Wang et al., 2024, Zhu et al., 2023, Zhang et al., 2023, Huang et al., 3 Mar 2025).