- The paper introduces AquaDiff, a diffusion-based framework leveraging cross-attention with physics-inspired chromatic priors for underwater image enhancement.
- It employs a U-Net variant enriched with residual dense blocks and multi-resolution attention to effectively suppress artifacts and recover fine structural details.
- Experimental evaluations show superior UCIQE scores and competitive PSNR/SSIM metrics, demonstrating robust mitigation of wavelength-dependent color distortions.
Diffusion-Based Underwater Image Enhancement with AquaDiff
Introduction
AquaDiff introduces a conditional diffusion-based approach for underwater image enhancement, specifically targeting the mitigation of wavelength-dependent color distortion while maintaining perceptual and structural fidelity. Underwater imagery is uniquely challenging due to complex physical phenomena such as selective light attenuation and multi-path scattering, which significantly impair the performance of computer vision systems. Traditional methods, both model-free and physically inspired, and recent data-driven CNN/GAN-based models each have notable limitations regarding generalization, color fidelity, and artifact suppression under severe and diverse degradation. Diffusion models, leveraging strong generative priors and iterative denoising, offer a compelling foundation for robust enhancement but have yet to be fully adapted to underwater-specific degradations.
AquaDiff Framework
AquaDiff employs a DDPM-inspired architecture, integrating mechanisms tailored to underwater scenarios. The overall framework is depicted in Figure 1.
Figure 1: Overview of the AquaDiff framework, illustrating the interplay between forward diffusion, chromatic prior conditioning, and reverse denoising via cross-attention.
The forward diffusion adds Gaussian noise to clean reference images over T steps, culminating in highly noisy latent representations. The reverse process leverages a conditional diffusion model, where at each denoising step, the model receives the current noisy image, a chromatic prior-guided conditioning image, and the timestep index. The core denoising backbone is a U-Net variant with three residual dense blocks, rich skip connectivity, and multi-resolution spatial attention, facilitating hierarchical feature extraction and global-local context merging.
Critically, conditioning is performed via cross-attention rather than direct concatenation: features of the chromatic prior are dynamically fused with the evolving noisy state. This cross-attention enables timestep-dependent, spatially selective conditioning, leveraging the color-compensated prior for detail and context recovery at all noise levels.
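As a sketch of this idea, a cross-attention conditioning layer can be written as follows. The module name, head count, and residual fusion are illustrative assumptions, not the paper's exact design: queries come from the denoiser's current feature map, while keys and values come from features of the chromatic prior.

```python
import torch
import torch.nn as nn

class CrossAttentionCondition(nn.Module):
    """Fuse chromatic-prior features into the denoiser's feature map.

    Queries come from the current (noisy) feature map; keys and values
    come from features of the color-compensated prior. Illustrative
    sketch, not the paper's exact module.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # x, prior: (B, C, H, W) feature maps at the same resolution
        b, c, h, w = x.shape
        q = self.norm_q(x.flatten(2).transpose(1, 2))        # (B, HW, C)
        kv = self.norm_kv(prior.flatten(2).transpose(1, 2))  # (B, HW, C)
        out, _ = self.attn(q, kv, kv)                        # attend over prior
        return x + out.transpose(1, 2).view(b, c, h, w)      # residual fusion
```

Applying such a layer at multiple feature resolutions is one way to realize the multi-resolution spatial attention described above.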
Chromatic Prior-Guided Conditioning
The chromatic prior is generated using a physics-inspired three-channel color compensation (3C) method, operating in Lab color space to suppress color casts by reconstructing attenuated chromatic channels via spatial masking and Gaussian smoothing. The mask is adaptively generated to avoid overcompensation near highlights and is merged back into the input for cross-attention-based guidance. This preprocessing crucially embeds wavelength attenuation statistics into the model's conditioning signal, aligning the restoration process with the physical characteristics of underwater image formation.
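A minimal sketch of such a Lab-space compensation step is shown below, assuming a gray-world-style correction of the chromatic channels, a smoothed highlight mask, and Gaussian smoothing. The exact 3C compensation rule is not reproduced here, so the formula in this sketch is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage import color

def chromatic_prior(rgb: np.ndarray, sigma: float = 15.0,
                    highlight_thresh: float = 0.9) -> np.ndarray:
    """Illustrative 3C-style chromatic compensation in Lab space.

    Sketch of the idea only: push attenuated chromatic channels toward a
    neutral (zero-mean) cast, masked away from highlights to avoid
    overcompensation. The exact rule used by the paper may differ.
    """
    lab = color.rgb2lab(rgb)  # L in [0, 100]; a, b roughly in [-128, 127]
    L = lab[..., 0]

    # Mask out highlights so bright regions are not over-compensated,
    # then soften the mask with Gaussian smoothing.
    mask = (L / 100.0 < highlight_thresh).astype(np.float64)
    mask = gaussian_filter(mask, sigma)

    # Gray-world-style correction: subtract the smoothed channel mean,
    # weighted by the highlight-aware mask.
    for ch in (1, 2):
        smoothed = gaussian_filter(lab[..., ch], sigma)
        lab[..., ch] = lab[..., ch] - mask * smoothed.mean()

    return np.clip(color.lab2rgb(lab), 0.0, 1.0)
```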
Diffusion and Denoising Network
The forward process introduces progressive, schedule-controlled noise. Sampling at arbitrary timesteps is performed analytically using closed-form marginalization. The reverse process iteratively reconstructs clean images by predicting and removing additive noise, proceeding from the latent Gaussian through cross-attention fusion with the color-compensated prior.
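The closed-form marginal q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I) can be sampled directly at any timestep. The linear beta schedule below uses common DDPM defaults, which are assumptions rather than the paper's reported values:

```python
import torch

def make_schedule(T: int = 2000, beta_1: float = 1e-4, beta_T: float = 0.02):
    """Linear beta schedule; returns the cumulative product alpha_bar_t.
    Defaults are common DDPM choices, not necessarily the paper's."""
    betas = torch.linspace(beta_1, beta_T, T)
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor,
             noise=None) -> torch.Tensor:
    """Closed-form forward sampling:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)  # broadcast per-sample timestep
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
```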
Key architectural elements include:
- Residual Dense Blocks: Enhanced feature propagation and improved gradient flow, facilitating recovery of decayed structural semantics.
- Dense Skip Connections: Inspired by U-Net++, enabling multi-level information sharing and preventing bottleneck artifacts.
- Multi-Resolution Attention: Explicit attention at 16×16 and 32×32 feature maps augments the network's capacity to reconcile large-scale color drift with local texture attenuation.
The model is conditioned on both the noisy latent and the color-compensated prior at every denoising step, refining the enhancement with spatially and temporally adaptive attention.
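The per-step conditioning can be sketched as standard ancestral DDPM sampling with the prior passed into the denoiser at every step. The `model(x_t, prior, t)` interface (predicting the added noise) and the fixed variance choice are hypothetical, shown only to illustrate the conditioning loop:

```python
import torch

@torch.no_grad()
def ddpm_reverse(model, prior, shape, betas):
    """Conditional ancestral sampling sketch. `model` is assumed to take
    (x_t, prior, t) and predict the added noise; this interface is an
    assumption, not the paper's exact API."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # start from the latent Gaussian
    for t in reversed(range(len(betas))):
        ts = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, prior, ts)  # prior conditions every denoising step
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        # Add noise (variance beta_t) except at the final step
        x = mean if t == 0 else mean + betas[t].sqrt() * torch.randn(shape)
    return x
```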
Cross-Domain Consistency Loss
AquaDiff introduces a cross-domain consistency loss (CDCL) that enforces fidelity in pixel, perceptual, structural, and frequency domains. The loss comprises:
- ℓ1 Pixel and Multi-Scale Losses: Enforce accurate local- and global-level reconstruction.
- VGG-19 Deep Perceptual Loss: Encourages restoration of high-level semantic consistency.
- SSIM Loss: Preserves luminance, contrast, and structural similarity, mitigating geometric artifacts.
- Frequency-Domain Loss: Enforces recovery of high-frequency components (e.g., edges, textures) typically attenuated under scattering.
The hybrid CDC loss constrains the generative process, suppressing diffusion artifacts and over-smoothing while incentivizing realistic color and detail recovery.
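A simplified sketch of the pixel- and frequency-domain terms is given below. The VGG-19 perceptual and SSIM terms are omitted to keep the example self-contained, and the weights are assumptions, not the paper's values:

```python
import torch
import torch.nn.functional as F

def cdc_loss(pred: torch.Tensor, target: torch.Tensor,
             w_pix: float = 1.0, w_freq: float = 0.1) -> torch.Tensor:
    """Simplified cross-domain consistency loss: pixel + frequency terms.

    The full CDCL also includes VGG-19 perceptual and SSIM terms;
    weights here are illustrative placeholders.
    """
    # Pixel domain: l1 reconstruction
    l_pix = F.l1_loss(pred, target)
    # Frequency domain: l1 on FFT magnitudes, emphasizing recovery of
    # high-frequency components (edges, textures) attenuated by scattering
    l_freq = F.l1_loss(torch.fft.rfft2(pred).abs(),
                       torch.fft.rfft2(target).abs())
    return w_pix * l_pix + w_freq * l_freq
```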
Experimental Evaluation
Datasets and Implementation
Training utilizes the LSUI (5,004 pairs) and UIEB (800 pairs) datasets, each comprising highly varied underwater imagery. Testing spans TEST-U90 (90 images), U45, S16, and C60 datasets, ensuring generalization across unseen environments and degradations. The model is implemented in PyTorch, uses 2000 diffusion steps, and is trained using Adam for 1 million iterations.
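A single training step under the standard DDPM noise-prediction objective might look like the following sketch; in the paper this would be combined with the cross-domain consistency loss, and the interface and hyperparameters here are placeholders:

```python
import torch

def train_step(model, optimizer, x0, prior, alpha_bar):
    """One denoising-score-matching step (standard DDPM objective).
    `model(x_t, prior, t)` predicting the added noise is an assumed
    interface; the paper additionally applies its CDC loss."""
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise  # closed-form forward
    pred = model(x_t, prior, t)
    loss = torch.nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```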
Quantitative Results
Evaluation considers both full-reference (PSNR, SSIM) and no-reference (UIQM, UCIQE) metrics. Comparative analysis involves state-of-the-art traditional, CNN-, GAN-, and diffusion-based methods.
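For reference, UCIQE (Yang and Sowmya, 2015) is a fixed linear combination of chroma standard deviation, luminance contrast, and mean saturation in CIELab. The sketch below is an independent re-implementation; normalization conventions vary between implementations, so absolute values may not match published tables:

```python
import numpy as np
from skimage import color

def uciqe(rgb: np.ndarray) -> float:
    """UCIQE sketch: 0.4680 * sigma_c + 0.2745 * con_l + 0.2576 * mu_s.

    sigma_c: std of chroma; con_l: gap between top/bottom 1% luminance;
    mu_s: mean saturation. Channel normalization here is one common
    convention, not necessarily the paper's.
    """
    lab = color.rgb2lab(rgb)
    L = lab[..., 0] / 100.0   # normalize luminance to [0, 1]
    a = lab[..., 1] / 128.0
    b = lab[..., 2] / 128.0
    chroma = np.hypot(a, b)
    sigma_c = chroma.std()
    con_l = np.percentile(L, 99) - np.percentile(L, 1)
    # Saturation per pixel, guarded against division by zero
    mu_s = np.mean(chroma / np.maximum(np.hypot(chroma, L), 1e-8))
    return float(0.4680 * sigma_c + 0.2745 * con_l + 0.2576 * mu_s)
```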
(Figure 2)
Figure 2: Quantitative results for UIQM and UCIQE across major benchmarks, showing top ranking for AquaDiff in chromatic fidelity and competitive restoration quality.
AquaDiff's improvements are particularly pronounced in UCIQE, reflecting its impact on color balance and perceived naturalness—critical for underwater visual tasks.
Qualitative Results
Qualitative analysis on U90, U45, S16, and C60 datasets reveals that AquaDiff:
- Consistently restores color balance in blue- and green-dominated scenes.
- Effectively removes haze and veiling light, and recovers structural details even in extreme turbidity.
- Suppresses artifacts common in GAN- and CNN-based outputs (halos, banding, over-enhancement).
(Figure 3)
Figure 3: Visual comparison on U90, demonstrating superior haze removal, artifact suppression, and color recovery by AquaDiff.
Additional results on U45 and S16 corroborate robust performance in scenarios with artificial lighting, strong scattering, and depth-induced chromatic distortion.
Ablation Studies
Systematic ablation reveals:
- Removing the cross-domain consistency loss results in significant degradation in both UIQM and UCIQE.
- Excluding enhanced U-Net blocks or multi-resolution attention consistently lowers structural and color restoration accuracy.
| Model Variant | UIQM | UCIQE |
| --- | --- | --- |
| Baseline Diffusion | 4.12 | 0.486 |
| + CDCL Only | 4.38 | 0.521 |
| + Enhanced U-Net Only | 4.45 | 0.528 |
| AquaDiff (Full) | 4.61 | 0.539 |

Table 1: Enhancement contributions of AquaDiff components.
Implications and Future Directions
AquaDiff’s strong performance on challenging, real-world datasets demonstrates the efficacy of integrating physical priors, cross-attention conditioning, and hybrid loss design into the generative diffusion paradigm for underwater image enhancement. The method establishes the value of explicit architectural and loss-driven biases against hydro-optical distortions and suggests broad potential for furthering physically-informed diffusion models in other low-level vision domains (e.g., dehazing, deblurring).
Practical implications are clear for real-time underwater robotics, object detection, SLAM, and 3D mapping—where enhanced input fidelity directly affects downstream algorithm robustness and reliability. The design of AquaDiff can inform future work in multi-modal diffusion conditioning, color/frequency domain regularization, and architecture adaptation for deployment efficiency (e.g., fast sampling, reduced resolutions).
Potential extensions include joint enhancement-task adaptation (e.g., simultaneous image enhancement and detection), self-supervised adaptation to unseen underwater environments, and multi-sensor fusion for domain transfer.
Conclusion
AquaDiff presents a technically rigorous, physically guided, and empirically validated framework for underwater image enhancement, demonstrating leading color fidelity and competitive overall quality. By leveraging cross-attention conditioning via chromatic priors, residual-attention U-Net backbones, and cross-domain consistency losses, it advances the application of diffusion models to complex, real-world vision enhancement tasks. The results further solidify the role of diffusion-based architectures in mission-critical underwater visual applications and provide a foundation for future research in this direction.