Swin-BathyUNet: Transformer U-Net for Bathymetry
- The paper introduces Swin-BathyUNet, a hybrid U-Net and Swin Transformer model integrating cross-attention to fuse RGB imagery with DSM data for precise bathymetry mapping.
- It employs a multi-scale architecture with embedded Swin Transformer blocks and a boundary-sensitive RMSE loss to enhance global feature integration and gap reconstruction.
- Empirical results on coastal datasets demonstrate significant improvements in mapping accuracy and coverage, reducing RMSE compared to traditional SfM-MVS and U-Net baselines.
Swin-BathyUNet is a deep learning architecture for remote sensing-based bathymetry mapping, specifically designed to improve the accuracy, detail, and spatial completeness of bathymetric maps derived from airborne or satellite optical imagery. Developed as a hybrid paradigm that combines a U-Net convolutional backbone with Swin Transformer self-attention and a cross-attention mechanism, Swin-BathyUNet enables the fusion of spectral cues from RGB imagery with three-dimensional priors from Structure-from-Motion Multi-View Stereo (SfM-MVS) derived Digital Surface Models (DSMs) that may contain significant data gaps. The model is tailored for Spectrally Derived Bathymetry (SDB) without requiring in-situ depth measurements, and delivers robust performance across structurally and visually diverse coastal environments (Agrafiotis et al., 15 Apr 2025).
1. Architectural Composition and Information Flow
Swin-BathyUNet employs a classical U-Net encoder–decoder topology as the foundational structure. The encoder incrementally reduces spatial resolution through max-pooling while expanding feature dimensionality, consisting of four stages with canonical 3×3 convolutional blocks (ReLU activation) and 2×2 max-pooling. Feature channel depth progresses as 64 → 128 → 256 → 512. This is followed by a ‘center’ module (two 3×3 convolutions expanding to 1024 channels) that acts as the bottleneck.
The decoder path symmetrically upsamples spatial resolution via nearest neighbor interpolation, with each stage concatenating the corresponding skip features before applying two 3×3 convolutions and halving the channel dimension (1024 → 512 → 256 → 128 → 64). The final segmentation head is a 1×1 convolution projecting 64 channels to a single-channel continuous depth map.
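The resolution/channel progression described above can be sketched with a short helper, assuming a 720×720 input patch as used in the paper's experiments (an illustrative sketch only; convolutional layer details are simplified):

```python
def encoder_shapes(h=720, w=720, stages=4, base_channels=64):
    """Return (height, width, channels) after each encoder stage plus the bottleneck."""
    shapes = []
    c = base_channels
    for _ in range(stages):
        shapes.append((h, w, c))   # two padded 3x3 convs keep spatial size
        h, w = h // 2, w // 2      # 2x2 max-pooling halves resolution
        c *= 2                     # channel depth doubles per stage
    shapes.append((h, w, c))       # 'center' bottleneck (1024 channels)
    return shapes

shapes = encoder_shapes()
# stages: (720,720,64) -> (360,360,128) -> (180,180,256) -> (90,90,512), bottleneck (45,45,1024)
```

The decoder mirrors this progression in reverse, halving channels while doubling resolution at each stage.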
Distinctively, three Swin Transformer blocks are inserted at progressively finer scales in the skip pathway: after encoder stages at H/16, H/8, and H/4 resolution, utilizing embedding dimensions D=512, 256, and 128 respectively. Outputs from these Swin blocks are spatially upsampled to match current decoder resolution and concatenated, thereby enriching local convolutional features with global and long-range dependencies (Agrafiotis et al., 15 Apr 2025).
The structure of a single Swin Transformer block is:
- Window-based Multi-Head Self-Attention (W-MSA) with window size 64×64, using 8 heads and per-head dimensionality d_k = D/8.
- Cross-Attention (CA) layer (detailed in Section 2).
- Two-layer MLP (hidden dimension 4D, ReLU nonlinearity).
- Layer normalization pre-activation and residual skip connections for each sub-layer.
Block computations, for input $z^{l-1}$, follow the pre-norm residual form:

$$\hat{z}^{\,l} = \text{W-MSA}\big(\text{LN}(z^{l-1})\big) + z^{l-1}$$
$$\tilde{z}^{\,l} = \text{CA}\big(\text{LN}(\hat{z}^{\,l})\big) + \hat{z}^{\,l}$$
$$z^{l} = \text{MLP}\big(\text{LN}(\tilde{z}^{\,l})\big) + \tilde{z}^{\,l}$$

with self-attention as

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
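The self-attention step can be sketched in NumPy for a single window and a single head (a toy illustration; the actual model uses 8 heads, learned projections, and 64×64 windows):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(z, wq, wk, wv):
    """Scaled dot-product self-attention over the tokens of one window.
    z: (n_tokens, d) features; wq/wk/wv: (d, d_k) projection matrices."""
    q, k, v = z @ wq, z @ wk, z @ wv
    d_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k))   # (n_tokens, n_tokens) attention map
    return attn @ v

rng = np.random.default_rng(0)
d, d_k, n = 16, 8, 4 * 4                     # toy 4x4 window flattened to 16 tokens
z = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d_k)) for _ in range(3))
out = window_self_attention(z, wq, wk, wv)   # shape (16, 8)
```

Restricting attention to fixed windows keeps the cost linear in image size while the multi-scale placement of the blocks supplies the long-range context.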
2. Cross-Attention Mechanism for Spectral–Depth Fusion
The Swin-BathyUNet introduces a hierarchical cross-attention (CA) mechanism to integrate low-level DSM-derived features with high-level RGB-derived features. The CA layer is embedded within each Swin Transformer block and enables the model to utilize geometric 3D cues to enhance interpretation of spectral content, crucial for filling DSM data gaps and reducing ambiguities in homogeneous-seeming seabed types.
The CA operates by using current main-stream feature representations as queries, while keys and values derive from auxiliary SfM-MVS DSM features:

$$Q = F_{\text{dec}}\, W_Q, \qquad K = F_{\text{enc}}\, W_K, \qquad V = F_{\text{enc}}\, W_V$$

where $F_{\text{dec}}$ and $F_{\text{enc}}$ are the feature tensors from the decoder and encoder, respectively, and $W_Q, W_K, W_V$ are learned projection matrices. Cross-attention is then computed identically to self-attention but facilitates explicit fusion between the two data modalities.
This design supports leveraging sparse, refraction-corrected DSM cues that may be incomplete or noisy due to textureless seabed regions, while prioritizing reliable spatial relationships through the attention mechanism (Agrafiotis et al., 15 Apr 2025).
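A minimal NumPy sketch of the cross-attention fusion, assuming the query stream and the DSM stream may contribute different numbers of valid tokens (e.g. after masking gap pixels); the function and variable names are illustrative, not the paper's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f_main, f_dsm, wq, wk, wv):
    """Queries from the main (RGB-driven) stream; keys/values from DSM features.
    f_main: (n, d); f_dsm: (m, d) -- m may differ from n when gaps are masked."""
    q = f_main @ wq
    k, v = f_dsm @ wk, f_dsm @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n, m) cross-modal weights
    return attn @ v                                  # fused features, (n, d_k)

rng = np.random.default_rng(1)
d, d_k = 16, 8
f_main = rng.standard_normal((32, d))   # main-stream tokens
f_dsm = rng.standard_normal((20, d))    # fewer valid DSM tokens (gaps removed)
wq, wk, wv = (rng.standard_normal((d, d_k)) for _ in range(3))
fused = cross_attention(f_main, f_dsm, wq, wk, wv)
```

Because the attention weights are computed per query, each main-stream location can borrow geometric evidence from whichever DSM regions are actually valid.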
3. Data Representation and Modal Input Processing
Swin-BathyUNet’s input tensor is a concatenation of co-registered multi-channel orthophotos (spectral RGB patches) and DSM patches from the SfM-MVS pipeline. RGB images are normalized to $[0, 1]$ (by dividing by 255), and DSM values are rescaled by dividing by the site-specific maximum depth (approximately 15.5 m for Agia Napa and 5.8 m for Puck Lagoon, consistent with the dataset depth ranges).
The DSM input frequently contains gaps due to the limitations of SfM-MVS and environmental conditions. During training, these gaps are masked; DSM information flows exclusively via the keys and values in cross-attention. The final output is a single-channel map at the same spatial resolution as input (720 × 720 pixels, ~0.25 m ground sampling distance), representing bathymetric depth (Agrafiotis et al., 15 Apr 2025).
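The normalization and gap handling can be sketched as follows; how gaps are encoded internally is not fully specified in this summary, so zero-filling plus a validity mask is one plausible choice, and `MAX_DEPTH` here is the Agia Napa example value:

```python
import numpy as np

MAX_DEPTH = 15.5  # site-specific maximum depth in metres (Agia Napa example)

def prepare_inputs(rgb_u8, dsm):
    """Normalize an RGB patch and a gappy DSM patch.
    rgb_u8: (H, W, 3) uint8; dsm: (H, W) float with NaN marking SfM-MVS gaps."""
    rgb = rgb_u8.astype(np.float32) / 255.0           # spectral channels -> [0, 1]
    valid = ~np.isnan(dsm)                            # mask of usable DSM pixels
    dsm_norm = np.where(valid, dsm / MAX_DEPTH, 0.0)  # rescale; zero-fill gaps
    return rgb, dsm_norm.astype(np.float32), valid

rgb_u8 = np.full((4, 4, 3), 127, dtype=np.uint8)
dsm = np.full((4, 4), -7.75)
dsm[0, 0] = np.nan                                    # simulate one gap pixel
rgb, dsm_norm, valid = prepare_inputs(rgb_u8, dsm)
```

The validity mask is what restricts DSM influence to the keys and values of the cross-attention layer during training.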
4. Objective Functions and Optimization Strategy
To address the challenge posed by missing DSM regions, Swin-BathyUNet incorporates a boundary-sensitive weighted RMSE (BSW-RMSE) loss. Pixelwise weights are assigned based on the (clipped) Euclidean distance from each pixel to the nearest DSM gap boundary:
$w_i = 1 - \frac{D_i^\text{clipped} - D_\min}{D_\max - D_\min}$
where $D_\min$ and $D_\max$ define the range for linear decay and $D_i^\text{clipped}$ is the distance of pixel $i$ clipped to $[D_\min, D_\max]$. The BSW-RMSE over the output depth map is:

$$\mathcal{L}_{\text{BSW-RMSE}} = \sqrt{\frac{\sum_i m_i\, w_i \left(\hat{d}_i - d_i\right)^2}{\sum_i m_i\, w_i}}$$

where $\hat{d}_i$ and $d_i$ denote the predicted and reference depths at pixel $i$, and $m_i$ is a binary mask for permanently missing DSM areas excluded from the training loss (Agrafiotis et al., 15 Apr 2025).
Alternative ablation configurations employ standard RMSE (constant $w_i = 1$) or exponential decay of the weights. The boundary-sensitive loss was found critical for guiding the model to reliable generalization in gap regions, without explicit L1 or spatial smoothness regularization.
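The weighting and loss can be sketched directly from the stated formula; the decay range values `d_min`/`d_max` below are illustrative assumptions, and the per-pixel boundary distances are taken as a precomputed input:

```python
import numpy as np

def boundary_weights(dist_to_gap, d_min=2.0, d_max=32.0):
    """Linear-decay weights w_i = 1 - (D_clipped - D_min)/(D_max - D_min).
    dist_to_gap: per-pixel distance (pixels) to the nearest DSM gap boundary.
    Weight is 1 at the boundary and decays to 0 at d_max and beyond."""
    d = np.clip(dist_to_gap, d_min, d_max)
    return 1.0 - (d - d_min) / (d_max - d_min)

def bsw_rmse(pred, target, weights, valid_mask):
    """Boundary-sensitive weighted RMSE over pixels with usable reference depth."""
    w = weights * valid_mask
    return float(np.sqrt(np.sum(w * (pred - target) ** 2) / np.sum(w)))

w = boundary_weights(np.array([0.0, 2.0, 17.0, 40.0]))   # -> [1.0, 1.0, 0.5, 0.0]
pred = np.array([1.0, 2.0, 3.0, 9.0])
target = np.array([0.0, 2.0, 3.0, 9.0])
loss = bsw_rmse(pred, target, w, valid_mask=np.ones(4))
```

Note how the last pixel, far from any gap, receives zero weight under the linear decay, concentrating the loss signal near gap boundaries where reconstruction is hardest.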
Training employs the Adam optimizer with a cosine-annealing learning-rate schedule, running 30 epochs for Agia Napa and 60 epochs for Puck Lagoon. Data augmentation includes randomized 90° rotations, horizontal/vertical flips, and sun-glint removal. Dropout is set to 0.1 throughout (Agrafiotis et al., 15 Apr 2025).
5. Empirical Evaluation and Quantitative Outcomes
Swin-BathyUNet was evaluated on two diverse coastal datasets:
- Agia Napa (Mediterranean): 21 patches (720×720 px each, covering 0.7 km², depths to –15.5 m) with Leica HawkEye III LiDAR ground truth.
- Puck Lagoon (Baltic): 2019 patches (65.4 km², up to –5.8 m depth) with dual LiDAR and MBES ground truth.
Key quantitative findings:
- Raw SfM-MVS yields RMSE ≈ 1.96 m (Agia Napa, An) and 1.01 m (Puck Lagoon, PL); applying the SVR refraction correction alone improves this to 0.38 m (An) and 0.13 m (PL).
- Swin-BathyUNet (BSW loss) further reduces RMSE to 0.49 m (An) and 0.16 m (PL), improving over a strong U-Net baseline by ~27%.
- Coverage gain over SfM-MVS: +43.5% (An), +12.7% (PL), yielding full map coverage.
- Removing cross-attention increases RMSE by ~7%, and reducing Swin block depth from three to one degrades RMSE by ~6%. Decreasing window size or number of attention heads similarly harms performance.
Qualitatively, Swin-BathyUNet repairs bathymetric gaps, suppresses noise, and reconstructs fine seabed features at 0.25 m pixel size, exceeding the spatial resolution of the ground-truth acquisition modalities (Agrafiotis et al., 15 Apr 2025).
6. Significance and Distinctions from Prior Architectures
Swin-BathyUNet represents a novel synthesis of convolutional and transformer-based paradigms for remote sensing-based depth reconstruction. Unlike fully convolutional baselines, which are challenged by the locality of convolution and limited context for gap filling, or pure transformer architectures such as Swin-Unet (which use hierarchical Swin blocks in both encoder and decoder) (Cao et al., 2021), Swin-BathyUNet preserves efficient U-Net local encoding/decoding while spatially enriching skip features with self- and cross-attention-driven global context.
The introduction of cross-attention enables effective exploitation of irregular, incomplete three-dimensional priors, a setting typical for SDB with DSMs, and shows consistent quantitative and qualitative superiority over approaches that either neglect spatial gaps or require intensive field calibration or manual gap filling. Ablation results underscore the necessity of both the cross-attention design and multi-scale Swin block integration for optimal performance (Agrafiotis et al., 15 Apr 2025).
7. Application Scope, Limitations, and Prospects
Swin-BathyUNet is not tied to its demonstrated domain (coastal bathymetry): its cross-modal attention mechanism and boundary-sensitive loss do not rely on fixed input modalities. While benchmarks are limited to two representative but diverse coastal regions, the approach generalizes wherever fused spectral–geometric information is available, provided sufficient training data and careful normalization. The method alleviates the dependency on in-situ depths that characterizes many SDB pipelines and achieves high-resolution, noise-suppressed predictions suitable for operational seabed monitoring.
This suggests future research may explore its applicability to other modalities and tasks suffering from spatial incompleteness or data gaps, as well as scaling to larger or higher-resolution imagery through optimized Swin attention strategies. Model code and pretrained weights are available publicly (Agrafiotis et al., 15 Apr 2025).