SAR-W-MixMAE Pretraining

Updated 29 January 2026
  • The paper introduces a backscatter-weighted masked autoencoder that minimizes speckle noise by emphasizing low-backscatter regions for improved SAR representation.
  • The method adapts the MixMAE framework with a mixing and masking strategy and a physically-informed exponential weighting based on normalized backscatter power.
  • Results show enhanced performance in multi-label classification and flood detection, with significant gains in F1 score and recall over standard approaches.

SAR-W-MixMAE self-supervised pretraining is a methodology for learning foundation models on Synthetic Aperture Radar (SAR) imagery by incorporating a physically-informed, backscatter-weighted loss into the masked auto-encoding paradigm. Building upon the MixMAE architecture, SAR-W-MixMAE specifically tailors the reconstruction loss to emphasize low-backscatter regions, such as smooth water surfaces, thereby mitigating the impact of speckle noise and extreme backscatter values prevalent in SAR data. This design addresses inherent differences between SAR and optical imaging and advances the effectiveness of foundation model training for SAR-specific downstream tasks (Caglayan et al., 3 Mar 2025).

1. Background and Motivation

Self-supervised pretraining with masked auto-encoders (MAEs) has demonstrated success in natural (optical) images, but its application to SAR is complicated by several factors: significant speckle noise, non-Gaussian intensity distributions, and the scarcity of annotated SAR datasets. Earlier work for SAR (e.g., SAR-MAE (Pu et al., 20 Jan 2025)) adapted the vanilla MAE—masking image patches and learning to reconstruct them—but used conventional pixel-wise mean squared error (MSE) losses, failing to distinguish between high-variance (speckle-dominated) and low-signal (smooth) regions.

SAR-W-MixMAE introduces weighting into the loss function based on per-pixel backscatter power, reducing the influence of unreliable, noisy regions. This approach is motivated by the observation that speckle in SAR is multiplicative and its variance scales with mean backscatter, making naïve loss functions suboptimal for representation learning (Caglayan et al., 3 Mar 2025).
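The scaling of speckle variance with mean backscatter can be checked numerically. This is an illustrative sketch, not from the paper: it assumes the standard multiplicative model for L-look intensity SAR, where the observation is $\sigma^0 \cdot s$ with $s \sim \Gamma(L, 1/L)$, so the variance grows as $(\sigma^0)^2 / L$.

```python
import numpy as np

# Multiplicative speckle model (assumed for illustration): the observed
# intensity is sigma0 * s, where s is unit-mean Gamma noise. For L-look
# intensity SAR, s ~ Gamma(shape=L, scale=1/L), so the observation's
# variance is sigma0**2 / L -- noise variance grows with mean backscatter.
rng = np.random.default_rng(0)
L = 4  # number of looks (assumed value for illustration)

for sigma0 in [0.01, 0.1, 1.0]:  # low vs high mean backscatter
    s = rng.gamma(shape=L, scale=1.0 / L, size=100_000)
    obs = sigma0 * s
    # Empirical variance tracks sigma0**2 / L: bright regions are noisiest.
    print(sigma0, obs.var())
```

Bright (high-backscatter) regions thus carry the least reliable pixel values, which is exactly what the weighting scheme in Section 3 down-weights.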

2. Model Architecture and Training Methodology

SAR-W-MixMAE employs a “mixed-and-masked” auto-encoder structure, specifically extending the MixMAE technique to two-channel SAR data (VH and VV polarizations in decibels). Key architectural components are:

  • Encoder: Hierarchical Swin Transformer (“SwinB”) with four stages (channel widths: [128, 256, 512, 1024]; attention heads: [4, 8, 16, 32]; blocks per stage: [2, 2, 18, 2]; window sizes adapted to (8, 8, 8, 4)) to process $128 \times 128$ SAR chips.
  • Mixing and Masking Strategy: For each training step, two SAR images $x_1$ and $x_2$ are sampled. A binary mask $m$ over the $8 \times 8 = 64$ patches (each $16 \times 16$ pixels) randomly selects three-quarters of the patches for mixing. Masked patches in $x_1$ are replaced by the corresponding unmasked patches from $x_2$, and vice versa.
  • Decoder: Lightweight vanilla-transformer decoder reconstructs the two original, unmixed images from the encoded mixed representation.

The network is trained to “unmix” the blended patches and reconstruct both input images, aligning with MixMAE’s methodology while adapting it to SAR data characteristics.
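The mixing step can be sketched in a few lines of NumPy. Shapes follow the text ($128 \times 128$ two-channel chips, $16 \times 16$ patches, $8 \times 8 = 64$ patches, three-quarters mixed); variable names are illustrative and not from the paper's code.

```python
import numpy as np

# Sketch of MixMAE-style patch mixing for two two-channel SAR chips.
rng = np.random.default_rng(0)
C, H, W, P = 2, 128, 128, 16            # channels, height, width, patch size
x1 = rng.standard_normal((C, H, W))     # first SAR chip (VH, VV)
x2 = rng.standard_normal((C, H, W))     # second SAR chip

n_side = H // P                         # 8 patches per side -> 64 patches
mask = np.zeros((n_side, n_side), dtype=bool)
idx = rng.choice(n_side * n_side, size=(n_side * n_side) * 3 // 4, replace=False)
mask.flat[idx] = True                   # True -> take this patch from x2

mixed = x1.copy()
for i in range(n_side):
    for j in range(n_side):
        if mask[i, j]:
            mixed[:, i*P:(i+1)*P, j*P:(j+1)*P] = x2[:, i*P:(i+1)*P, j*P:(j+1)*P]
# The encoder sees `mixed`; the decoder must recover both x1 and x2.
```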

3. Backscatter Power Weighting and Loss Function

The distinguishing innovation of SAR-W-MixMAE lies in its intensity-weighted patchwise loss. The steps are as follows:

  1. Linear Power Computation: For each pixel and polarization $p \in \{\mathrm{VH}, \mathrm{VV}\}$,

$$\sigma^0_p(\mathrm{linear}) = 10^{\sigma^0_p(\mathrm{dB}) / 10}$$

  2. Per-pixel Average and Normalization: Compute the mean linear backscatter, min–max-normalized to $[0, 1]$:

$$\mathrm{norm}(\sigma^0_{\mathrm{avg}}) = \frac{\sigma^0_{\mathrm{avg}} - \min}{\max - \min}$$

where $\sigma^0_{\mathrm{avg}} = \tfrac{1}{2}(\sigma^0_{\mathrm{VH}} + \sigma^0_{\mathrm{VV}})$.

  3. Exponential Weight Map: For each pixel,

$$W_{\mathrm{SAR}} = \exp\left(1.0 - \mathrm{norm}(\sigma^0_{\mathrm{avg}})\right)$$

This assigns lower weight to high-backscatter, speckle-dominated pixels and higher weight to low-backscatter ones.

  4. Weighted Loss: For $N$ patches per image,

$$L_{\text{SAR-W-MixMAE}} = \frac{1}{N} \sum_{n=1}^{N} W_{\mathrm{SAR}}^{n}\left[ (\hat{t}_1^{\,n} - t_1^{\,n})^2 (1 - m_n) + (\hat{t}_2^{\,n} - t_2^{\,n})^2\, m_n \right]$$

where $m_n \in \{0, 1\}$ indicates which source's patch was visible at position $n$.

This physically-motivated weighting scheme prioritizes reconstruction accuracy for regions where SAR signals are more reliable and relevant for downstream geophysical inference, particularly flood mapping and surface condition analysis (Caglayan et al., 3 Mar 2025).
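The four steps above can be sketched in NumPy. This is illustrative only: the reduction of the per-pixel weight map to one weight per patch, and the choice of which image's weights apply at each patch, are assumptions not pinned down in the text.

```python
import numpy as np

def sar_weight(db):
    """Steps 1-3: exponential weight map from a (2, H, W) dB-scaled chip."""
    lin = 10.0 ** (db / 10.0)                     # step 1: dB -> linear power
    avg = 0.5 * (lin[0] + lin[1])                 # step 2: per-pixel VH/VV mean
    norm = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)  # to [0, 1]
    return np.exp(1.0 - norm)                     # step 3: favors low backscatter

def weighted_mix_loss(w1, w2, t1_hat, t1, t2_hat, t2, m, P=16):
    """Step 4: patchwise weighted MSE over the N = n*n patches.

    w1/w2: per-pixel weight maps for each source image; m: (n, n) boolean
    mask, True where the visible patch came from image 2. Taking the mean
    weight over each patch is an assumption; the text defines W_SAR per pixel.
    """
    n = t1.shape[-1] // P
    total = 0.0
    for i in range(n):
        for j in range(n):
            rows, cols = slice(i * P, (i + 1) * P), slice(j * P, (j + 1) * P)
            e1 = ((t1_hat[:, rows, cols] - t1[:, rows, cols]) ** 2).mean()
            e2 = ((t2_hat[:, rows, cols] - t2[:, rows, cols]) ** 2).mean()
            if m[i, j]:
                total += w2[rows, cols].mean() * e2   # the m_n = 1 term
            else:
                total += w1[rows, cols].mean() * e1   # the (1 - m_n) term
    return total / (n * n)
```

Because the weight lies in $[1, e]$ by construction, low-backscatter patches contribute up to $e$ times more to the loss than the brightest ones.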

4. Pretraining Protocol and Data

SAR-W-MixMAE was pretrained on SAR data from BigEarthNet v1.0, comprising 590,326 patch pairs (Sentinel-1 SAR and optical), using only the SAR channels. The SAR input consists of two bands (VH, VV) in dB; prior to training, bands are normalized per patch (zero mean, unit variance), and backscatter power is converted to linear units for loss weighting.

  • Patch/Chip Size: $16 \times 16$ patches from $128 \times 128$-pixel images.
  • Augmentation: None beyond MixMAE’s random image pairing, patch mixing, and per-band normalization.
  • Training Schedule: 64 epochs (cyclic scheduler decaying to zero), 40-epoch warm-up, AdamW optimizer (lr $1 \times 10^{-3}$), default MixMAE batch size and weight decay.

Longer pretraining did not yield further downstream gains, indicating diminished returns for extended epochs in this regime.
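The stated schedule can be written as a small helper. The numbers (base lr $10^{-3}$, 40-epoch warm-up, 64 total epochs, decay to zero) come from the text; the linear warm-up and cosine decay shapes are assumptions, since only "cyclic scheduler to zero" is specified.

```python
import math

def lr_at(epoch, base_lr=1e-3, warmup=40, total=64):
    """Learning rate at a given epoch: linear warm-up, then cosine to zero.
    Warm-up/decay shapes are assumed; only the endpoints are from the text."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup       # linear ramp to base_lr
    t = (epoch - warmup) / (total - warmup)          # fraction of decay phase
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```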

5. Downstream Applications and Results

The pretrained encoder was evaluated through two downstream tasks:

(a) Multi-label Classification (BigEarthNet v1.0)

A frozen SwinB encoder was compared to random initialization and to MixMAE pretraining:

| Pretraining | Macro-AP | Macro-F1 |
|---|---|---|
| Random Init (SwinB) | 0.6107 | 0.4936 |
| MixMAE | 0.7044 | 0.6010 |
| SAR-W-MixMAE | 0.7088 | 0.6068 |

(b) Flood Detection (SEN12-FLOOD)

Binary flood/no-flood classification (with feature differencing of pre- and post-flood patches):

| Pretraining | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Init | 0.8074 | 0.7647 | 0.6903 | 0.7256 |
| MixMAE baseline | 0.8468 | 0.7279 | 0.8761 | 0.7952 |
| SAR-W-MixMAE | 0.8667 | 0.7727 | 0.9027 | 0.8327 |

SAR-W-MixMAE exceeded the MixMAE baseline by +3.75 F1 points and improved recall for flooded classes to 0.9027, validating the benefit of emphasizing low-backscatter, smooth areas for water-related change detection.
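The feature-differencing setup can be sketched as follows. This is illustrative only: `toy_encode` is a placeholder for the frozen pretrained SwinB encoder, and the exact head architecture is not specified in the text beyond differencing pre- and post-flood features.

```python
import numpy as np

def toy_encode(chip):
    """Stand-in for the frozen encoder: (2, H, W) chip -> feature vector.
    Here just a per-channel global mean; real features come from SwinB."""
    return chip.reshape(chip.shape[0], -1).mean(axis=1)

def change_feature(pre_chip, post_chip):
    """Feature differencing: the binary classifier head sees post - pre."""
    return toy_encode(post_chip) - toy_encode(pre_chip)

# Toy scene: new open water darkens VH backscatter after the flood event.
pre = np.zeros((2, 128, 128))
post = np.zeros((2, 128, 128))
post[0, 40:80, 40:80] = -5.0   # darker VH patch (assumed flooded area)
f = change_feature(pre, post)  # negative VH component signals the change
```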

6. Rationale, Empirical Assessment, and Limitations

The rationale for intensity weighting stems from SAR’s multiplicative speckle noise, whose variance increases with signal, making high-backscatter regions poor anchors for pretraining. By down-weighting these regions in the loss, training emphasizes the reliable recovery of smooth, low-signal areas, critical for environmental monitoring.

Empirical results showed that increasing pretraining epochs beyond 64 did not materially improve performance. The mask ratio and the specific form of the weighting were not exhaustively optimized; the exponential weighting on normalized linear power provided a favorable balance between robustness to speckle and sensitivity to low-backscatter signal. Exploration of alternative or adaptive weighting strategies was acknowledged as future work (Caglayan et al., 3 Mar 2025).

A plausible implication is that this approach can be adapted to other SAR modalities (e.g., fully polarimetric or interferometric data) or other geophysical change detection settings.

7. Significance and Relation to Other SAR Foundation Models

SAR-W-MixMAE advances SAR foundation model training beyond earlier pixel-loss MAEs by directly incorporating physical statistics of SAR backscatter into the learning objective. While prior works such as "Enhancing SAR Object Detection with Self-Supervised Pre-training on Masked Auto-Encoders" (Pu et al., 20 Jan 2025) demonstrated that domain-matched pretraining yields superior representation learning (e.g., +1.3 mAP versus ImageNet pretraining for object detection), SAR-W-MixMAE further adapts the training process to the peculiar noise properties and semantic cues in SAR data.

In summary, SAR-W-MixMAE’s weighted masking and mixing strategy, physically-grounded loss, and demonstrated gains on both multiclass and event-detection downstream benchmarks provide a reproducible foundation for robust self-supervised representation learning in SAR remote sensing.
