Semi-supervised Multiscale Matching for SAR-Optical Image

Published 11 Aug 2025 in cs.CV | (2508.07812v1)

Abstract: Driven by the complementary nature of optical and synthetic aperture radar (SAR) images, SAR-optical image matching has garnered significant interest. Most existing SAR-optical image matching methods aim to capture effective matching features by employing the supervision of pixel-level matched correspondences within SAR-optical image pairs, which, however, suffers from time-consuming and complex manual annotation, making it difficult to collect sufficient labeled SAR-optical image pairs. To handle this, we design a semi-supervised SAR-optical image matching pipeline that leverages both scarce labeled and abundant unlabeled image pairs and propose a semi-supervised multiscale matching for SAR-optical image matching (S2M2-SAR). Specifically, we pseudo-label those unlabeled SAR-optical image pairs with pseudo ground-truth similarity heatmaps by combining both deep and shallow level matching results, and train the matching model by employing labeled and pseudo-labeled similarity heatmaps. In addition, we introduce a cross-modal feature enhancement module trained using a cross-modality mutual independence loss, which requires no ground-truth labels. This unsupervised objective promotes the separation of modality-shared and modality-specific features by encouraging statistical independence between them, enabling effective feature disentanglement across optical and SAR modalities. To evaluate the effectiveness of S2M2-SAR, we compare it with existing competitors on benchmark datasets. Experimental results demonstrate that S2M2-SAR not only surpasses existing semi-supervised methods but also achieves performance competitive with fully supervised SOTA methods, demonstrating its efficiency and practical potential.


Summary

The paper introduces S2M2-SAR, a novel semi-supervised method that fuses deep semantic and shallow spatial features for improved SAR-optical alignment.
It employs an attention-based cross-modal feature enhancement module with pseudo-labeling to reduce labeling costs and boost matching accuracy.
Experimental validation on the SEN1-2 and QXS-SAROPT datasets shows that, under minimal supervision, the method surpasses semi-supervised baselines and is competitive with fully supervised methods in matching rate and RMSE.

Semi-supervised Multiscale Matching for SAR-Optical Image: A Comprehensive Analysis

Problem Context and Challenges

Matching synthetic aperture radar (SAR) images with optical imagery is fundamental for multimodal remote sensing tasks including data fusion, geo-localization, and environmental monitoring. The intrinsic discrepancy in appearance, radiometry, and geometry, which stems from the distinct imaging mechanisms of SAR and optical sensors, renders traditional mono-modal registration approaches ineffective. While state-of-the-art deep learning methods improve matching performance by extracting modality-invariant features, their success relies heavily on extensive labeled correspondences, which are labor-intensive to annotate and infeasible to acquire at scale for diverse sensor types and scenarios.

Overview of S$^2$M$^2$-SAR

The paper introduces S$^2$M$^2$-SAR, a semi-supervised multiscale matching architecture. The method is designed to minimize supervision requirements by leveraging both scarce labeled and abundant unlabeled SAR-optical image pairs, while making matching more robust through multi-level feature integration. The framework integrates a Siamese ResNetFPN backbone, cross-modal feature enhancement, and a multiscale pseudo-labeling pipeline.

Key methodological innovations include:

Multiscale Matching: Jointly exploits deep-level (robust, semantic, low-resolution) and shallow-level (fine, spatial, high-resolution) features to generate similarity heatmaps, fusing their complementary strengths for pseudo-label construction (see Figure 1).
Figure 1: Combining deep-level (robust but low-resolution) and shallow-level (high-resolution but prone to mismatches) similarity heatmaps.
Cross-modal Feature Enhancement Module: Employs a stackable block with self-attention for suppressing modality-specific information and cross-modality linear attention for enhancing shared features. It is regularized via a mutual independence loss, promoting disentanglement between shared and specific features without requiring ground-truth supervision.
Semi-supervised Learning with Pseudo-labels: For unlabeled data, pseudo ground-truth heatmaps are generated by merging deep- and shallow-level similarity maps. Model optimization uses a hybrid of supervised (cross-entropy, mutual independence) and unsupervised (pseudo-label cross-entropy, mutual independence) losses, thus maximizing the benefit of unlabeled data and the resilience to annotation errors.
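
The hybrid objective described in the last item can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the weights `lambda_u` and `lambda_mi`, the one-hot encoding of labeled targets, and the softmax over heatmap positions are all assumptions introduced here. The mutual independence terms are passed in as precomputed numbers.

```python
import numpy as np

def heatmap_cross_entropy(logits, target):
    """Cross-entropy between a softmax over all heatmap positions and a
    target distribution of the same shape (target sums to 1)."""
    z = logits - logits.max()               # stabilise the softmax
    log_p = z - np.log(np.exp(z).sum())     # log-softmax over every position
    return float(-(target * log_p).sum())

def one_hot_heatmap(shape, peak):
    """Ground-truth heatmap with all mass at the annotated offset (assumed encoding)."""
    target = np.zeros(shape)
    target[peak] = 1.0
    return target

def hybrid_loss(labeled_logits, gt_peak, unlabeled_logits, pseudo_heatmap,
                mi_labeled, mi_unlabeled, lambda_u=1.0, lambda_mi=0.1):
    """Supervised term (cross-entropy + independence) plus a weighted
    unsupervised term (pseudo-label cross-entropy + independence).
    lambda_u and lambda_mi are hypothetical weights, not values from the paper."""
    supervised = heatmap_cross_entropy(
        labeled_logits, one_hot_heatmap(labeled_logits.shape, gt_peak)
    ) + lambda_mi * mi_labeled
    unsupervised = heatmap_cross_entropy(
        unlabeled_logits, pseudo_heatmap
    ) + lambda_mi * mi_unlabeled
    return supervised + lambda_u * unsupervised
```

In this sketch the only difference between the two branches is where the target heatmap comes from: an annotation for labeled pairs, the fused deep and shallow prediction for unlabeled pairs.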

Methodological Details

The Siamese ResNetFPN backbone extracts hierarchical features at both the original and subsampled resolutions. By omitting initial downsampling and introducing lateral pyramid connections, the system ensures shallow-level features maintain pixelwise accuracy, while deep features encode semantic context and spatial robustness.

Matching is performed via FFT-accelerated normalized cross-correlation (NCC), producing two heatmaps per image pair: $M^d$ from deep feature maps and $M^s$ from shallow feature maps. For inference and pseudo-label generation, the upsampled deep heatmap is multiplied elementwise with the shallow heatmap and the product is normalized, which enforces consistency across scales.
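
The two operations in this paragraph can be sketched in NumPy. This is a single-channel illustration under stated assumptions: `ncc_fft` and `fuse_heatmaps` are hypothetical names, the upsampling is nearest-neighbour, and the fused map is normalized to sum to one. The paper's choices for upsampling and normalization are not given here and may differ.

```python
import numpy as np

def _xcorr_fft(search, kernel):
    """Valid-mode cross-correlation of `search` with `kernel`, computed via FFT."""
    H, W = search.shape
    h, w = kernel.shape
    spectrum = np.fft.rfft2(search) * np.conj(np.fft.rfft2(kernel, s=(H, W)))
    circular = np.fft.irfft2(spectrum, s=(H, W))
    return circular[: H - h + 1, : W - w + 1]   # offsets without wrap-around

def ncc_fft(search, template, eps=1e-8):
    """Zero-mean normalized cross-correlation of `template` over `search`."""
    h, w = template.shape
    t0 = template - template.mean()
    ones = np.ones_like(template)
    numerator = _xcorr_fft(search, t0)           # sum of window * zero-mean template
    win_sum = _xcorr_fft(search, ones)           # per-window sum
    win_sq = _xcorr_fft(search ** 2, ones)       # per-window sum of squares
    win_var = np.maximum(win_sq - win_sum ** 2 / (h * w), 0.0)
    return numerator / (np.sqrt(win_var * np.sum(t0 ** 2)) + eps)

def fuse_heatmaps(deep, shallow):
    """Upsample the deep heatmap (nearest neighbour), multiply with the
    shallow heatmap, and normalize to sum to 1. Inputs must be non-negative
    and the shallow shape an integer multiple of the deep shape."""
    sy = shallow.shape[0] // deep.shape[0]
    sx = shallow.shape[1] // deep.shape[1]
    assert (deep.shape[0] * sy, deep.shape[1] * sx) == shallow.shape
    deep_up = np.kron(deep, np.ones((sy, sx)))
    fused = deep_up * shallow
    return fused / fused.sum()
```

The multiplication acts as a gate: a shallow-level peak survives only where the coarse deep-level map also assigns high similarity, which is how the robust low-resolution result suppresses high-resolution mismatches.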

The attention-based feature enhancement block consists of sequential self- and cross-attention layers with linear complexity, each followed by multiscale convolution, forming an efficient architecture for representation disentanglement.
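
To show why such a layer scales linearly with the number of tokens, here is a minimal sketch of kernelized cross-attention. The `elu(x) + 1` feature map and the function names are assumptions taken from common linear-attention formulations, not details confirmed by the paper. Projections, multiple heads, and the multiscale convolution are omitted.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive feature map often used in linear attention (assumed)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_cross_attention(q_src, k_tgt, v_tgt, eps=1e-6):
    """Queries from one modality attend to keys/values from the other.

    q_src: (N, d) query tokens, k_tgt: (M, d) key tokens, v_tgt: (M, dv) values.
    Cost is O((N + M) * d * dv) because the (N, M) attention matrix is never formed.
    """
    q = feature_map(q_src)
    k = feature_map(k_tgt)
    kv = k.T @ v_tgt                 # (d, dv) summary of all key/value pairs
    norm = q @ k.sum(axis=0)         # (N,) per-query normalizer
    return (q @ kv) / (norm[:, None] + eps)
```

Reordering the matrix products so that keys and values are aggregated first is what removes the quadratic term. The result equals row-normalized attention with the kernel `feature_map(q) @ feature_map(k).T`, up to the small `eps`.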

Mutual information minimization between the extracted shared and modality-specific feature spaces serves as an unsupervised regularizer. This theoretical formulation is explicitly realized via the calculation of discrete mutual information on flattened feature maps, summed across SAR and optical modalities.
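
A histogram-based version of this quantity can be sketched as below. It is an illustration only: the bin count, the function names, and the use of `np.histogram2d` are assumptions. A hard histogram is also not differentiable, so a training loss would need a differentiable estimator, which this sketch does not attempt to reproduce.

```python
import numpy as np

def discrete_mutual_information(a, b, bins=16):
    """Mutual information (in nats) between two feature maps, estimated
    from a joint histogram of their flattened values."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)    # marginal over b, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)    # marginal over a, shape (1, bins)
    nonzero = p_ab > 0                       # skip empty cells (0 * log 0 = 0)
    ratio = p_ab[nonzero] / (p_a @ p_b)[nonzero]
    return float(np.sum(p_ab[nonzero] * np.log(ratio)))

def mutual_independence_loss(shared_sar, specific_sar,
                             shared_opt, specific_opt, bins=16):
    """Sum of shared-vs-specific mutual information over both modalities."""
    return (discrete_mutual_information(shared_sar, specific_sar, bins)
            + discrete_mutual_information(shared_opt, specific_opt, bins))
```

Driving this sum toward zero pushes the shared and modality-specific features toward statistical independence, which is the disentanglement effect the paragraph describes.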

Experimental Validation

The proposed method is benchmarked on the large-scale SEN1-2 dataset and the QXS-SAROPT dataset, using a minimal supervision regime (6.25% labeled data). Quantitative evaluation on correct matching rate (CMR) at thresholds T=1 and T=5, RMSE, and inference time demonstrates:

Higher correct matching rates and lower RMSEs than semi-supervised baselines, and performance competitive with fully supervised (100% labeled) methods, despite significantly reduced supervision.
Robustness to data scarcity, with high performance maintained as the proportion of labeled data decreases.
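
For reference, the two accuracy metrics can be computed as in the sketch below. The definitions are assumptions based on common usage in SAR-optical matching: the error is the Euclidean distance in pixels between predicted and ground-truth positions, CMR is the fraction of pairs with error at most T, and RMSE is taken over all pairs. The paper's exact conventions may differ, for example by restricting RMSE to correct matches.

```python
import numpy as np

def matching_errors(pred_xy, true_xy):
    """Euclidean distance (pixels) between predicted and true match positions."""
    pred = np.asarray(pred_xy, dtype=float)
    true = np.asarray(true_xy, dtype=float)
    return np.linalg.norm(pred - true, axis=1)

def cmr(pred_xy, true_xy, threshold):
    """Correct matching rate: share of pairs with error <= threshold (assumed definition)."""
    return float(np.mean(matching_errors(pred_xy, true_xy) <= threshold))

def rmse(pred_xy, true_xy):
    """Root-mean-square localization error over all pairs (assumed definition)."""
    return float(np.sqrt(np.mean(matching_errors(pred_xy, true_xy) ** 2)))
```

Calling `cmr(pred, true, 1)` and `cmr(pred, true, 5)` on the same predictions corresponds to the T=1 and T=5 settings, where the smaller threshold rewards only near-exact localization.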

Comprehensive ablation experiments reveal:

Cross-modal feature enhancement, even with a single block, significantly boosts matching performance.
Combining multiscale matching with semi-supervised losses stabilizes the system and enhances pseudo-label quality.
The number of enhancement blocks trades off between accuracy (particularly high-precision, low-threshold matching) and computational cost.

Analysis of Pseudo-labels and Label Scarcity

The pseudo-labels act as a denoising and regularization mechanism: during early training they yield lower RMSE and lower FMR than direct shallow-level matching. The gap narrows as the network converges, which supports their use for semi-supervised SAR-optical correspondence learning.

Experiments varying the labeled-unlabeled batch ratio highlight the scalability and adaptability of the approach: even few labeled samples suffice for competitive performance, and accuracy improves monotonically with additional annotation.

Practical and Theoretical Implications

Practically, S$^2$M$^2$-SAR drastically reduces annotation dependence for multimodal matching, facilitating rapid scaling to new SAR-optical data sources and sensor domains. Its use of unsupervised regularization and pseudo-label bootstrapping makes it robust to noisy or biased ground-truth, addressing a critical bottleneck in real-world deployments.

Theoretically, the integration of cross-modal mutual independence loss aligns with modern understanding of disentangled representation learning and cross-domain correspondence, offering a general template for semi-supervised multimodal learning. The architecture's emphasis on linear attention mechanisms also points toward computationally efficient scaling for dense remote sensing workloads.

The method's structure enables extensibility: more advanced backbones, alternative feature disentanglement losses, and self-supervised auxiliary objectives can be seamlessly integrated. This positions S$^2$M$^2$-SAR as a state-of-the-art paradigm for semi-supervised multimodal remote sensing correspondence, with implications for general cross-domain matching tasks in computer vision.

Conclusion

S$^2$M$^2$-SAR demonstrates that semi-supervised, multiscale feature integration, coupled with attention-based disentanglement and robust pseudo-labeling, can achieve performance competitive with or superior to fully supervised approaches for SAR-optical matching, even with minimal labeled data. The framework is computationally efficient, empirically sound, and theoretically grounded. Future research directions include fully unsupervised extensions, exploration of advanced feature backbones, and application of the architecture to other heterogeneous matching problems in vision, such as multimodal medical image registration and general remote sensing data fusion.