Multimodal Remote Sensing Datasets
- Multimodal remote sensing datasets are standardized archives combining diverse sensor modalities (e.g., optical, SAR, LiDAR) to evaluate ML models reproducibly.
- They incorporate co-registered imagery, non-imagery data, and detailed annotations to support tasks like classification, detection, segmentation, and retrieval.
- These benchmarks enable advanced fusion techniques and cross-modal reasoning, facilitating applications in disaster monitoring, climate analysis, and geolocation.
Multimodal remote sensing benchmark datasets comprise standardized public archives spanning two or more distinct sensor modalities (e.g., optical, SAR, multispectral, hyperspectral, infrared, LiDAR, maps, or textual descriptions) that support rigorous, reproducible evaluation of machine learning models for remote sensing tasks. These resources have become foundational across land cover mapping, object detection, geocoding, cross-modal retrieval, generative modeling, and advanced multimodal LLM (MLLM) research. Their design increasingly reflects the spectrum of contemporary sensing platforms, tasks, and cross-modal reasoning challenges present in real-world geoscience and Earth observation.
1. Sensor Modalities and Dataset Composition
Recent benchmark datasets exhibit multimodality at different levels of spatial, spectral, and semantic granularity:
- Core image modalities: High-resolution RGB (satellite, aerial, drone), synthetic aperture radar (SAR), multispectral (Sentinel-2, SPOT-6, WorldView), near-infrared (NIR), panchromatic (PAN), and, in more advanced datasets, hyperspectral imagery or digital surface models (DSM/LiDAR) (Sumbul et al., 2021, Agrafiotis et al., 2024, Li et al., 2024, Xiong et al., 2023, Fan et al., 4 Aug 2025, Zou et al., 27 Nov 2025).
- Temporal and environmental diversity: Some benchmarks collect time series to capture event progression (e.g., disaster monitoring in MONITRS) or variation in illumination/atmosphere (e.g., fog/snow/night in MMM-RS) (Revankar et al., 22 Jul 2025, Luo et al., 2024).
- Non-imagery modalities: Geospatial coordinates, vector/rasterized map data, ground inventories, and natural language annotations (captions, QA, instructions) are increasingly incorporated, especially for retrieval, reasoning, or cross-modal grounding (Aimar et al., 12 Dec 2025, Wang et al., 31 Mar 2025, Li et al., 2024, Wu et al., 20 Mar 2025, Yang et al., 28 Sep 2025).
- Scale and annotation type: Leading datasets now offer millions of paired samples—BigEarthNet-MM: 590K S1/S2 pairs (Sumbul et al., 2021), MMM-RS: 2.1M multimodal text-image pairs (Luo et al., 2024), SAR-GEOVL-1M: 120K SAR tiles + >950K text segments (Yang et al., 28 Sep 2025), TalloS: 190K scenes with genus-level global tree labels (Bountos et al., 2023).
Modalities are typically co-registered to enable pixel-, region-, or object-level fusion, and annotated with semantic labels (class labels, bounding boxes, segmentation masks, ground-truth coordinates, QA pairs, or change maps).
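The multi-label annotations described above are usually consumed by models as fixed-length multi-hot vectors over the dataset's class ontology. A minimal sketch of this encoding, using an illustrative five-class subset rather than any dataset's official ontology or tooling:

```python
# Minimal sketch (not official dataset tooling): encode a patch's multi-label
# land-cover annotation as a multi-hot vector over a fixed class ontology.
# The class names below are an illustrative subset, not a real 19-class schema.
CLASSES = ["arable land", "coniferous forest", "pastures", "urban fabric", "water bodies"]
INDEX = {name: i for i, name in enumerate(CLASSES)}

def multi_hot(labels, index=INDEX):
    """Map a set of class-name annotations to a 0/1 vector of length len(index)."""
    vec = [0] * len(index)
    for name in labels:
        vec[index[name]] = 1
    return vec

print(multi_hot(["pastures", "water bodies"]))  # [0, 0, 1, 0, 1]
```

Multi-hot targets pair naturally with the per-class metrics (F1, macro/micro averages) that multi-label classification benchmarks report.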
2. Benchmark Tasks and Evaluation Protocols
Multimodal RS datasets support a broad spectrum of standardized challenges:
- Scene and object classification: Image- or patch-level multi-modal, multi-label classification (e.g., land cover/land use, species, or target categories) (Sumbul et al., 2021, Bountos et al., 2023, Li et al., 2024, Yang et al., 28 Sep 2025).
- Object detection and instance segmentation: Bounding-box, oriented box, or polygon-based detection across modalities (e.g., ships, vehicles, buildings) (Li et al., 2024, Fan et al., 4 Aug 2025, Shermeyer et al., 2020, Li et al., 2024).
- Semantic segmentation: Pixelwise labeling on RGB+height (GAMUS (Xiong et al., 2023)), hyperspectral+SAR/DSM (C2Seg (Hong et al., 2023), S2FL (Hong et al., 2021)), or multi-modal UHR imagery (XLRS-Bench (Wang et al., 31 Mar 2025)).
- Cross-modal retrieval: Retrieval across image-text, image-image, or modal pairs (e.g., SAR-Optical, map-visible, region–caption, or referring expression) (Aimar et al., 12 Dec 2025, Wu et al., 20 Mar 2025, Yang et al., 28 Sep 2025, Li et al., 2024).
- Image and caption generation: Text-to-image (MMM-RS (Luo et al., 2024)), cross-modal translation (RGB↔SAR→PAN, SMART-Ship (Fan et al., 4 Aug 2025)), and pan-sharpening tasks (Fan et al., 4 Aug 2025).
- Change detection and spatiotemporal reasoning: Examples include disaster progression via temporal image sequences (MONITRS (Revankar et al., 22 Jul 2025)), and pixelwise change masks (SMART-Ship (Fan et al., 4 Aug 2025)).
- Advanced VQA and MLLM tasks: Visual question answering, spatial reasoning, multi-hop inference, geocoding, and grounding on UHR imagery (XLRS-Bench (Wang et al., 31 Mar 2025), RSMEB (Aimar et al., 12 Dec 2025), DDFAV/RSPOPE (Li et al., 2024)).
All major benchmarks report task-specific metrics in standardized forms:
- Classification: overall accuracy (OA), F1, and macro/micro averages.
- Detection: mean average precision (mAP) at multiple IoU thresholds, with oriented or horizontal bounding-box variants.
- Segmentation: per-class IoU, mean IoU (mIoU), pixel accuracy.
- Retrieval: Recall@K, Precision@1 (P@1), mean average precision (mAP).
- VQA/QA: accuracy for closed-ended (binary or multiple-choice) questions; BLEU/METEOR/ROUGE for captions.
- Generation: FID, Inception Score, SSIM, PSNR, ERGAS, SAM.
- Tracking/3D tasks: for UAV/MM3D, CLEAR-MOT metrics, ADE/FDE for trajectory prediction (Zou et al., 27 Nov 2025).
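Two of the metrics above can be made concrete with short, dataset-agnostic reference implementations: per-class IoU averaged into mIoU for segmentation, and Recall@K for retrieval. This is an illustrative sketch, not any benchmark's official evaluation code:

```python
# Illustrative (not benchmark-official) metric implementations.

def mean_iou(pred, gt, num_classes):
    """pred, gt: flat sequences of per-pixel class ids.
    mIoU = mean over classes of |intersection| / |union|,
    skipping classes absent from both prediction and ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

def recall_at_k(ranked_ids, true_id, k):
    """ranked_ids: retrieval results ordered best-first.
    Returns 1.0 if the true match appears in the top-k, else 0.0;
    averaging over queries gives the reported Recall@K."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

# Tiny worked example: class 0 IoU = 1/2, class 1 IoU = 2/3.
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 3))  # 0.583
```

Detection mAP and generation metrics (FID, SSIM, ERGAS, SAM) are substantially more involved and are usually taken from the benchmark's released evaluation code rather than reimplemented.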
3. Data Curation, Preprocessing, and Annotation Strategies
Benchmark integrity and reproducibility depend on rigorous data handling:
- Preprocessing pipelines: Sensor-specific corrections (radiometric, atmospheric, geometric), resampling (e.g., all bands to 10 m), patching or tiling (e.g., 512×512 for MMM-RS (Luo et al., 2024), 120×120 for BigEarthNet-MM (Sumbul et al., 2021)), cloud/shadow masking, and co-registration across modalities.
- Quality control: Manual, hybrid, or LLM-assisted pipelines for cross-modal annotation (e.g., MapData uses manual correspondence, RANSAC, template matching (Wu et al., 20 Mar 2025); XLRS-Bench employs iterative LLM+human captioning (Wang et al., 31 Mar 2025)). Cloud/snow/shadow filtering is often critical for optical modalities.
- Semantic schema: Unified class ontologies (19-class mapping in BigEarthNet-MM (Sumbul et al., 2021), long-tail 61-class ship taxonomy in SMART-Ship (Fan et al., 4 Aug 2025), 13-class LULC for C2Seg (Hong et al., 2023)) or hierarchical/chain-of-thought text (SAR-GEOVL-1M HCoT (Yang et al., 28 Sep 2025)).
- Annotation formats: JSON, CSV, GeoTIFF, or MATLAB; typically standardized, with provided code/dataloaders for splits and tasks.
- Official splits: Train/val/test partitions designed to control for spatial, temporal, and semantic leakage (e.g., event-level split in MONITRS (Revankar et al., 22 Jul 2025), patch/city split in C2Seg (Hong et al., 2023)).
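The patching/tiling step above can be sketched in a few lines. This is a generic non-overlapping tiler under assumed conventions (a `(bands, H, W)` array layout and full tiles only); real pipelines also handle georeferencing, overlap, and nodata masking:

```python
import numpy as np

# Sketch of a common preprocessing step: tile a large co-registered scene into
# fixed-size, non-overlapping patches (e.g., 512x512 or 120x120 as cited above).
# Band count and scene size here are illustrative, not from any dataset.
def tile(scene, patch):
    """scene: (bands, H, W) array. Yields (row, col, patch_array) for each
    full patch; partial edge tiles are dropped."""
    _, h, w = scene.shape
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            yield r, c, scene[:, r:r + patch, c:c + patch]

scene = np.zeros((3, 1024, 1100), dtype=np.float32)
tiles = list(tile(scene, 512))
print(len(tiles))  # 2 rows x 2 cols = 4 full 512x512 tiles
```

Because tiles from one scene are highly correlated, the official splits mentioned above are typically made at the scene, city, or event level rather than the tile level, precisely to avoid spatial leakage.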
4. Notable Datasets and Comparative Landscape
A non-exhaustive list of leading multimodal RS datasets, with task, modalities, and scale:
| Dataset | Modalities | Tasks | Scale/Notes | Reference |
|---|---|---|---|---|
| BigEarthNet-MM | Sentinel-1 (SAR), S2 (MSI) | Multilabel classification, retrieval | 590K pairs | (Sumbul et al., 2021) |
| MONITRS | Sentinel-2 RGB, text/news QA | Disaster tracking, temporal VQA | 10K events | (Revankar et al., 22 Jul 2025) |
| SMART-Ship | RGB, SAR, PAN, NIR, MS | Detection, Re-ID, change, pan-sharpening, generation | 1K scenes/38K ships | (Fan et al., 4 Aug 2025) |
| MMM-RS | RGB, SAR, NIR, text | Text-to-image, cross-modal generation | 2.1M samples | (Luo et al., 2024) |
| RSMEB/VLM2GeoVec | RGB, text, coords, boxes | Classification, retrieval, geo-localization, VQA | 205K queries | (Aimar et al., 12 Dec 2025) |
| SpaceNet 6 | SAR, optical (EO) | Instance segmentation (buildings) | 48K polygons | (Shermeyer et al., 2020) |
| MapData/MapGlue | Map (raster), visible image | Image matching | 121K pairs | (Wu et al., 20 Mar 2025) |
| XLRS-Bench | UHR panchromatic, text | Reasoning, grounding, VQA | 1400 images | (Wang et al., 31 Mar 2025) |
| C2Seg | HS, MS, SAR | Cross-city semantic segmentation | 4 images, multi-M | (Hong et al., 2023) |
| MagicBathyNet | S2, SPOT-6, aerial, DSM | Bathymetry, seabed segmentation | 3K triplets | (Agrafiotis et al., 2024) |
| UAV-MM3D | RGB, IR, DVS, LiDAR, Radar | 3D detection, trajectory, tracking | 400K frames | (Zou et al., 27 Nov 2025) |
Key trends include the scaling of sample sizes to multi-million pairs, broader temporal and multi-scene coverage, and standardized APIs and code for task benchmarking.
5. Advances in Fusion Approaches and Modal-Specific Protocols
Benchmarks enable the study of fusion strategies and cross-modal learning:
- Early, feature, and late fusion: Early channel stacking, mid-level concatenation, or late ensemble/logit fusion (GAMUS (Xiong et al., 2023), S2FL (Hong et al., 2021), SMART-Ship (Fan et al., 4 Aug 2025)).
- Transformer/attention architectures: Intermediary fusion tokens (TIMF in GAMUS), cross-attention, and multi-head strategies dominate recent segmentation and detection systems (Xiong et al., 2023, Li et al., 2024).
- Grid-level and sparse MoE backbones: Grid-level sparse mixture-of-experts (SM3Det (Li et al., 2024)) facilitate dynamic allocation of model capacity to each modality/task.
- Alignment-specific protocols: Homography/matching benchmarks (MapData), temporal sequence fusion (MONITRS), cross-spectral translation baselines (CycleGAN, ControlNet (Luo et al., 2024)).
- Evaluation stratified by modality: Distinct metrics for each source (e.g., HBB vs. OBB for SAR/optical/IR in SOI-Det (Li et al., 2024)), ablation studies, and domain gap analyses.
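The early-versus-late fusion contrast from the list above can be illustrated with stand-in linear heads in place of real backbones. This is a toy sketch under assumed shapes (channel-first arrays, global average pooling); benchmark systems use CNN or transformer encoders per modality:

```python
import numpy as np

# Toy contrast of two fusion strategies on co-registered inputs.
# The "models" are random linear maps, purely to show the data flow.
rng = np.random.default_rng(0)
rgb = rng.random((3, 8, 8))   # 3-channel optical patch
sar = rng.random((1, 8, 8))   # 1-channel SAR patch, same spatial grid
num_classes = 5

def logits(x, w):
    """Global-average-pool the patch, then apply a linear head."""
    return w @ x.mean(axis=(1, 2))

# Early fusion: stack channels, run a single head over the joint input.
w_early = rng.random((num_classes, 4))
early = logits(np.concatenate([rgb, sar], axis=0), w_early)

# Late fusion: one head per modality, then average the logits.
w_rgb = rng.random((num_classes, 3))
w_sar = rng.random((num_classes, 1))
late = (logits(rgb, w_rgb) + logits(sar, w_sar)) / 2

print(early.shape, late.shape)  # (5,) (5,)
```

Mid-level (feature) fusion sits between these two: modality-specific encoders produce intermediate features that are concatenated or cross-attended before a shared head, which is the pattern the transformer-based systems cited above follow.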
6. Applications, Limitations, and Future Benchmark Directions
Benchmarks unlock research in:
- Disaster monitoring: Rapid assessment from text+image, temporal progression (MONITRS (Revankar et al., 22 Jul 2025)).
- Climate and forestry: Global-scale tree taxonomy (TalloS (Bountos et al., 2023)), forest change, and taxonomy-aware modeling.
- Geolocation and cross-modal retrieval: Embedding tasks with text, coordinates, or referring expressions (RSMEB (Aimar et al., 12 Dec 2025), SAR-KnowLIP (Yang et al., 28 Sep 2025), MapGlue (Wu et al., 20 Mar 2025)).
- Foundation model pretraining/evaluation: Universal encoders for any modal-combination (FoMo-Bench/FoMo-Net (Bountos et al., 2023), SAR-KnowLIP (Yang et al., 28 Sep 2025)), including self-supervised and MAE objectives.
- Temporally aware, instruction-driven multimodal LLMs: Spatiotemporal reasoning on UHR, prompt/QA tasks (XLRS-Bench (Wang et al., 31 Mar 2025), DDFAV/RSPOPE (Li et al., 2024), MGIMM (Yang et al., 2024)).
Observed limitations include:
- Modal coverage gaps: Not all combinations (e.g., SAR+NIR+HS+DSM+LiDAR) are publicly available; NIR and HS remain underrepresented outside dedicated releases such as GAMUS (Xiong et al., 2023), MagicBathyNet (Agrafiotis et al., 2024), and C2Seg (Hong et al., 2023).
- Single-region or short-timescale coverage: Many datasets are single-epoch or fixed geography.
- Annotation/label drift: Inconsistent label schemas and temporal mismatches between RS acquisitions and ground/societal data sources.
- Long-tailed class/semantic imbalance: Prominent for rare micro-objects or fine-grained taxonomies (e.g., SMART-Ship, TalloS).
Anticipated directions include: global temporal archives, multi-resolution and multi-domain expansion, more realistic multi-weather/atmosphere conditions, richer QA/captioning with formal correctness checks, and direct support for 3D/UAV perceptual tasks.
7. References and Data Accessibility
Most benchmarks provide public access to data, pre-trained models, training/evaluation code, and detailed documentation:
- BigEarthNet-MM (Sumbul et al., 2021)
- MONITRS (Revankar et al., 22 Jul 2025)
- MMM-RS (Luo et al., 2024)
- SMART-Ship (Fan et al., 4 Aug 2025)
- SpaceNet 6 (Shermeyer et al., 2020)
- GAMUS (Xiong et al., 2023)
- S2FL/C2Seg (Hong et al., 2021, Hong et al., 2023)
- MagicBathyNet (Agrafiotis et al., 2024)
- UAV-MM3D (Zou et al., 27 Nov 2025)
- RSMEB/VLM2GeoVec (Aimar et al., 12 Dec 2025)
- XLRS-Bench (Wang et al., 31 Mar 2025)
- SAR-KnowLIP (Yang et al., 28 Sep 2025)
- FoMo-Bench/FoMo-Net (Bountos et al., 2023)
Each provides detailed licensing and recommended citation conventions. Standardization of splits, annotation formats, and baseline protocols is a central feature, ensuring comparability and reproducibility in algorithmic benchmarking.