
Multimodal Remote Sensing Datasets

Updated 5 February 2026
  • Multimodal remote sensing datasets are standardized archives combining diverse sensor modalities (e.g., optical, SAR, LiDAR) to evaluate ML models reproducibly.
  • They incorporate co-registered imagery, non-imagery data, and detailed annotations to support tasks like classification, detection, segmentation, and retrieval.
  • These benchmarks enable advanced fusion techniques and cross-modal reasoning, facilitating applications in disaster monitoring, climate analysis, and geolocation.

Multimodal remote sensing benchmark datasets comprise standardized public archives spanning two or more distinct sensor modalities (e.g., optical, SAR, multispectral, hyperspectral, infrared, LiDAR, maps, or textual descriptions) that support rigorous, reproducible evaluation of machine learning models for remote sensing tasks. These resources have become foundational across land cover mapping, object detection, geocoding, cross-modal retrieval, generative modeling, and advanced multimodal LLM (MLLM) research. Their design increasingly reflects the spectrum of contemporary sensing platforms, tasks, and cross-modal reasoning challenges present in real-world geoscience and Earth observation.

1. Sensor Modalities and Dataset Composition

Recent benchmark datasets exhibit multimodality at different levels of spatial, spectral, and semantic granularity. Modalities are typically co-registered to enable pixel-, region-, or object-level fusion, and annotated for semantics (class labels, bounding boxes, segmentation masks, ground-truth coordinates, question-answer pairs, or change maps).
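A co-registered sample can be represented as a simple container whose modalities share one spatial grid. The following is a minimal sketch; the class and field names are illustrative and not drawn from any particular dataset:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MultimodalSample:
    """One co-registered patch; field names are illustrative only."""
    optical: np.ndarray   # (H, W, C) reflectance, e.g. multispectral bands
    sar: np.ndarray       # (H, W, 2) backscatter, e.g. VV/VH polarizations
    mask: np.ndarray      # (H, W) integer class labels

    def __post_init__(self):
        # Co-registration implies every modality shares the same spatial grid.
        h, w = self.mask.shape
        assert self.optical.shape[:2] == (h, w)
        assert self.sar.shape[:2] == (h, w)

# A BigEarthNet-style 120x120 patch with 12 optical bands and 2 SAR channels.
sample = MultimodalSample(
    optical=np.zeros((120, 120, 12)),
    sar=np.zeros((120, 120, 2)),
    mask=np.zeros((120, 120), dtype=np.int64),
)
```

Keeping the grid check in the constructor catches mis-registered inputs early, before any fusion model sees them.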

2. Benchmark Tasks and Evaluation Protocols

Multimodal RS datasets support a broad spectrum of standardized challenges, spanning classification, detection, segmentation, retrieval, question answering, generation, and tracking. All major benchmarks report task-specific metrics in standardized forms:

  • Classification: accuracy, F1, OA, macro/micro averages.
  • Detection: mean average precision (mAP) at multiple IoU thresholds, oriented/horizontal box support.
  • Segmentation: per-class IoU, mean IoU (mIoU), pixel accuracy.
  • Retrieval: Recall@K, Precision@1 (P@1), mean average precision (mAP).
  • VQA/QA: accuracy for closed-ended (binary or multiple-choice) QA; BLEU/METEOR/ROUGE for captions.
  • Generation: FID, Inception Score, SSIM, PSNR, ERGAS, SAM.
  • Tracking/3D tasks: CLEAR-MOT metrics for multi-object tracking and ADE/FDE for trajectory prediction (e.g., UAV-MM3D (Zou et al., 27 Nov 2025)).
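Two of the metrics above can be sketched compactly; this is a minimal, dataset-agnostic illustration (the function names are my own, not from any benchmark toolkit):

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """mIoU from a (C, C) confusion matrix: rows = ground truth, cols = prediction."""
    tp = np.diag(conf).astype(float)
    # Union per class = predicted + actual - intersection.
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    ious = tp / np.where(denom > 0, denom, 1)
    return float(ious[denom > 0].mean())  # average only over classes that occur

def recall_at_k(ranked_ids, relevant_id, k=5) -> float:
    """Recall@K for single-relevant-item retrieval: 1 if the match is in the top K."""
    return float(relevant_id in ranked_ids[:k])

# Two-class toy case: perfect on class 0, half the class-1 pixels missed.
conf = np.array([[4, 0],
                 [2, 2]])
score = mean_iou(conf)  # class IoUs: 4/6 and 2/4, averaged
```

Detection mAP and generation metrics (FID, SSIM) are more involved and are usually computed with each benchmark's released evaluation code rather than reimplemented.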

3. Data Curation, Preprocessing, and Annotation Strategies

Benchmark integrity and reproducibility depend on rigorous data handling:

  • Preprocessing pipelines: Sensor-specific corrections (radiometric, atmospheric, geometric), resampling (e.g., all bands to 10 m), patching or tiling (e.g., 512×512 for MMM-RS (Luo et al., 2024), 120×120 for BigEarthNet-MM (Sumbul et al., 2021)), cloud/shadow masking, and co-registration across modalities.
  • Quality control: Manual, hybrid, or LLM-assisted pipelines for cross-modal annotation (e.g., MapData uses manual correspondence, RANSAC, template matching (Wu et al., 20 Mar 2025); XLRS-Bench employs iterative LLM+human captioning (Wang et al., 31 Mar 2025)). Cloud/snow/shadow filtering is often critical for optical imagery.
  • Semantic schema: Unified class ontologies (19-class mapping in BigEarthNet-MM (Sumbul et al., 2021), long-tail 61-class ship taxonomy in SMART-Ship (Fan et al., 4 Aug 2025), 13-class LULC for C2Seg (Hong et al., 2023)) or hierarchical/chain-of-thought text (SAR-GEOVL-1M HCoT (Yang et al., 28 Sep 2025)).
  • Annotation formats: JSON, CSV, GeoTIFF, or MATLAB; typically standardized, with provided code/dataloaders for splits and tasks.
  • Official splits: Train/val/test partitions designed to control for spatial, temporal, and semantic leakage (e.g., event-level split in MONITRS (Revankar et al., 22 Jul 2025), patch/city split in C2Seg (Hong et al., 2023)).
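Leakage-controlled splits are often made deterministic by assigning whole scenes (or events, or cities) to a partition rather than individual patches. A minimal sketch of one common approach, hashing the scene identifier (the function and thresholds here are illustrative assumptions, not any benchmark's published procedure):

```python
import hashlib

def split_of(scene_id: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Assign a whole scene to a split so that patches cut from one scene
    never leak across train/val/test. Deterministic: depends only on the
    scene id, not on iteration order or a random seed."""
    h = int(hashlib.md5(scene_id.encode()).hexdigest(), 16) % 1000 / 1000.0
    if h < test_frac:
        return "test"
    if h < test_frac + val_frac:
        return "val"
    return "train"

splits = {sid: split_of(sid) for sid in ["tile_0001", "tile_0002", "tile_0003"]}
```

Because the assignment is a pure function of the identifier, re-running the pipeline (or adding new scenes later) never reshuffles existing scenes between splits.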

4. Notable Datasets and Comparative Landscape

A non-exhaustive list of leading multimodal RS datasets, with task, modalities, and scale:

| Dataset | Modalities | Tasks | Scale/Notes | Reference |
|---|---|---|---|---|
| BigEarthNet-MM | Sentinel-1 (SAR), Sentinel-2 (MSI) | Multilabel classification, retrieval | 590K pairs | (Sumbul et al., 2021) |
| MONITRS | Sentinel-2 RGB, text/news | QA, disaster tracking, temporal VQA | 10K events | (Revankar et al., 22 Jul 2025) |
| SMART-Ship | RGB, SAR, PAN, NIR, MS | Detection, Re-ID, change, pan-sharpening, generation | 1K scenes / 38K ships | (Fan et al., 4 Aug 2025) |
| MMM-RS | RGB, SAR, NIR, text | Text-to-image, cross-modal generation | 2.1M samples | (Luo et al., 2024) |
| RSMEB/VLM2GeoVec | RGB, text, coords, boxes | Classification, retrieval, geo-localization, VQA | 205K queries | (Aimar et al., 12 Dec 2025) |
| SpaceNet 6 | SAR, optical (EO) | Instance segmentation (buildings) | 48K polygons | (Shermeyer et al., 2020) |
| MapData/MapGlue | Map (raster), visible image | Image matching | 121K pairs | (Wu et al., 20 Mar 2025) |
| XLRS-Bench | UHR panchromatic, text | Reasoning, grounding, VQA | 1,400 images | (Wang et al., 31 Mar 2025) |
| C2Seg | HS, MS, SAR | Cross-city semantic segmentation | 4 images, multi-M | (Hong et al., 2023) |
| MagicBathyNet | Sentinel-2, SPOT-6, aerial, DSM | Bathymetry, seabed segmentation | 3K triplets | (Agrafiotis et al., 2024) |
| UAV-MM3D | RGB, IR, DVS, LiDAR, Radar | 3D detection, trajectory, tracking | 400K frames | (Zou et al., 27 Nov 2025) |

Key trends include the scaling of sample sizes into the multi-million range, incorporation of temporal and multi-scene coverage, and standardization of APIs and code for task benchmarking.

5. Advances in Fusion Approaches and Modal-Specific Protocols

Benchmarks enable the study of fusion strategies and cross-modal learning:

  • Early, feature, and late fusion: Early channel stacking, mid-level concatenation, or late ensemble/logit fusion (GAMUS (Xiong et al., 2023), S2FL (Hong et al., 2021), SMART-Ship (Fan et al., 4 Aug 2025)).
  • Transformer/attention architectures: Intermediary fusion tokens (TIMF in GAMUS), cross-attention, and multi-head strategies dominate recent segmentation and detection systems (Xiong et al., 2023, Li et al., 2024).
  • Grid-level and sparse MoE backbones: Grid-level sparse mixture-of-experts (SM3Det (Li et al., 2024)) facilitate dynamic allocation of model capacity to each modality/task.
  • Alignment-specific protocols: Homography/matching benchmarks (MapData), temporal sequence fusion (MONITRS), cross-spectral translation baselines (CycleGAN, ControlNet (Luo et al., 2024)).
  • Evaluation stratified by modality: Distinct metrics for each source (e.g., HBB vs. OBB for SAR/optical/IR in SOI-Det (Li et al., 2024)), ablation studies, and domain gap analyses.
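The early vs. late fusion distinction from the first bullet can be illustrated in a few lines. This is a toy sketch with random data and stand-in branch "models"; shapes and the equal-weight logit average are assumptions for illustration, not any paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
optical = rng.normal(size=(4, 64, 64, 3))   # toy batch: (B, H, W, C)
sar = rng.normal(size=(4, 64, 64, 2))

# Early fusion: stack channels before the first layer of a single network.
early = np.concatenate([optical, sar], axis=-1)        # (4, 64, 64, 5)

def toy_branch(x: np.ndarray, n_classes: int = 10) -> np.ndarray:
    """Stand-in for a trained per-modality branch: global average pooling
    followed by a fixed random projection to class logits."""
    pooled = x.mean(axis=(1, 2))                        # (B, C)
    w = np.random.default_rng(x.shape[-1]).normal(size=(x.shape[-1], n_classes))
    return pooled @ w                                   # (B, n_classes)

# Late fusion: run each modality through its own branch and average the logits.
late = 0.5 * toy_branch(optical) + 0.5 * toy_branch(sar)
```

Mid-level (feature) fusion sits between the two: each branch runs partway, and the intermediate feature maps are concatenated or attended over before a shared head.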

6. Applications, Limitations, and Future Benchmark Directions

Benchmarks unlock research in disaster monitoring, climate analysis, geolocation, cross-modal retrieval, and generative modeling. Several limitations are nonetheless observed:

  • Modal coverage gaps: Not all combinations (e.g., SAR+NIR+HS+DSM+LiDAR) are publicly available; NIR and HS are often underrepresented outside dedicated benchmark releases (e.g., GAMUS (Xiong et al., 2023), MagicBathyNet (Agrafiotis et al., 2024), C2Seg (Hong et al., 2023)).
  • Single-region or short-timescale coverage: Many datasets are single-epoch or fixed geography.
  • Annotation/label drift: Inconsistent label schema, temporal mismatch between RS and ground/societal sources.
  • Long-tailed class/semantic imbalance: Prominent for rare micro-objects or fine-grained taxonomies (e.g., SMART-Ship, TalloS).

Anticipated directions include global temporal archives, multi-resolution and multi-domain expansion, more realistic multi-weather/atmosphere conditions, richer QA/captioning with formal correctness checks, and direct support for 3D/UAV perceptual tasks.

7. References and Data Accessibility

Most benchmarks provide public access to data, pre-trained models, training/evaluation code, and detailed documentation, along with licensing terms and recommended citation conventions. Standardization of splits, annotation formats, and baseline protocols is a central feature, ensuring comparability and reproducibility in algorithmic benchmarking.
