PASTIS-R Benchmark: Multimodal Crop Mapping
- The paper introduces a benchmark that provides detailed instance-level and semantic annotations for crop type mapping using multimodal satellite time series.
- It combines over 2,400 geo-referenced patches with temporally aligned Sentinel-2 and Sentinel-1 data, enabling comprehensive evaluations of parcel classification and segmentation tasks.
- Experimental results show that late fusion strategies with auxiliary supervision and temporal dropout yield the highest performance under challenging conditions such as cloud occlusion.
PASTIS-R Benchmark is a large-scale, multimodal remote sensing dataset and benchmarking suite designed for crop type mapping from satellite time series using both optical and radar modalities. Built as an extension of the original PASTIS dataset, PASTIS-R incorporates temporally aligned Sentinel-2 multispectral data and Sentinel-1 C-band Synthetic Aperture Radar (SAR) acquisitions, with detailed instance-level and semantic annotations. It serves as the first publicly available benchmark to enable systematic comparison of spatio-temporal multimodal models for three core agricultural mapping tasks: parcel classification, pixel-level semantic segmentation, and panoptic parcel segmentation. The benchmark facilitates investigating multimodal temporal fusion strategies, the robustness of models to inconsistent optical data caused by cloud cover, and the benefits of combining diverse Earth observation modalities for dense temporal analysis (Garnot et al., 2021).
1. Dataset Composition and Specifications
PASTIS-R comprises 2,433 geo-referenced image patches, each spanning 1.28 km × 1.28 km at 10 m ground sampling distance (128×128 pixels). Each patch contains multiple agricultural parcels; these are individually annotated with unique instance identifiers and one of 18 crop-type labels. For semantic and panoptic segmentation tasks, an additional “background” class is included, totaling 19 classes.
The optical component consists of Sentinel-2 Level 2A images covering the 2019 growing season, with approximately 47 multispectral acquisitions per patch (about 115,000 images total), at an average revisit interval of five days. The radar component includes both ascending (S1A) and descending (S1D) Sentinel-1 SAR time series, each with three channels: VV, VH, and the computed VV/VH ratio, yielding approximately 70 acquisitions per patch per orbit (~339,000 radar images). Temporal alignment for early fusion is achieved by interpolating radar acquisitions to optical timestamps; for other fusion forms, each modality retains its native cadence.
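The temporal alignment used for early fusion can be sketched with NumPy; the function name, array shapes, and day-of-year timestamps below are illustrative assumptions, not the benchmark's actual preprocessing code:

```python
import numpy as np

def align_radar_to_optical(radar, radar_days, optical_days):
    """Linearly interpolate a radar time series onto optical dates.

    radar        : (T_r, C, H, W) SAR acquisitions
    radar_days   : (T_r,) acquisition days, strictly increasing
    optical_days : (T_o,) target optical acquisition days
    Returns an array of shape (T_o, C, H, W).
    """
    T_r, C, H, W = radar.shape
    flat = radar.reshape(T_r, -1)                  # (T_r, C*H*W)
    out = np.empty((len(optical_days), flat.shape[1]), dtype=radar.dtype)
    for j in range(flat.shape[1]):                 # interpolate each pixel/channel
        out[:, j] = np.interp(optical_days, radar_days, flat[:, j])
    return out.reshape(len(optical_days), C, H, W)
```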
The geographic domain covers four agro-climatic regions in France (Auvergne–Rhône-Alpes, Grand Est, Hauts-de-France, Occitanie). Dataset splits follow the official five-fold cross-validation protocol established for the original PASTIS dataset, with each fold held out in turn as the test set.
2. Core Benchmark Tasks and Evaluation Metrics
PASTIS-R standardizes three principal tasks, providing a consistent experimental platform and metrics.
2.1 Parcel Classification:
Given vector masks for all parcels in a patch, the model predicts a single crop-type label per parcel. The principal metric is mean Intersection over Union (mIoU) across the K = 18 crop classes:

mIoU = (1/K) Σ_k TP_k / (TP_k + FP_k + FN_k)
2.2 Pixel-Based Semantic Segmentation:
Each pixel is assigned one of 19 semantic classes (18 crops + background). The evaluation metric is mean IoU over all classes, as above.
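The per-class IoU averaging used by both tasks can be made concrete with a minimal NumPy sketch; the function name and the choice to skip classes absent from both maps are assumptions, not the benchmark's reference implementation:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes from flat label arrays.

    Classes absent from both prediction and ground truth are
    excluded from the average (an assumed convention here).
    """
    ious = []
    for k in range(num_classes):
        inter = np.logical_and(pred == k, target == k).sum()
        union = np.logical_or(pred == k, target == k).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```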
2.3 Panoptic Parcel Segmentation:
This requires joint instance segmentation and semantic labeling. Performance is quantified by:
- Recognition Quality (RQ): quantifies the accuracy of instance detection and classification.
- Segmentation Quality (SQ): mean IoU over all matched prediction–ground-truth pairs.
- Panoptic Quality (PQ): aggregate metric given by PQ = SQ × RQ, with RQ = |TP| / (|TP| + ½|FP| + ½|FN|) and SQ the mean IoU over the matched pairs in TP.
Here, true positives (TP) correspond to predicted instances with IoU ≥ 0.5 and correct class; false positives (FP) are unmatched predictions; false negatives (FN) denote missed ground-truth instances.
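These definitions can be sketched in a few lines; the function and its inputs are illustrative, and the matching step itself (pairing predictions to ground truth by IoU ≥ 0.5 with class agreement) is assumed to have been done upstream:

```python
def panoptic_quality(match_ious, num_fp, num_fn):
    """PQ, SQ, RQ from already-matched instances.

    match_ious : IoU values of matched (prediction, ground-truth) pairs
    num_fp     : count of unmatched predicted instances
    num_fn     : count of missed ground-truth instances
    """
    tp = len(match_ious)
    sq = sum(match_ious) / tp if tp else 0.0
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn) if (tp + num_fp + num_fn) else 0.0
    return sq * rq, sq, rq
```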
3. Multimodal Model Architectures and Fusion Strategies
The benchmark focuses on temporal attention-based backbones, evaluating how to integrate information from multimodal time series:
3.1 Core Backbones
- Parcel classification:
- Pixel-Set Encoder (PSE): aggregates pixel features within each parcel at each time step into a parcel-level embedding.
- Lightweight Temporal Attention Encoder (L-TAE): multi-head self-attention over the temporal sequence of embeddings; outputs per-parcel feature vectors followed by an MLP and softmax for label prediction.
- Semantic and Panoptic segmentation:
- U-TAE: a U-Net variant employing temporal attention blocks interleaved with spatial convolutions, generating per-pixel spatio-temporal feature maps.
- For panoptic segmentation, a Parcel-as-Points (PaPs) instance head is appended with a composite loss structure targeting classification, mask regression, and clustering.
3.2 Fusion Strategies
Let X^m denote the input time series for modality m ∈ {S2, S1A, S1D}, each with its own number of acquisition dates T_m. Four fusion strategies are benchmarked:
- Early Fusion:
Temporally aligns all modalities to a unified grid (the optical acquisition dates), concatenates along the channel axis, and processes with a shared spatio-temporal encoder.
- Late Feature Fusion:
Processes raw time series from each modality independently through modality-specific encoders, then concatenates resulting embeddings for joint decoding.
- Decision Fusion:
Each modality's encoder/decoder pair predicts independently, and the per-modality outputs are averaged: ŷ = (1/M) Σ_m ŷ^(m), with M = 3 modalities.
- Mid Fusion (novel):
Applies a spatial encoder per date and modality, concatenates temporal sequences across modalities before a shared temporal encoder and decoder.
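Where the concatenation happens in each strategy can be illustrated with NumPy; the toy shapes, channel counts, and the trivial stand-in encoder below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 5, 8, 8                       # toy sizes: dates, patch height/width
s2  = rng.normal(size=(T, 10, H, W))    # Sentinel-2: 10 spectral bands
s1a = rng.normal(size=(T, 3, H, W))     # Sentinel-1 ascending (VV, VH, VV/VH)
s1d = rng.normal(size=(T, 3, H, W))     # Sentinel-1 descending, same channels

# Early fusion: modalities resampled to common dates, concatenated on the
# channel axis, then fed to a single shared spatio-temporal encoder.
early = np.concatenate([s2, s1a, s1d], axis=1)     # (T, 16, H, W)

# Late feature fusion: modality-specific encoders run first; their embeddings
# are concatenated before the joint decoder. A temporal mean stands in for
# each encoder here.
def encode(x):
    return x.mean(axis=0)                          # (C, H, W)

late = np.concatenate([encode(s2), encode(s1a), encode(s1d)], axis=0)  # (16, H, W)
```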
Enhancements: Late and decision fusion are augmented by auxiliary supervision (additional loss terms per modality branch to avoid gradient collapse to a dominant modality), and all strategies can utilize temporal dropout (randomly omitting observations during training) to improve generalization and stability to missing data.
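Temporal dropout itself is simple to sketch; the function name and the keep-at-least-one guard are assumptions rather than details from the paper:

```python
import numpy as np

def temporal_dropout(series, drop_prob, rng):
    """Randomly withhold acquisitions from a (T, ...) time series during
    training; guarantees at least one observation survives."""
    T = series.shape[0]
    keep = rng.random(T) >= drop_prob
    if not keep.any():                  # never drop the entire sequence
        keep[rng.integers(T)] = True
    return series[keep], keep
```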
4. Training Protocols and Implementation Details
All tasks use the Adam optimizer with a base learning rate of 0.001 and specific batch sizes (128 for parcel, 4 for patch-based tasks) on V100 32GB GPUs. For panoptic segmentation, training comprises 100 epochs, with a learning rate decay after 50 epochs. Models are evaluated using 5-fold cross-validation across geographic regions.
Loss functions employed include standard cross-entropy for classification tasks and a panoptic loss for instance segmentation, as detailed in the Parcel-as-Points (PaPs) framework (Garnot et al., 2021). Temporal dropout is applied as a standard augmentation, randomly withholding a fraction of the optical and radar acquisitions during training but not at inference. Preprocessing involves no cloud masking for the optical data and no speckle filtering or terrain correction for SAR, to maintain temporal density and permit realistic robustness testing.
5. Experimental Results and Comparative Analysis
5.1 Parcel Classification
Late fusion with both temporal dropout and auxiliary loss achieves the highest mIoU at 77.2%, outperforming S2-only models by 3.3 points. Early and mid fusion also improve over unimodal performance. SAR-alone models lag optical-only by ~10 points, but their complementary contribution is crucial under adverse conditions (e.g., cloud cover).
5.2 Semantic Segmentation
Late fusion (with enhancements) achieves the best mIoU (66.3), with notable gains at parcel boundaries, where SAR contributes complementary structural information.
5.3 Panoptic Segmentation
Late fusion with dropout achieves PQ = 41.6, a +1.2 point increase over S2-only. The improvement is most notable in Recognition Quality (RQ), denoting better instance localization and labeling.
5.4 Cloud Robustness
Reduction of available optical observations at test time (to mimic cloud occlusion) demonstrates the value of multimodal fusion:
- S2-only mIoU or PQ degrades rapidly beyond 50% missing data.
- Fusion (especially decision fusion) maintains accuracy and outperforms radar-only models even with >70% optical loss.
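The occlusion protocol can be mimicked by discarding a fixed fraction of optical dates at test time; this is a sketch under assumptions (uniform random sampling, a hypothetical function name) and the paper's exact protocol may differ:

```python
import numpy as np

def simulate_cloud_occlusion(num_dates, missing_frac, rng):
    """Indices of optical acquisitions kept after randomly discarding
    a `missing_frac` fraction of them, mimicking cloud-induced loss."""
    n_drop = int(round(missing_frac * num_dates))
    dropped = rng.choice(num_dates, size=n_drop, replace=False)
    return np.setdiff1d(np.arange(num_dates), dropped)
```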
5.5 Efficiency
Early fusion incurs the overhead of temporal interpolation; late fusion is marginally slower at inference and has a higher parameter count, but delivers the highest accuracy and the most robust generalization.
Summary Table: Best Results per Task
| Task | Best Fusion Strategy | mIoU/PQ (%) | Comments |
|---|---|---|---|
| Parcel Classification | Late (+drop+aux) | 77.2 | Highest accuracy |
| Semantic Segmentation | Late (+drop+aux) | 66.3 (mIoU) | Improved boundaries |
| Panoptic Segmentation | Late (+drop) | 41.6 (PQ) | Best overall instance PQ |
6. Limitations and Recommendations
PASTIS-R’s regional scope is restricted to four regions of France, possibly limiting generalization to other agro-ecological zones or crop types (such as rice paddies). Only backscatter intensity is used for SAR; the addition of interferometric or full-polarimetric features may further benefit certain tasks. Cloud masking is not performed on the optical data, supporting rigorous robustness tests, but future research can explore explicit cloud gap-filling or optical restoration methods leveraging radar. Parcel geometry is used as provided, without geometric augmentation.
Auxiliary losses are essential to avoid fused models “collapsing” onto a single dominant modality, especially in the presence of class or modality imbalance. Temporal dropout is crucial for memory efficiency and generalizability.
7. Data Availability and Impact
PASTIS-R and its associated codebase (CC-BY 4.0) are available at https://zenodo.org/record/5735646, enabling reproduction and further research in multimodal temporal learning from satellite time series. The benchmark has driven new developments in temporal attention networks, multimodal fusion, and crop mapping tasks, with systematic evaluations clarifying tradeoffs among network designs for robustness, accuracy, and computational efficiency. Its influence spans applications in agricultural monitoring, remote sensing, and temporal representation learning (Garnot et al., 2021).