Multi-Temporal Sentinel-2 Imagery
- Multi-temporal Sentinel-2 imagery comprises dense time series of satellite data at 10 m resolution, with high revisit frequency and multi-spectral coverage for precise Earth observation.
- Analysis pipelines employ deep learning techniques such as GRU, LSTM, ConvLSTM, and temporal attention to extract spatiotemporal features from seasonal and phenological signals.
- Fusion strategies integrating multi-modal inputs and temporal aggregation enhance mapping accuracy and robust change detection across diverse land cover applications.
Multi-temporal Sentinel-2 imagery refers to dense time series of satellite data collected by the ESA Sentinel-2 system, which offers decametric spatial resolution (10 m for core bands), high revisit frequency (global median ~5 days), and broad multi-spectral coverage. Such imagery supports diverse Earth observation tasks including land cover mapping, agricultural potential estimation, field boundary delineation, change detection, urban mapping, and super-resolution reconstruction. Multi-temporal approaches exploit the temporal dynamics—phenology, seasonality, disturbance events—implicit in sequences of surface reflectance, often in combination with derived vegetation indices and cloud-filtering protocols.
1. Data Acquisition and Temporal Structure
Sentinel-2 provides Level-1C (TOA) and Level-2A (BOA) reflectance products with 13 spectral bands: four at 10 m (Blue B2, Green B3, Red B4, NIR B8), six at 20 m (B5, B6, B7, B8A, B11, B12), and three at 60 m (atmospheric). Typical multi-temporal datasets assemble between 10 and >100 dates per site per annum, selected to minimize cloud cover (e.g., filtering to <2–5 % cloudy pixels per scene) and to maximize temporal regularity (e.g., monthly, seasonal medians, or custom phenological benchmarks) (Sakka et al., 13 Jun 2025, Dimitrovski et al., 2024, Zahid et al., 2024, Sultana et al., 12 Dec 2025). Cloud gaps are commonly filled by linear interpolation or gap-filling algorithms (Gbodjo et al., 2019), or alternatively dropped or smoothed via monthly averaging/tabular aggregation (Dimitrovski et al., 2024, Garioud et al., 2023).
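As a minimal sketch of the linear-interpolation gap-filling mentioned above, the snippet below fills cloud-masked dates in a single pixel's reflectance series along the time axis; the function name, mask convention, and toy data are illustrative, not taken from the cited papers.

```python
import numpy as np

def fill_cloud_gaps(series: np.ndarray, cloudy: np.ndarray) -> np.ndarray:
    """Linearly interpolate cloud-masked dates in a 1-D reflectance series.

    series : (T,) reflectance values for one pixel/band across T dates
    cloudy : (T,) boolean mask, True where the observation is cloud-contaminated
    """
    t = np.arange(len(series))
    valid = ~cloudy
    # np.interp holds values flat beyond the first/last valid date
    return np.interp(t, t[valid], series[valid])

# Toy example: a 10-date NIR series with two cloudy acquisitions
nir = np.array([0.30, 0.32, np.nan, 0.38, 0.40, np.nan, 0.35, 0.33, 0.31, 0.30])
print(fill_cloud_gaps(nir, np.isnan(nir)))
```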
Per-date variables include:
- Native reflectance bands (B2, B3, B4, B8), possibly with upsampling of 20/60 m bands to 10 m or 5 m in research settings (Sakka et al., 13 Jun 2025, Tarasiewicz et al., 2023).
- Derived NDVI, $\mathrm{NDVI} = (\mathrm{B8} - \mathrm{B4})/(\mathrm{B8} + \mathrm{B4})$; other indices such as EVI, SAVI, NDWI, or IRECI, as well as texture metrics, sometimes accompany it (Sultana et al., 12 Dec 2025, Zahid et al., 2024); see the index sketch after this list.
- Vegetation index stacks reflecting seasonal dynamics (Zahid et al., 2024, Sultana et al., 12 Dec 2025).
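As an index-computation sketch for the per-date variables above, the following derives an NDVI stack from B8 (NIR) and B4 (Red) reflectance cubes; names, shapes, and the small epsilon guard are illustrative.

```python
import numpy as np

def ndvi_stack(b8: np.ndarray, b4: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDVI = (B8 - B4) / (B8 + B4), computed per date on (T, H, W) stacks."""
    return (b8 - b4) / (b8 + b4 + eps)

# Toy (T=3, H=W=2) reflectance stacks in [0, 1]
rng = np.random.default_rng(0)
b8 = rng.uniform(0.2, 0.6, size=(3, 2, 2))
b4 = rng.uniform(0.05, 0.2, size=(3, 2, 2))
print(ndvi_stack(b8, b4).shape)  # (3, 2, 2)
```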
Object-based aggregation is applied in some studies for noise reduction: high-res segmentation yields super-pixels/objects, followed by per-date averaging to form object-level multivariate time series (Gbodjo et al., 2019, Benedetti et al., 2018).
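A minimal sketch of this aggregation, assuming a precomputed segmentation label map: per-date band means are taken over each object to yield object-level multivariate time series (function and variable names are illustrative).

```python
import numpy as np

def object_time_series(image_ts: np.ndarray, segments: np.ndarray) -> np.ndarray:
    """Average a (T, H, W, B) image time series over segment labels.

    Returns an (n_objects, T, B) array of object-level multivariate series.
    """
    T, H, W, B = image_ts.shape
    flat = image_ts.reshape(T, H * W, B)
    seg_flat = segments.ravel()
    labels = np.unique(seg_flat)
    out = np.zeros((len(labels), T, B))
    for i, lab in enumerate(labels):
        out[i] = flat[:, seg_flat == lab, :].mean(axis=1)
    return out

# Toy example: 4 dates, 4x4 pixels, 2 bands, 3 segments
rng = np.random.default_rng(1)
ts = rng.random((4, 4, 4, 2))
segs = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 1, 1], [2, 2, 2, 2]])
print(object_time_series(ts, segs).shape)  # (3, 4, 2)
```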
2. Temporal Feature Extraction and Model Formulations
Deep learning is dominant in extracting the spatio-temporal signatures embedded in multi-temporal Sentinel-2 sequences. Principal architectures include:
- Recurrent units: Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) blocks, often with fully connected input enrichment, are standard for processing per-object or per-pixel time series (Gbodjo et al., 2019, Benedetti et al., 2018, Mazzia et al., 2020). FCGRU (fully-connected enriched GRU) formulations expand the raw input via learned nonlinear projections prior to gating:
$x'_t = \tanh\big(W_2 \tanh(W_1 x_t + b_1) + b_2\big)$, where $x_t$ is the per-date input vector and $W_1$, $W_2$, $b_1$, $b_2$ are learned parameters (cf. (Gbodjo et al., 2019), Eq. 1).
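A minimal PyTorch sketch of the FCGRU idea, assuming a two-layer tanh enrichment ahead of a standard GRU; layer sizes and names are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class FCGRU(nn.Module):
    """Fully-connected enriched GRU: each per-date input is projected through
    learned nonlinearities before recurrent gating (sketch of the idea)."""

    def __init__(self, in_dim: int, enrich_dim: int, hidden_dim: int):
        super().__init__()
        self.enrich = nn.Sequential(
            nn.Linear(in_dim, enrich_dim), nn.Tanh(),
            nn.Linear(enrich_dim, enrich_dim), nn.Tanh(),
        )
        self.gru = nn.GRU(enrich_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, in_dim) per-object multivariate time series
        out, _ = self.gru(self.enrich(x))  # out: (batch, T, hidden_dim)
        return out

x = torch.randn(8, 21, 10)         # 8 objects, 21 dates, 10 features
print(FCGRU(10, 32, 64)(x).shape)  # torch.Size([8, 21, 64])
```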
- Temporal attention mechanisms: Learnable attention weights on hidden states improve selective focus, with both softmax and tanh activations used. In HOb2sRNN, customized tanh-attention (without normalization to sum-to-1) allows up- or down-weighting each time step independently (including negative contributions), critical for handling strongly seasonal or ambiguous phenology (Gbodjo et al., 2019):
$\lambda_t = \tanh(W_a^{\top} h_t + b_a), \qquad c = \sum_{t=1}^{T} \lambda_t h_t$, where $h_t$ are the recurrent hidden states over the $T$ dates and $W_a$, $b_a$ are learnable attention parameters.
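The unnormalized tanh attention can be sketched in PyTorch as follows; the scoring layer and shapes are illustrative. Note that the weights are neither softmax-normalized nor constrained to be positive, so individual dates can be suppressed or flipped in sign.

```python
import torch
import torch.nn as nn

class TanhAttention(nn.Module):
    """Tanh attention over recurrent hidden states, without sum-to-1
    normalization: per-date weights lie in [-1, 1] (sketch)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, hidden_dim) hidden states across T dates
        lam = torch.tanh(self.score(h))   # (batch, T, 1)
        return (lam * h).sum(dim=1)       # context vector: (batch, hidden_dim)

h = torch.randn(8, 21, 64)
print(TanhAttention(64)(h).shape)  # torch.Size([8, 64])
```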
- ConvLSTM or attention across space and time: For spatially-resolved segmentation or change detection, convolutional LSTM layers within encoder-decoder frameworks (e.g., U-Net+ConvLSTM) are used; these maintain spatial context in hidden states while propagating temporal information (Papadomanolaki et al., 2019, Dimitrovski et al., 2024). Recent transformer-based architectures (multi-head self-attention over temporal tokens) target long series (e.g., 45 dates) for domain-adversarial adaptation (Martini et al., 2021) and sequence modeling.
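A minimal ConvLSTM cell sketch: the LSTM gates are computed with convolutions over the concatenated input and hidden state, so the hidden state keeps its spatial layout while temporal information propagates. Hyperparameters are illustrative, and published architectures differ in detail.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """ConvLSTM cell: convolutional gates over spatially-resolved states."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # each (B, hid_ch, H, W)
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Run a 5-date sequence of 64x64 encoder feature maps through the cell
cell = ConvLSTMCell(in_ch=16, hid_ch=32)
h = torch.zeros(2, 32, 64, 64)
c = torch.zeros_like(h)
for t in range(5):
    h, c = cell(torch.randn(2, 16, 64, 64), (h, c))
print(h.shape)  # torch.Size([2, 32, 64, 64])
```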
- Self-supervised and multi-modal approaches: Self-supervised pretraining exploits radiometric invariance between overlapping patches and triplet-margin losses to learn transferable features for change detection (Leenstra et al., 2021). Multi-modal fusion combines Sentinel-2 with other modalities (e.g., Sentinel-1 SAR, aerial VHR, PlanetScope) either by late fusion in latent space (Hafner et al., 2023, Dimitrovski et al., 2024, Garioud et al., 2023), or by reconstructing missing optical features from SAR (Hafner et al., 2023).
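A minimal sketch of the triplet-margin pretraining idea: embeddings of overlapping, radiometrically consistent patches are pulled together while embeddings of unrelated patches are pushed apart. The toy encoder and patch sampling below are assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn

# Toy patch encoder (4 bands in, 32-D embedding out); purely illustrative
encoder = nn.Sequential(
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32),
)
loss_fn = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(8, 4, 32, 32)                   # patches at location A
positive = anchor + 0.05 * torch.randn_like(anchor)  # overlapping views of A
negative = torch.randn(8, 4, 32, 32)                 # patches from elsewhere

loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
print(float(loss))
```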
3. Fusion Strategies: Temporal, Spectral, Modal, and Spatial
Multi-temporal Sentinel-2 imagery is maximally exploited using advanced fusion schemes:
- Temporal fusion: Recursive pairwise merging (e.g., HighResNet) (Okabayashi et al., 2024), temporal max-pooling across features (Jindgar et al., 2024, Martini et al., 2021), and permutation-invariant mean-pooling (SPInet) (Valsesia et al., 2022); see the pooling sketch after this list.
- Spectral fusion: Simultaneous integration across bands at each GSD, and cross-resolution fusion for super-resolving lower-resolution bands (DeepSent) (Tarasiewicz et al., 2023).
- Modal fusion: Multi-source architectures combine features from Sentinel-2 optical, Sentinel-1 radar, VHR aerial, or PlanetScope, with dedicated feature branches and fusion nodes (Gbodjo et al., 2019, Benedetti et al., 2018, Hafner et al., 2023, Garioud et al., 2023, Dimitrovski et al., 2024). Two-stream fusion with weighted loss/classification heads is recommended for operational land cover mapping (Gbodjo et al., 2019).
- Spatial fusion: Object aggregation, tiling, and super-pixel strategies are applied for spatial noise and speckle reduction, critical in crop mapping and land-cover segmentation (Gbodjo et al., 2019, Benedetti et al., 2018, Sakka et al., 13 Jun 2025).
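The latent-space temporal pooling variants mentioned above reduce a stack of per-date feature maps to a single map; both reductions below are invariant to acquisition order (tensor shapes are illustrative).

```python
import torch

def temporal_max(feats: torch.Tensor) -> torch.Tensor:
    """Per-feature max over the T dates: (B, T, C, H, W) -> (B, C, H, W)."""
    return feats.max(dim=1).values

def temporal_mean(feats: torch.Tensor) -> torch.Tensor:
    """Per-feature mean over the T dates (as in mean-pooling fusion)."""
    return feats.mean(dim=1)

feats = torch.randn(2, 9, 64, 32, 32)   # 9 dates of encoder feature maps
print(temporal_max(feats).shape, temporal_mean(feats).shape)
```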
4. Applications: Land Cover Mapping, Change Detection, Agricultural Analytics, Super-Resolution, and Field Delineation
- Land Cover and Crop Classification: Recurrent convolutional architectures (Pixel R-CNN, FCGRU+attention) learn phenological signatures to classify >15 crop/vegetation classes with overall accuracy up to 96.5 % and correspondingly high Cohen's κ (Mazzia et al., 2020, Gbodjo et al., 2019, Benedetti et al., 2018). Object-based aggregation and multi-source fusion further improve results.
- Functional Field Boundary Extraction: Multi-date NDVI stacks facilitate boundary delineation, encoding crop growth and senescence and improving IoU by 5–8 pp compared to single-date input (Zahid et al., 2024). Transfer-learning experiments indicate sensitivity to scale and geography; multi-region training increases generalizability.
- Change Detection: Multi-temporal image pairs enable shallow CNN-based self-supervised pretraining on unlabeled stacks, supporting unsupervised and supervised change vector analysis (Leenstra et al., 2021, Papadomanolaki et al., 2019); see the change-vector sketch after this list. ConvLSTM-augmented networks outperform bi-temporal-only approaches, with F1 gains of up to +1.5 pp (Papadomanolaki et al., 2019).
- Agricultural Potential Mapping: Monthly Sentinel-2 cubes are used for pixel-wise ordinal regression on viticulture, market gardening, and field crops (Sakka et al., 13 Jun 2025). Multi-label and spatio-temporal (3D-CNN, ConvLSTM) tasks are supported; baseline UNet accuracy is enhanced using ordinal targets.
- Super-Resolution: Multi-temporal fusion recovers fine spatial structure at 2.5–3.3 m GSD by merging temporal sequences with recursive fusion and prior-informed deep SISR backbones (SEN4X, DeepSent, SPInet) (Retnanto et al., 30 May 2025, Tarasiewicz et al., 2023, Valsesia et al., 2022, Okabayashi et al., 2024). Multi-modal super-resolved segmentation at 2.5 m (SPInet) achieves MCC=0.802–0.862, outperforming standard CNN baselines by +0.119 MCC (Valsesia et al., 2022). Temporal attention and permutation invariance increase robustness to date order and cloud noise.
- Semantic Segmentation with Pre-trained Backbones: Latent space temporal-max fusion yields +5–17 % mIoU improvement over single-image or output-fusion approaches using SWIN, U-Net, or ViT pre-trained architectures (Jindgar et al., 2024, Dimitrovski et al., 2024).
- Invasive Species Monitoring: Multi-seasonal feature engineering offers accuracy comparable to high-resolution aerial imagery, with Sentinel-2 model M76* (OA=68 %, κ=0.55) slightly outperforming the aerial reference (OA=67 %, κ=0.52). NDVI, EVI, SAVI, NDWI, IRECI, TDVI, NLI, and MNLI computed per season, plus texture metrics, form the feature basis (Sultana et al., 12 Dec 2025).
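A minimal sketch of unsupervised change vector analysis over per-pixel features (raw spectral bands or pretrained deep features): the per-pixel Euclidean magnitude of the feature difference is thresholded into a binary change mask. The percentile threshold stands in for whatever thresholding rule a given pipeline uses.

```python
import numpy as np

def change_map(feat_t1: np.ndarray, feat_t2: np.ndarray, pct: float = 95.0):
    """Binary change mask from two (H, W, D) per-pixel feature images."""
    mag = np.linalg.norm(feat_t2 - feat_t1, axis=-1)   # (H, W) change magnitude
    return mag > np.percentile(mag, pct)               # illustrative threshold

rng = np.random.default_rng(2)
f1 = rng.random((64, 64, 8))
f2 = f1.copy()
f2[20:30, 20:30] += 0.8                  # simulate a changed 10x10 region
print(change_map(f1, f2).sum())          # roughly the changed-pixel count
```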
5. Quantitative Findings and Comparative Performance
A sampling of representative quantitative results is presented for quick reference.
| Application | Model/Method | mIoU / OA / F1 / MCC | Dataset / Region | Notable Finding |
|---|---|---|---|---|
| Land cover mapping | HOb2sRNN (S2-only) | F1=78.7–87.6 % | Reunion, Senegal | Multi-source fusion: +1 pp F1 |
| Land cover segmentation | M³Fusion GRU+att + CNN | OA=90.7 % | Reunion | Fusion head: +3 pp OA over RF |
| Crop classification | Pixel R-CNN (LSTM+CNN) | OA=96.5 % | North Italy | +20 pp above RF/SVM/XGBoost |
| Field boundary delineation | UNet (NDVI stack) | IoU=0.74 | Netherlands, Pakistan | NDVI temporal stacking: +5–8 pp IoU |
| Change detection | U-Net+ConvLSTM | OA=96 % / F1=57.78 % | OSCD urban scenes | 5 dates w/ ConvLSTM: +1.5 pp F1 vs bi-temporal |
| Urban mapping (cloud cover) | U-Net (S2+S1, SAR-based reconstruction) | F1=0.423 | SpaceNet-7, 14 sites | Retains S2 features via SAR reconstruction |
| Semantic segmentation | FLAIR U-TAE branch | mIoU=39.68 % | France (IGN FLAIR) | Best when fused with aerial VHR |
| Super-resolution segmentation | SPInet (PIUnet+MRF, 2.5 m SR mask) | MCC=0.802 | AI4EO Italy | +0.12 MCC vs DeepLabv3 |
| HR SR for urban mapping | SEN4X (MISR+SISR) | mIoU_macro=51.6 % | Hanoi, Vietnam | +2.7 pp mIoU (SISR), +12.9 pp (MISR) |
| Invasive grass species | S2 RF (multi-season/phenology: M76*) | OA=68 %, κ=0.55 | Victoria, Australia | Slightly outperforms best aerial |
6. Best Practices, Limitations, and Future Directions
- Best Practices:
- Normalize input reflectances to [0, 1] and filter cloud-contaminated scenes; see the preprocessing sketch after this list.
- Aggregate input time series by object, patch, or context window (e.g., 128×128 pixels).
- Prefer deep temporal architectures (FCGRU+attention, ConvLSTM, temporal transformers) with supplementary attention or hierarchical pretraining for limited-label regimes (Gbodjo et al., 2019, Martini et al., 2021).
- For fusion, latent-space temporal max-pooling, recursive multi-image fusion, and permutation-invariant mean pooling are recommended.
- For operational mapping, object-based multi-temporal S2+S1 fusion with an attention mechanism is efficient (Gbodjo et al., 2019).
- Multi-temporal NDVI stacking for boundary extraction leverages phenological cues better than raw bands, with reduced compute (Zahid et al., 2024).
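A preprocessing sketch matching the first two practices above, assuming the standard Sentinel-2 L2A convention that digital numbers equal reflectance × 10000; the function name and non-overlapping tiling scheme are illustrative.

```python
import numpy as np

def preprocess(cube: np.ndarray, patch: int = 128) -> np.ndarray:
    """(T, H, W, B) raw DN time series -> (N, T, patch, patch, B) in [0, 1]."""
    refl = np.clip(cube / 10000.0, 0.0, 1.0)      # DN -> reflectance in [0, 1]
    T, H, W, B = refl.shape
    patches = [
        refl[:, i:i + patch, j:j + patch, :]      # non-overlapping tiles
        for i in range(0, H - patch + 1, patch)
        for j in range(0, W - patch + 1, patch)
    ]
    return np.stack(patches)

cube = np.random.default_rng(3).integers(0, 10000, (4, 256, 256, 4)).astype(float)
print(preprocess(cube).shape)  # (4, 4, 128, 128, 4)
```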
- Limitations:
- Sentinel-2 spatial resolution constrains detection of sub-pixel objects (roads, narrow field boundaries); super-resolution or modal fusion partially addresses this.
- Geographic or phenological domain gaps degrade cross-region model transfer; domain-adversarial training alleviates but does not eliminate mismatch (Martini et al., 2021).
- Monthly averaging may undersample rapid events and blur phenology; finer temporal grids are preferable where computational resources permit.
- Object-based MLP/SVM baselines approach deep-model performance in label-scarce regimes but fail to match multi-modal RNNs.
- Future Directions:
- Longer time series (5–45 dates) for improved modeling of phenological cycles, weighted against compute cost.
- Advanced temporal encoders: deep transformers, attention-unified ConvLSTM/self-attention hybrids.
- Joint diffusion, adversarial, and spectral-angle mapper losses to balance fidelity and perceptual realism in SR (Okabayashi et al., 2024, Retnanto et al., 30 May 2025).
- Incorporation of active learning, semi-supervised labeling, or topological priors for functional field delineation (Zahid et al., 2024).
- Fusion with SAR, VHR, or planetary data for domain-invariant feature reuse and enhanced robustness (Hafner et al., 2023, Garioud et al., 2023, Dimitrovski et al., 2024).
- Expanded use for continuous variables (crop yield, density) and irregular geographic domains (Sakka et al., 13 Jun 2025, Sultana et al., 12 Dec 2025).
Multi-temporal Sentinel-2 imagery forms the backbone of modern remote sensing pipelines, enabling rich statistical, deep learning, and multi-modal fusion approaches for accurate, scalable Earth surface monitoring. Multiple sequential acquisitions offer critical temporal cues for both discrete and continuous mapping tasks, rendering simple single-date/pixel approaches obsolete for most practical applications.