EcoWikiRS: Ecological Vision-Language Dataset
- EcoWikiRS is a large-scale ecological vision–language dataset combining 91,801 aerial image tiles with curated Wikipedia habitat sentences for RS ecosystem classification.
- It implements a weak supervision strategy using a bag-of-sentences approach and the WINCEL loss to robustly align visual features with ecological language cues.
- Empirical evaluations demonstrate significant improvements in zero-shot classification accuracy on Swiss ecosystems compared to baseline models.
EcoWikiRS is a vision–language dataset and training framework introduced to infuse ecological knowledge into high-resolution aerial image understanding by aligning remote sensing (RS) images with species habitat descriptions derived from Wikipedia. It systematically leverages crowd-sourced species occurrences and domain-rich language supervision, enabling scalable and semantically informed learning for RS vision–language models (RS-VLMs) in ecological applications (Zermatten et al., 28 Apr 2025).
1. Dataset Construction and Composition
EcoWikiRS was assembled over the territory of Switzerland using N = 91 801 aerial image tiles, each of 100 m × 100 m, sourced from the Federal Office of Topography’s swissIMAGE product. The raw imagery is at 10 cm spatial resolution but was down-sampled to 50 cm for computational efficiency. Each tile is georeferenced and aligned with the European Nature Information System (EUNIS) ecosystem type map, thereby providing ground-truth habitat labels at the 100 m scale.
Species observations are drawn from the Global Biodiversity Information Facility (GBIF), covering the Animalia and Plantae taxa recorded in Switzerland from 1950 to 2024. After rigorous quality control—removing records with uncertain geolocation (>100 m), incomplete taxonomy, rounded coordinates, no associated English Wikipedia article, or duplicate entries—274 241 unique occurrences of 2 745 species remain. Each observation is assigned to the corresponding aerial tile, and only tiles with at least one species observation are retained.
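The quality-control filters above can be sketched as a simple record-by-record pass. This is an illustrative sketch only: the field names (`uncertainty_m`, `species`, `lat`, `lon`, `wiki_url`) are assumptions, not the actual GBIF column names or the authors' pipeline.

```python
def filter_occurrences(records):
    """Keep occurrence records passing the quality checks described above.

    Hypothetical record fields: uncertainty_m, species, lat, lon, wiki_url.
    """
    seen = set()
    kept = []
    for r in records:
        if r.get("uncertainty_m") is None or r["uncertainty_m"] > 100:
            continue  # uncertain geolocation (> 100 m)
        if not r.get("species"):
            continue  # incomplete taxonomy
        if r["lat"] == int(r["lat"]) and r["lon"] == int(r["lon"]):
            continue  # suspiciously rounded coordinates (integer degrees)
        if not r.get("wiki_url"):
            continue  # no associated English Wikipedia article
        key = (r["species"], r["lat"], r["lon"])
        if key in seen:
            continue  # duplicate observation
        seen.add(key)
        kept.append(r)
    return kept
```

Each retained record would then be snapped to its enclosing 100 m × 100 m tile.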
For every species, the English Wikipedia article (2025 snapshot) is parsed after discarding non-informative sections. Sentences are extracted and categorized as follows:
- Habitat sentences: From sections titled with “habitat,” “ecology,” “distribution,” or “range.”
- Keyword sentences: Containing at least one of ≈200 preselected ecology-related keywords.
- Random sentences: All unfiltered sentences.
- Species names: Instances of the binomial name.
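The categorization above can be sketched as a per-sentence labeling function. The section titles and overlap logic follow the description in the text; the helper name and the (tiny) keyword set are illustrative stand-ins for the ≈200 keywords used in the paper.

```python
import re

HABITAT_SECTIONS = {"habitat", "ecology", "distribution", "range"}
ECO_KEYWORDS = {"forest", "grassland", "wetland", "alpine", "meadow"}  # ~200 in the paper

def categorize(sentence, section_title, binomial):
    """Return the category labels applying to one Wikipedia sentence."""
    labels = {"random"}  # every extracted sentence enters the random pool
    if any(s in section_title.lower() for s in HABITAT_SECTIONS):
        labels.add("habitat")
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & ECO_KEYWORDS:
        labels.add("keyword")
    if binomial.lower() in sentence.lower():
        labels.add("species_name")
    return labels
```

Note that the categories overlap: a habitat-section sentence containing an ecology keyword contributes to both pools, consistent with the instance counts reported below.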
Dataset statistics are outlined below:
| Category | Sentence Instances | Unique Sentences |
|---|---|---|
| Habitat sentences | 2 998 305 | 18 693 |
| Keyword sentences | 3 728 644 | 21 832 |
| Random sentences | 19 642 735 | 103 065 |
On average, a tile links to ten habitat sentences, fourteen keyword sentences, or seventy random sentences. To ensure spatial independence in evaluation, the dataset is partitioned into train (60 %), validation (10 %), and test (30 %) splits using a 20 km spatial block split.
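A block split of this kind can be sketched by bucketing tile coordinates into 20 km cells and assigning whole cells to splits. The random assignment of blocks to splits is an assumption for illustration; the paper only specifies the 60/10/30 proportions and the 20 km block size.

```python
import random

def block_split(tiles, block_km=20, seed=0):
    """Assign (x_m, y_m) tile coordinates to train/val/test by spatial block.

    All tiles falling in the same block_km x block_km cell share a split,
    preventing spatial leakage between splits.
    """
    block_m = block_km * 1000
    blocks = sorted({(x // block_m, y // block_m) for x, y in tiles})
    rng = random.Random(seed)
    rng.shuffle(blocks)
    n = len(blocks)
    n_train, n_val = int(0.6 * n), int(0.1 * n)
    split_of = {}
    for i, b in enumerate(blocks):
        split_of[b] = "train" if i < n_train else "val" if i < n_train + n_val else "test"
    return {(x, y): split_of[(x // block_m, y // block_m)] for x, y in tiles}
```

Because assignment happens at the block level, two adjacent 100 m tiles can never straddle the train/test boundary unless they sit in different 20 km blocks.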
2. Weak Supervision and Ecological Alignment
EcoWikiRS embraces a setting of weak and noisy supervision. Each image tile inherits all habitat sentences for every observed species, introducing two principal noise sources:
- False positives: Sentences irrelevant to observable features in the image, commonly due to broad-niche or generalist species.
- False negatives: Sentences describing attributes shared by multiple tiles, which may be erroneously considered “negatives” during batch-based learning.
Instead of a single sentence per image, EcoWikiRS retains a “bag” of up to K = 15 sentences per tile and learns instance-specific sentence weights reflecting similarity to the visual content. This strategy is motivated by the inherent many-to-many relationships and label noise characteristic of ecological data.
3. WINCEL Loss: Weighted InfoNCE for Noisy Alignment
To robustly align visual features from RS imagery with ecological language cues under severe supervision noise, EcoWikiRS introduces WINCEL, a weighted extension of the InfoNCE contrastive loss. Let $v_b$ denote the vision encoder embedding of tile image $b$, and $t_{b,1}, \dots, t_{b,K}$ the text embeddings of its up to $K$ associated sentences.

For each tile, sentence weights are computed by a softmax over image–sentence similarities:

$$w_{b,i} = \frac{\exp(\langle v_b, t_{b,i} \rangle / \tau_w)}{\sum_{j=1}^{K} \exp(\langle v_b, t_{b,j} \rangle / \tau_w)},$$

where $\tau_w$ is a temperature parameter. The weighted text embedding is then:

$$\bar{t}_b = \sum_{i=1}^{K} w_{b,i}\, t_{b,i}.$$

The WINCEL loss applied over a batch of $B$ tiles is:

$$\mathcal{L}_{\text{WINCEL}} = -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp(\langle v_b, \bar{t}_b \rangle / \tau)}{\sum_{b'=1}^{B} \exp(\langle v_b, \bar{t}_{b'} \rangle / \tau)},$$

with $\tau$ the contrastive temperature.
WINCEL down-weights spurious or irrelevant sentences for each tile on the fly and demonstrates empirical superiority over vanilla InfoNCE and alternative noise-mitigation techniques (such as bootstrapping or top-k sampling). This formulation is particularly adapted to the bag-of-weak-labels context found in ecological VLMs (Zermatten et al., 28 Apr 2025).
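A minimal numpy sketch of this weighting scheme follows, assuming L2-normalized embeddings (so dot products are cosine similarities) and a padding mask for bags with fewer than K sentences. Re-normalizing the weighted text embedding is an implementation choice here, not something stated in the paper.

```python
import numpy as np

def wincel_loss(V, T, mask, tau_w=0.07, tau=0.07):
    """Sketch of a weighted InfoNCE over sentence bags.

    V: (B, d) image embeddings, L2-normalized.
    T: (B, K, d) sentence embeddings, L2-normalized.
    mask: (B, K), 1 for real sentences, 0 for zero-padding.
    """
    # Per-tile sentence weights: softmax over image-sentence similarity,
    # with padded slots masked out.
    sims = np.einsum("bd,bkd->bk", V, T) / tau_w
    sims = np.where(mask > 0, sims, -np.inf)
    w = np.exp(sims - sims.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted text embedding per tile (re-normalized: an assumption).
    t_bar = np.einsum("bk,bkd->bd", w, T)
    t_bar /= np.linalg.norm(t_bar, axis=1, keepdims=True)
    # InfoNCE over the batch, with each tile's weighted embedding as positive.
    logits = V @ t_bar.T / tau                      # (B, B)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Because the weights are recomputed from the current vision embedding at every step, sentences that poorly match the tile are suppressed on the fly rather than discarded ahead of time.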
4. Model Backbones, Training Paradigm, and Implementation
EcoWikiRS investigations use four ViT-B/32-based backbones:
- CLIP (trained on LAION-2B)
- RemoteCLIP
- SkyCLIP
- GeoRSCLIP
Each model pairs a ViT-B/32 vision encoder with a text encoder. The training regimen freezes the text side and fine-tunes only the vision encoder's positional encodings and final projection head, balancing stability and adaptation for ecological tasks.
Key ingredients include:
- Optimizer: AdamW with stepwise learning-rate decay (×0.95 every 2 epochs)
- Batch size: 256
- Epochs: 60
- Data augmentation: random flip, rotation, center crop, color jitter
- Temperatures: selected by grid search, separately for InfoNCE and WINCEL
- Sentence bag size: up to K = 15 sentences per tile (zero-padded as necessary)
- Implementation: PyTorch, OpenCLIP, trained on NVIDIA V100/A100 GPUs
5. Evaluation Protocols and Empirical Performance
Zero-shot ecosystem classification is evaluated on 25 EUNIS Level-2 habitat categories. During inference, the frozen vision encoder outputs a tile embedding, which is then compared (by cosine similarity) to the class name prompts’ textual embeddings; the highest similarity determines the predicted class. Metrics reported are overall accuracy (OA) and macro-averaged F1.
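The inference step reduces to a nearest-prompt lookup in embedding space. The sketch below assumes precomputed class-prompt text embeddings; prompt wording and helper names are illustrative.

```python
import numpy as np

def zero_shot_predict(img_emb, class_text_embs):
    """Return the index of the class whose prompt embedding has the highest
    cosine similarity to the tile embedding."""
    v = img_emb / np.linalg.norm(img_emb)
    T = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(T @ v))
```

With 25 EUNIS Level-2 categories, `class_text_embs` would be a (25, d) matrix of frozen text-encoder outputs for the class name prompts.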
Baselines span:
- (a) Pretrained zero-shot
- (b) Fine-tuning on EcoWikiRS with vanilla InfoNCE
- (c) Fine-tuning with WINCEL
- (d) Fully supervised upper bound (SkyCLIP image encoder, cross-entropy on EUNIS labels)
Representative results:
| Backbone | OA Pretrained | OA InfoNCE | OA WINCEL | OA Supervised |
|---|---|---|---|---|
| CLIP | 14.7% | 25.3% | 30.9% | — |
| SkyCLIP | 19.2% | 27.1% | 30.1% | 52.7% |
WINCEL yields consistent gains over InfoNCE and baseline pretrained models, especially in the presence of label noise (Zermatten et al., 28 Apr 2025).
Ablation studies indicate:
- Habitat-section sentences outperform keyword, random, or species name sentences, highlighting the importance of domain-relevant textual cues.
- Alternative label noise reduction techniques (bootstrapping, top-k sampling, substring augmentation) offer only modest improvements over InfoNCE, remaining inferior to WINCEL.
- Fine-tuning both positional encodings and final projection head proves most effective.
Qualitative analyses show that WINCEL-trained models generate precise zero-shot similarity maps for prompts like “It prefers a warm and dry climate,” aligning well with geographic gradients or environmental features. In cross-modal retrieval, WINCEL models favor ecologically pertinent sentences for image tiles, in contrast to pretrained models’ preference for generic or irrelevant text.
6. Accessibility and Licensing
The EcoWikiRS dataset, curated preprocessing tools, training and evaluation code, and comprehensive reproduction instructions are available under an open-source license at:
https://github.com/eceo-epfl/EcoWikiRS
Users must adhere to the original data sources' licenses: GBIF occurrences (Creative Commons Attribution-NonCommercial 4.0 International) and Wikipedia text (Wikipedia's standard reuse terms, CC BY-SA).
7. Significance and Implications
EcoWikiRS constitutes a large-scale, ecologically grounded vision–language resource with 91 000+ aerial tiles, 274 000+ species occurrences, and millions of curated habitat sentences. It is specifically configured for weak, noisy supervision and offers robust alignment between RS imagery and ecological language. The dataset enables and improves zero-shot ecosystem classification and interpretability for RS-VLMs in environmental applications. A plausible implication is that the proposed WINCEL approach can serve as a framework for other weakly supervised or noisy-label tasks in geospatial machine learning and ecological remote sensing, fostering ecologically meaningful modeling at continental scales (Zermatten et al., 28 Apr 2025).