EcoWikiRS: Ecological Vision-Language Dataset
- EcoWikiRS is a large-scale ecological vision–language dataset combining 91,801 aerial image tiles with curated Wikipedia habitat sentences for RS ecosystem classification.
- It implements a weak supervision strategy using a bag-of-sentences approach and the WINCEL loss to robustly align visual features with ecological language cues.
- Empirical evaluations demonstrate significant improvements in zero-shot classification accuracy on Swiss ecosystems compared to baseline models.
EcoWikiRS is a vision–language dataset and training framework introduced to infuse ecological knowledge into high-resolution aerial image understanding by aligning remote sensing (RS) images with species habitat descriptions derived from Wikipedia. It systematically leverages crowd-sourced species occurrences and domain-rich language supervision, enabling scalable and semantically informed learning for RS vision–language models (RS-VLMs) in ecological applications (Zermatten et al., 28 Apr 2025).
1. Dataset Construction and Composition
EcoWikiRS was assembled over the territory of Switzerland using N = 91 801 aerial image tiles, each of 100 m × 100 m, sourced from the Federal Office of Topography’s swissIMAGE product. The raw imagery is at 10 cm spatial resolution but was down-sampled to 50 cm for computational efficiency. Each tile is georeferenced and aligned with the European Nature Information System (EUNIS) ecosystem type map, thereby providing ground-truth habitat labels at the 100 m scale.
Species observations are drawn from the Global Biodiversity Information Facility (GBIF), covering the Animalia and Plantae taxa recorded in Switzerland from 1950 to 2024. After rigorous quality control—removing records with uncertain geolocation (>100 m), incomplete taxonomy, rounded coordinates, no associated English Wikipedia article, or duplicate entries—274 241 unique occurrences of 2 745 species remain. Each observation is assigned to the corresponding aerial tile, and only tiles with at least one species observation are retained.
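The quality-control filters above can be sketched as a simple record-by-record pass. This is an illustrative sketch only: the field names (`uncertainty_m`, `species`, `lat`, `lon`, `wiki_url`) are assumptions, not the actual GBIF column names or the authors' pipeline.

```python
def filter_occurrences(records):
    """Keep occurrence records passing the quality checks described above.

    Hypothetical record fields: uncertainty_m, species, lat, lon, wiki_url.
    """
    seen = set()
    kept = []
    for r in records:
        if r.get("uncertainty_m") is None or r["uncertainty_m"] > 100:
            continue  # uncertain geolocation (> 100 m)
        if not r.get("species"):
            continue  # incomplete taxonomy
        if r["lat"] == int(r["lat"]) and r["lon"] == int(r["lon"]):
            continue  # suspiciously rounded coordinates (integer degrees)
        if not r.get("wiki_url"):
            continue  # no associated English Wikipedia article
        key = (r["species"], r["lat"], r["lon"])
        if key in seen:
            continue  # duplicate observation
        seen.add(key)
        kept.append(r)
    return kept
```

Each retained record would then be snapped to its enclosing 100 m × 100 m tile.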
For every species, the English Wikipedia article (2025 snapshot) is parsed after discarding non-informative sections. Sentences are extracted and categorized as follows:
- Habitat sentences: From sections titled with “habitat,” “ecology,” “distribution,” or “range.”
- Keyword sentences: Containing at least one of ≈200 preselected ecology-related keywords.
- Random sentences: All unfiltered sentences.
- Species names: Instances of the binomial name.
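The categorization above can be sketched as a per-sentence labeling function. The section titles and overlap logic follow the description in the text; the helper name and the (tiny) keyword set are illustrative stand-ins for the ≈200 keywords used in the paper.

```python
import re

HABITAT_SECTIONS = {"habitat", "ecology", "distribution", "range"}
ECO_KEYWORDS = {"forest", "grassland", "wetland", "alpine", "meadow"}  # ~200 in the paper

def categorize(sentence, section_title, binomial):
    """Return the category labels applying to one Wikipedia sentence."""
    labels = {"random"}  # every extracted sentence enters the random pool
    if any(s in section_title.lower() for s in HABITAT_SECTIONS):
        labels.add("habitat")
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & ECO_KEYWORDS:
        labels.add("keyword")
    if binomial.lower() in sentence.lower():
        labels.add("species_name")
    return labels
```

Note that the categories overlap: a habitat-section sentence containing an ecology keyword contributes to both pools, consistent with the instance counts reported below.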
Dataset statistics are outlined below:
| Category | Sentence Instances | Unique Sentences |
|---|---|---|
| Habitat sentences | 2 998 305 | 18 693 |
| Keyword sentences | 3 728 644 | 21 832 |
| Random sentences | 19 642 735 | 103 065 |
On average, a tile links to ten habitat sentences, fourteen keyword sentences, or seventy random sentences. To ensure spatial independence in evaluation, the dataset is partitioned into train (60 %), validation (10 %), and test (30 %) splits using a 20 km spatial block split.
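A block split of this kind can be sketched by bucketing tile coordinates into 20 km cells and assigning whole cells to splits. The random assignment of blocks to splits is an assumption for illustration; the paper only specifies the 60/10/30 proportions and the 20 km block size.

```python
import random

def block_split(tiles, block_km=20, seed=0):
    """Assign (x_m, y_m) tile coordinates to train/val/test by spatial block.

    All tiles falling in the same block_km x block_km cell share a split,
    preventing spatial leakage between splits.
    """
    block_m = block_km * 1000
    blocks = sorted({(x // block_m, y // block_m) for x, y in tiles})
    rng = random.Random(seed)
    rng.shuffle(blocks)
    n = len(blocks)
    n_train, n_val = int(0.6 * n), int(0.1 * n)
    split_of = {}
    for i, b in enumerate(blocks):
        split_of[b] = "train" if i < n_train else "val" if i < n_train + n_val else "test"
    return {(x, y): split_of[(x // block_m, y // block_m)] for x, y in tiles}
```

Because assignment happens at the block level, two adjacent 100 m tiles can never straddle the train/test boundary unless they sit in different 20 km blocks.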
2. Weak Supervision and Ecological Alignment
EcoWikiRS embraces a setting of weak and noisy supervision. Each image tile inherits all habitat sentences for every observed species, introducing two principal noise sources:
- False positives: Sentences irrelevant to observable features in the image, commonly due to broad-niche or generalist species.
- False negatives: Sentences describing attributes shared by multiple tiles, which may be erroneously considered “negatives” during batch-based learning.
Instead of a single sentence per image, EcoWikiRS retains a “bag” of up to K = 15 sentences per tile and learns instance-specific sentence weights reflecting similarity to the visual content. This strategy is motivated by the inherent many-to-many relationships and label noise characteristic of ecological data.
3. WINCEL Loss: Weighted InfoNCE for Noisy Alignment
To robustly align visual features from RS imagery with ecological language cues under severe supervision noise, EcoWikiRS introduces WINCEL, a weighted extension of the InfoNCE contrastive loss. Let $v_b$ denote the vision encoder embedding of tile image $b$, and $t_{b,1}, \dots, t_{b,K}$ the text embeddings of its up to $K$ associated sentences.

For each tile, sentence weights are computed by a softmax over image–sentence similarities:

$$w_{b,i} = \frac{\exp(\langle v_b, t_{b,i} \rangle / \tau_w)}{\sum_{j=1}^{K} \exp(\langle v_b, t_{b,j} \rangle / \tau_w)},$$

where $\tau_w$ is a temperature parameter. The weighted text embedding is then:

$$\bar{t}_b = \sum_{i=1}^{K} w_{b,i}\, t_{b,i}.$$

The WINCEL loss applied over a batch of $B$ tiles is:

$$\mathcal{L}_{\text{WINCEL}} = -\frac{1}{B} \sum_{b=1}^{B} \log \frac{\exp(\langle v_b, \bar{t}_b \rangle / \tau)}{\sum_{b'=1}^{B} \exp(\langle v_b, \bar{t}_{b'} \rangle / \tau)},$$

with $\tau$ the contrastive temperature.
WINCEL down-weights spurious or irrelevant sentences for each tile on the fly and demonstrates empirical superiority over vanilla InfoNCE and alternative noise-mitigation techniques (such as bootstrapping or top-k sampling). This formulation is particularly adapted to the bag-of-weak-labels context found in ecological VLMs (Zermatten et al., 28 Apr 2025).
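A minimal numpy sketch of this weighting scheme follows, assuming L2-normalized embeddings (so dot products are cosine similarities) and a padding mask for bags with fewer than K sentences. Re-normalizing the weighted text embedding is an implementation choice here, not something stated in the paper.

```python
import numpy as np

def wincel_loss(V, T, mask, tau_w=0.07, tau=0.07):
    """Sketch of a weighted InfoNCE over sentence bags.

    V: (B, d) image embeddings, L2-normalized.
    T: (B, K, d) sentence embeddings, L2-normalized.
    mask: (B, K), 1 for real sentences, 0 for zero-padding.
    """
    # Per-tile sentence weights: softmax over image-sentence similarity,
    # with padded slots masked out.
    sims = np.einsum("bd,bkd->bk", V, T) / tau_w
    sims = np.where(mask > 0, sims, -np.inf)
    w = np.exp(sims - sims.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted text embedding per tile (re-normalized: an assumption).
    t_bar = np.einsum("bk,bkd->bd", w, T)
    t_bar /= np.linalg.norm(t_bar, axis=1, keepdims=True)
    # InfoNCE over the batch, with each tile's weighted embedding as positive.
    logits = V @ t_bar.T / tau                      # (B, B)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Because the weights are recomputed from the current vision embedding at every step, sentences that poorly match the tile are suppressed on the fly rather than discarded ahead of time.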
4. Model Backbones, Training Paradigm, and Implementation
EcoWikiRS investigations use four ViT-B/32-based backbones:
- CLIP (trained on LAION-2B)
- RemoteCLIP
- SkyCLIP
- GeoRSCLIP
Each model pairs a ViT-B/32 vision encoder with a text encoder. The training regimen freezes the text side and fine-tunes only the vision encoder's positional encodings and final projection head, balancing stability and adaptation for ecological tasks.
Key ingredients include:
- Optimizer: AdamW with stepwise learning-rate decay (×0.95 every 2 epochs)
- Batch size: 256
- Epochs: 60
- Data augmentation: random flip, rotation, center crop, color jitter
- Temperatures: selected by grid search, separately for InfoNCE and WINCEL
- Sentence bag size: up to K = 15 sentences per tile (zero-padded as necessary)
- Implementation: PyTorch, OpenCLIP, trained on NVIDIA V100/A100 GPUs
5. Evaluation Protocols and Empirical Performance
Zero-shot ecosystem classification is evaluated on 25 EUNIS Level-2 habitat categories. During inference, the frozen vision encoder outputs a tile embedding, which is then compared (by cosine similarity) to the class name prompts’ textual embeddings; the highest similarity determines the predicted class. Metrics reported are overall accuracy (OA) and macro-averaged F1.
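The inference step reduces to a nearest-prompt lookup in embedding space. The sketch below assumes precomputed class-prompt text embeddings; prompt wording and helper names are illustrative.

```python
import numpy as np

def zero_shot_predict(img_emb, class_text_embs):
    """Return the index of the class whose prompt embedding has the highest
    cosine similarity to the tile embedding."""
    v = img_emb / np.linalg.norm(img_emb)
    T = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return int(np.argmax(T @ v))
```

With 25 EUNIS Level-2 categories, `class_text_embs` would be a (25, d) matrix of frozen text-encoder outputs for the class name prompts.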
Baselines span:
- (a) Pretrained zero-shot
- (b) Fine-tuning on EcoWikiRS with vanilla InfoNCE
- (c) Fine-tuning with WINCEL
- (d) Fully supervised upper bound (SkyCLIP image encoder, cross-entropy on EUNIS labels)
Representative results:
| Backbone | OA Pretrained | OA InfoNCE | OA WINCEL | OA Supervised |
|---|---|---|---|---|
| CLIP | 14.7% | 25.3% | 30.9% | — |
| SkyCLIP | 19.2% | 27.1% | 30.1% | 52.7% |
WINCEL yields consistent gains over InfoNCE and baseline pretrained models, especially in the presence of label noise (Zermatten et al., 28 Apr 2025).
Ablation studies indicate:
- Habitat-section sentences outperform keyword, random, or species name sentences, highlighting the importance of domain-relevant textual cues.
- Alternative label noise reduction techniques (bootstrapping, top-k sampling, substring augmentation) offer only modest improvements over InfoNCE, remaining inferior to WINCEL.
- Fine-tuning both positional encodings and final projection head proves most effective.
Qualitative analyses show that WINCEL-trained models generate precise zero-shot similarity maps for prompts like “It prefers a warm and dry climate,” aligning well with geographic gradients or environmental features. In cross-modal retrieval, WINCEL models favor ecologically pertinent sentences for image tiles, in contrast to pretrained models’ preference for generic or irrelevant text.
6. Accessibility and Licensing
The EcoWikiRS dataset, curated preprocessing tools, training and evaluation code, and comprehensive reproduction instructions are available under an open-source license at:
https://github.com/eceo-epfl/EcoWikiRS
Users must adhere to the original data sources' licenses: GBIF occurrences (Creative Commons Attribution-NonCommercial 4.0 International) and Wikipedia text (Wikipedia's standard reuse terms, CC BY-SA).
7. Significance and Implications
EcoWikiRS constitutes a large-scale, ecologically grounded vision–language resource with 91 000+ aerial tiles, 274 000+ species occurrences, and millions of curated habitat sentences. It is specifically configured for weak, noisy supervision and offers robust alignment between RS imagery and ecological language. The dataset enables and improves zero-shot ecosystem classification and interpretability for RS-VLMs in environmental applications. A plausible implication is that the proposed WINCEL approach can serve as a framework for other weakly supervised or noisy-label tasks in geospatial machine learning and ecological remote sensing, fostering ecologically meaningful modeling at continental scales (Zermatten et al., 28 Apr 2025).