
TerraIncognita: Species Discovery & DG Benchmark

Updated 16 January 2026
  • TerraIncognita is a benchmark that integrates richly annotated insect images and camera-trap wildlife data to advance open-world species discovery and domain generalization.
  • It employs hierarchical taxonomy, dynamic dataset updates, and robust OOD detection metrics to rigorously evaluate frontier vision-language models on both known and novel taxa.
  • The benchmark addresses critical ecological challenges and fosters innovations in automated species identification; the term also appears in astrophysics in the context of exoplanet surface imaging.

TerraIncognita refers to a set of benchmarks, datasets, and methodologies applied in two principal domains: open-world insect species discovery using vision-language models (VLMs), and domain generalization in computer vision for camera-trap wildlife images. The term also appears in astrophysics literature concerning exoplanet surface imaging, but is most prominently established as a challenging domain generalization benchmark in machine learning and as the recent benchmark “TerraIncognita: A Dynamic Benchmark for Species Discovery Using Frontier Models”.

1. Ecological and Technical Motivation

Insect biodiversity faces a catastrophic decline, compounded by the rapid erosion of taxonomic expertise. Over 80% of insect species remain unnamed, despite their central functional roles in ecosystem resilience, pollination, and nutrient cycling. Traditional methods—expert trapping, morphological identification, and exhaustive literature review—are unsustainable and too slow for timely conservation actions. In parallel, frontier vision–LLMs, trained on vast multimodal corpora, display promising zero-shot learning, cross-domain generalization, and language-guided reasoning capabilities. TerraIncognita, as developed in the 2025 benchmark, operationalizes open-world species discovery: evaluating how frontier VLMs can automate identification of both known and novel insect taxa directly from images, with abstention for out-of-distribution discoveries (Chiranjeevi et al., 29 May 2025).

2. Dataset Design and Construction

TerraIncognita v1.0 comprises 437 high-resolution images (~2000×1900 px) representing 200 distinct insect specimens. The dataset is partitioned into two balanced subsets:

  • Known species: 100 species (200 images), extracted from iNaturalist’s research-grade entries, hand-curated to cover eight Orders and 24 Families matching those in the “novel” set. Full taxonomic labels are provided (Order → Family → Genus → Species).
  • Novel species: 100 specimens (237 images), collected via entomological light trapping in biodiversity hotspots (Ecuador, northern Brazil, Panama cloud forests). Each is imaged at least twice; reliable labels are provided only at coarse taxonomic ranks (Order or Family).

Hierarchical coverage in v1.0: known species span 8 Orders, 24 Families, 90 Genera, 100 species; novel taxa cover 8 Orders, 42 Families, 43 Genera, with just 12 species-level assignments (indicating open-world uncertainty). Metadata adopts the Hugging Face Datasets schema, encoding both taxonomic depth and discovery status. To maintain research relevance, the dataset will be expanded and refreshed quarterly, with ~25% turnover per cycle, ensuring annual renewal and a persistent benchmarking challenge (Chiranjeevi et al., 29 May 2025).
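A record in this schema can be sketched as follows. This is a minimal illustration only: the field names and structure are assumptions for exposition, not the published Hugging Face schema.

```python
# Minimal sketch of a TerraIncognita-style metadata record; field names
# are illustrative assumptions, not the published Hugging Face schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpecimenRecord:
    image_path: str
    order: str                 # always labeled (coarsest rank)
    family: Optional[str]      # may be missing for novel taxa
    genus: Optional[str]
    species: Optional[str]
    is_novel: bool             # discovery status flag

    def finest_known_rank(self) -> str:
        """Return the deepest taxonomic rank with a reliable label."""
        for rank, value in [("species", self.species),
                            ("genus", self.genus),
                            ("family", self.family)]:
            if value is not None:
                return rank
        return "order"

known = SpecimenRecord("img_001.jpg", "Lepidoptera", "Erebidae",
                       "Hypena", "Hypena scabra", is_novel=False)
novel = SpecimenRecord("img_201.jpg", "Coleoptera", "Carabidae",
                       None, None, is_novel=True)
print(known.finest_known_rank())  # species
print(novel.finest_known_rank())  # family
```

The `finest_known_rank` helper mirrors how novel specimens carry only coarse labels (Order or Family) while known species are annotated to the species level.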

3. Formal Problem Statement and Evaluation Methodology

Let $x \in X$ denote an insect image and $T$ the taxonomy tree with ranks $L = \{1:\text{Order},\,2:\text{Family},\,3:\text{Genus},\,4:\text{Species}\}$. Models $f$ map $x$ (with an optional prompt) to hierarchical predictions $\hat{\mathbf{y}} = (\hat{y}_1, \ldots, \hat{y}_4)$, each in $Y_i \cup \{\text{Unknown}\}$:

$$\hat{\mathbf{y}} = f(x, T), \qquad \hat{y}_i = \begin{cases} \arg\max_l P(y_i = l \mid x) & \text{if confidence} \geq \tau_i \\ \text{Unknown} & \text{otherwise} \end{cases}$$

For OOD detection (novel discoveries), an open-world score $s(x)$ (e.g., negative max softmax, energy score) is used. Decision rule:

  • $s(x) < \delta$: in-distribution, predict the full taxonomy.
  • $s(x) \geq \delta$: out-of-distribution, abstain by setting $\hat{y}_j = \text{Unknown}$ for all $j \geq k$, where $k$ is the finest expert-provided rank.
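The two-part decision rule can be sketched in a few lines. The thresholds $\tau_i$ and $\delta$ are free parameters of the protocol, and the toy inputs below are illustrative assumptions:

```python
# Sketch of the hierarchical prediction and abstention rule; tau holds
# per-rank confidence thresholds and delta the OOD threshold.
RANKS = ["Order", "Family", "Genus", "Species"]

def predict_hierarchy(rank_probs, ood_score, tau, delta, abstain_from=1):
    """rank_probs: one {label: prob} dict per rank, coarse to fine.
    ood_score: open-world score s(x); larger means more likely novel.
    abstain_from: index k from which predictions are set to Unknown
    when s(x) >= delta (the finest expert-provided rank)."""
    preds = []
    for i, probs in enumerate(rank_probs):
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        preds.append(label if conf >= tau[i] else "Unknown")
    if ood_score >= delta:  # out-of-distribution: abstain at fine ranks
        for j in range(abstain_from, len(preds)):
            preds[j] = "Unknown"
    return dict(zip(RANKS, preds))

probs = [{"Lepidoptera": 0.95}, {"Erebidae": 0.60},
         {"Hypena": 0.30}, {"Hypena scabra": 0.10}]
tau = [0.5, 0.5, 0.5, 0.5]
print(predict_hierarchy(probs, ood_score=0.2, tau=tau, delta=0.7))
```

In this in-distribution example, the Genus and Species confidences fall below their thresholds and are returned as Unknown even though $s(x) < \delta$.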

Metrics:

At each taxonomic rank $i$:

$$\text{Precision } P_i = \frac{TP_i}{TP_i + FP_i}; \quad \text{Recall } R_i = \frac{TP_i}{TP_i + FN_i}; \quad \text{F1}_i = 2\cdot\frac{P_i \cdot R_i}{P_i + R_i}$$

For OOD detection, the ROC curve and AUROC are computed as the threshold $\delta$ is varied. Discovery accuracy measures the fraction of known images correctly classified and of novel images correctly abstained at the fine taxonomic ranks.
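The per-rank F1 can be sketched as follows. Macro-averaging over the labels present at a rank is an assumption here; the benchmark may aggregate differently.

```python
# Sketch of per-rank F1; macro-averaged over labels (an assumption).
def per_rank_f1(y_true, y_pred):
    """Macro-averaged F1 over the labels present at one taxonomic rank.
    Unknown predictions simply count as mismatches (false negatives)."""
    f1s = []
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Order-level toy example: two Orders, one misprediction.
truth = ["Lepidoptera", "Lepidoptera", "Coleoptera", "Coleoptera"]
preds = ["Lepidoptera", "Coleoptera", "Coleoptera", "Coleoptera"]
print(round(per_rank_f1(truth, preds), 3))  # 0.733
```

Running the same function separately at each rank (Order, Family, Genus, Species) reproduces the per-rank evaluation structure described above.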

4. Model Architectures, Experimental Results, and Systematic Failure Modes

Twelve advanced VLMs were benchmarked in a strict zero-shot regime, using two prompt styles (taxonomy labels only vs. labels with required justification):

| Rank    | Best F1 (Known) | Best F1 (Novel) | Discovery Accuracy (Novel) | AUROC (OOD) |
|---------|-----------------|-----------------|----------------------------|-------------|
| Order   | >90%            | 25–53%          | 87%+                       | 0.65–0.88   |
| Family  | 30–45%          | 4–16%           | —                          | —           |
| Genus   | 5–14%           | —               | —                          | —           |
| Species | <2%             | —               | —                          | —           |

For known specimens, models such as GPT-4.1, Qwen2-VL, and o3 reach >90% F1 at Order, with a precipitous falloff to below 2% at Species. For novel taxa, conservative abstainers (Claude-3-Opus, Grok-2-Vision, LLaMA-4-Maverick) achieve peak discovery accuracy, whereas overconfident models tend to commit erroneously at fine taxonomic ranks. AUROC values for OOD detection range from 0.65 to 0.88.

When explanations are required, systematic reasoning failures arise: hallucination of invisible morphological features, inappropriate speculation, generic or biologically implausible justifications, and excessive taxonomic overreach. Requiring justification does not improve accuracy but exposes these error modalities (Chiranjeevi et al., 29 May 2025).

5. Domain Generalization in TerraIncognita Camera-Trap Wildlife Benchmark

In computer vision, TerraIncognita additionally designates a challenging domain generalization dataset featuring 10 animal classes photographed by motion-activated camera-traps at four distinct locations: L100, L38, L43, L46. Extremes of domain shift—variations in background, flora, weather, and camera angle—yield abrupt drops in accuracy when standard empirical risk minimization (ERM) models are deployed to unseen locations.
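Evaluation on this benchmark follows the standard leave-one-domain-out protocol: train on three camera-trap locations and test on the held-out fourth. The split logic can be sketched as:

```python
# Leave-one-location-out splits for the TerraIncognita DG protocol.
LOCATIONS = ["L100", "L38", "L43", "L46"]

def leave_one_out_splits(locations=LOCATIONS):
    """Yield (train_locations, test_location) pairs: train on three
    camera-trap sites, evaluate on the unseen fourth."""
    for held_out in locations:
        yield [loc for loc in locations if loc != held_out], held_out

for train, test in leave_one_out_splits():
    print(f"train: {train}  ->  test: {test}")
```

Reported DG accuracies are typically averaged over all four held-out locations.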

Recent DG algorithms demonstrate marked improvements:

  • Test-time Adaptation via Self-Training with Nearest Neighbor Information (TAST): Employs an ensemble of BN adaptation heads, prototypes, and nearest-neighbor embedding aggregation to smooth pseudo-labels and adapt at inference, yielding 42.64% accuracy (+2% vs ERM) (Jang et al., 2022).
  • Stylized Dream: Applies AdaIN-based style perturbation with consistency regularization to penalize texture bias, attaining state-of-the-art 51.1% accuracy (+5 pts over ERM) (Heo et al., 2023).
  • High-Rate Mixout: Integrates stochastic replacement of fine-tuned weights with pre-trained parameters ($p=0.8$ for ResNet, $p=0.9$ for ViT), advancing OOD accuracy to 58.42% (ResNet-50) and 47.16% (ViT-S/16), with substantial computational savings (Aminbeidokhti et al., 8 Oct 2025).
  • Sharpness-Aware Gradient Matching (SAGM): Optimizes loss minimization and flatness (gradient alignment), achieving 48.8% accuracy on TerraIncognita (Wang et al., 2023).
  • MADG (Margin-based Adversarial Domain Generalization): Enforces classifier-aware margin alignment among all source-domain pairs, outperforming adversarial and invariant-risk baselines (53.7% accuracy) (Dayal et al., 2023).
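Of these, High-Rate Mixout has a particularly simple core operation: each fine-tuned parameter is stochastically reset to its pre-trained value with probability $p$. A minimal elementwise sketch, omitting the dropout-style rescaling of the full Mixout formulation and operating on flat lists for illustration:

```python
import random

def mixout(finetuned, pretrained, p, seed=0):
    """With probability p, replace each fine-tuned weight by its
    pre-trained value (the high-rate regime uses p around 0.8-0.9).
    Sketch only: full Mixout also rescales, analogously to dropout."""
    rng = random.Random(seed)
    return [w0 if rng.random() < p else w
            for w, w0 in zip(finetuned, pretrained)]

finetuned = [0.9, 1.1, -0.3, 0.4]
pretrained = [1.0, 1.0, 0.0, 0.0]
assert mixout(finetuned, pretrained, p=0.0) == finetuned   # keep all fine-tuned
assert mixout(finetuned, pretrained, p=1.0) == pretrained  # reset all weights
```

The high replacement rate keeps the model close to its pre-trained representation, which is the intuition behind its OOD gains on TerraIncognita.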

6. Astrophysical Context: Light-Curve Inversion and Exoplanet Terra Incognita

In astrophysics, Exoplanet Terra Incognita refers to inverse cartography of unresolved exoplanets using multi-band photometric time series (“light curves”). The core methodology recovers surface albedo maps $a(\theta,\phi;\lambda)$ by regularized inversion:

$$F(t;\lambda) = \int_{\Omega_{\text{vis}}(t)} a(\theta,\phi;\lambda)\, P(\alpha(t),\theta,\phi)\, I_*(\lambda)\, \cos\zeta\, d\Omega$$

Spherical harmonics expansion and regularized kernel inversion enable recovery of planetary terrains, spectral diagnostics of geology, biosignature detection (e.g., vegetation “red edge”), and the search for technosignatures (AMS). Application to proximate targets (e.g., Proxima b) requires SNR > 20–30 per phase sample and dense multi-phase coverage. Future gains are projected from 20–100 m class telescopes (ELF, LUVOIR), solar gravitational lens missions, and direct in situ imaging (Berdyugina et al., 2018).
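Discretized over surface pixels, the forward model above becomes linear, $F = K a$, so a regularized map estimate follows from Tikhonov-damped least squares. A toy NumPy sketch with a synthetic kernel; the sizes, noise level, and regularization strength are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, n_pixels = 60, 20             # phase samples x surface pixels (toy)
K = rng.random((n_times, n_pixels))    # discretized visibility/phase kernel
a_true = rng.random(n_pixels)          # "true" albedo map
F = K @ a_true + 0.01 * rng.standard_normal(n_times)  # noisy light curve

lam = 1e-2                             # Tikhonov regularization strength
# Solve (K^T K + lam I) a = K^T F for the regularized albedo estimate.
a_hat = np.linalg.solve(K.T @ K + lam * np.eye(n_pixels), K.T @ F)
print("max reconstruction error:", np.abs(a_hat - a_true).max())
```

In the real problem the kernel encodes visibility, phase geometry, and stellar illumination, and is far more ill-conditioned, which is why dense multi-phase coverage and high SNR are required.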

7. Future Directions and Benchmark Evolution

For insect biodiversity discovery, TerraIncognita will undergo systematic quarterly updates, refreshing 25% of the dataset with new, field-collected specimens (estimated 400+ images per cycle) to sustain the open-world challenge and longitudinal benchmarking utility. All data, code, evaluation scripts, and logs are made available at https://baskargroup.github.io/TerraIncognita/ and on Hugging Face.

The evolving benchmark supports community model submissions and encourages submission of novel architectures in the strict zero-shot VLM regime. This regular expansion and dynamic difficulty advances both ecological research and frontier machine learning, facilitating precise measurement of progress in automated species discovery and biodiversity assessment (Chiranjeevi et al., 29 May 2025). For camera-trap generalization, future research is poised to exploit further advances in domain-adaptive learning, meta-regularization, and explainability to address unparalleled feature diversity and domain shift.
