Open-World Ecological Taxonomy

Updated 16 January 2026

Open-world ecological taxonomy classification encompasses computational models that assign specimens to taxonomic categories within a hierarchy while handling species never seen during training.
These frameworks combine embedding-based classifiers, vision-language alignment, and hyperbolic embeddings to address class imbalance, out-of-distribution (OOD) detection, and long-tailed data distributions.
Practical applications include large-scale biodiversity monitoring, species discovery, and habitat assessment; methods are evaluated on benchmarks such as GlobalGeoTree and TerraIncognita.

Open-world ecological taxonomy classification is the set of computational frameworks and models that assign specimens to taxonomic categories under the dual constraints of hierarchical structure (e.g., order, family, genus, species) and open-set conditions—where many test inputs may belong to classes (taxa) never seen during training. This paradigm underpins large-scale biodiversity assessment, species discovery, and ecological monitoring, and must address long-tailed distributions, out-of-distribution (OOD) detection, class imbalance, and spatiotemporal domain shifts. The field synthesizes model architectures, learning objectives, open-set risk measures, and evaluation protocols to allow AI-driven systems to assign known labels when warranted, abstain or flag novelty appropriately, and—where possible—cluster or explain putative novel taxa.

1. Formal Problem Definition and Core Principles

Open-world ecological taxonomy classification requires simultaneous handling of (1) taxonomic hierarchy and (2) open-set recognition. Let $T$ be a taxonomy tree (ranks: phylum $\rightarrow$ class $\rightarrow$ order $\rightarrow$ family $\rightarrow$ genus $\rightarrow$ species), let $D = \{(x_i, y_i)\}_{i=1}^N$ be the training set with $y_i \in K_{known} \subset T$, and let $K_{unknown} = T \setminus K_{known}$ denote the taxa never seen during training.

At inference, each specimen $x$ must be assigned a set of hierarchical labels $(\hat y^{(1)}, \ldots, \hat y^{(L)})$, where any $\hat y^{(l)}$ at level $l$ may be “Unknown.” Open-world models must (a) perform precise classification for $y \in K_{known}$, (b) reject or abstain on $y \in K_{unknown}$, and (c) often estimate confidence or uncertainty to inform partial or set-valued predictions.
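The per-rank labelling with an “Unknown” option can be sketched as follows. This is a minimal illustration, not a method from the cited works: the rank list `RANKS`, the function `hierarchical_predict`, the per-rank thresholds, and the rule that an abstention at one rank forces “Unknown” at all finer ranks are assumptions made for the example.

```python
# Hypothetical rank order, coarsest to finest (a subset of the ranks in T).
RANKS = ["order", "family", "genus", "species"]

def hierarchical_predict(rank_probs, thresholds):
    """Return one label per rank, abstaining with "Unknown" when confidence is low.

    rank_probs: dict rank -> dict label -> probability (one classifier head per rank)
    thresholds: dict rank -> minimum confidence needed to commit at that rank
    Assumption: once a rank abstains, every finer rank is also "Unknown",
    so the output is always a consistent path prefix in the taxonomy.
    """
    labels = []
    abstained = False
    for rank in RANKS:
        probs = rank_probs[rank]
        best = max(probs, key=probs.get)
        if abstained or probs[best] < thresholds[rank]:
            abstained = True
            labels.append("Unknown")
        else:
            labels.append(best)
    return labels

# Illustrative (made-up) probabilities for a single specimen.
example_probs = {
    "order": {"Coleoptera": 0.97, "Diptera": 0.03},
    "family": {"Carabidae": 0.81, "Scarabaeidae": 0.19},
    "genus": {"Calosoma": 0.58, "Carabus": 0.42},
    "species": {"Calosoma scrutator": 0.90, "Calosoma sycophanta": 0.10},
}
example_thresholds = {rank: 0.7 for rank in RANKS}
example_labels = hierarchical_predict(example_probs, example_thresholds)
```

In this example the genus head is not confident enough, so the specimen is labelled to family and flagged as “Unknown” at genus and species, even though the species head on its own is confident.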

Accuracy is measured at each rank $l$ via macro-averaged F1, and open-set detection is quantified by the true negative rate (TNR) at fixed true positive rate (TPR), open-set area under ROC (AUROC), or by “discovery accuracy”—the fraction of true unknowns correctly flagged as such (Low et al., 22 Dec 2025, Chiranjeevi et al., 29 May 2025).
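Two of the open-set measures above can be computed directly from confidence scores. The sketch below assumes that higher scores mean “more likely known” and that known-class inputs are the positives; the function names and the toy scores are hypothetical.

```python
import math

def tnr_at_tpr(known_scores, unknown_scores, tpr=0.95):
    """Pick the threshold that accepts at least `tpr` of the knowns,
    then report the fraction of unknowns rejected at that threshold."""
    ranked = sorted(known_scores, reverse=True)
    k = math.ceil(tpr * len(ranked) - 1e-9)  # number of knowns that must be accepted
    tau = ranked[k - 1]
    tnr = sum(s < tau for s in unknown_scores) / len(unknown_scores)
    return tau, tnr

def open_set_auroc(known_scores, unknown_scores):
    """Probability that a random known scores above a random unknown (ties count half)."""
    wins = 0.0
    for k in known_scores:
        for u in unknown_scores:
            if k > u:
                wins += 1.0
            elif k == u:
                wins += 0.5
    return wins / (len(known_scores) * len(unknown_scores))

# Made-up confidence scores for illustration only.
known = [0.99, 0.95, 0.90, 0.85, 0.60]
unknown = [0.92, 0.55, 0.40, 0.30]
tau, tnr = tnr_at_tpr(known, unknown, tpr=0.8)
auroc = open_set_auroc(known, unknown)
```

With a single global threshold, the TNR computed here coincides with the fraction of true unknowns that are flagged; benchmark-specific definitions of discovery accuracy may differ in detail.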

2. Model Architectures and Learning Strategies

Embedding-based Open-Set Classifiers

Embedding encoders $\varphi(\cdot)$ (e.g., ResNet-101, ViT-B/16) map $x$ to an $\ell_2$-normalized embedding $\hat x \in \mathbb{R}^d$.
Unit-norm class prototypes $w_j$ define cosine logits $z_j = w_j^\top \hat x$.
Margin-based architectures, e.g., AM-Softmax and its generalization (dual-margin penalization), enforce tight separation of rare (“tail”) from common (“head”) classes and calibrate open-set boundaries via class-and-sample-specific margins:

$L_{\text{ours}} = -\log \frac{\exp[s(z_y - m_y)]}{\exp[s(z_y - m_y)] + \sum_{k \neq y} \exp[s(z_k - m_k)]}$

with $m_y = m + \Delta_y$, $m_k = \Delta_k$, and $\Delta_j$ reflecting empirical class priors; $s$ is the logit scale factor (Low et al., 22 Dec 2025).
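The loss above can be evaluated for a single sample as in the following sketch. The function `dual_margin_loss` follows the formula term by term. The helper `prior_margins`, which sets $\Delta_j$ proportional to $n_j^{-1/4}$ from class counts $n_j$, is purely an illustrative assumption, since the exact form of $\Delta_j$ is not given here; the default values of $m$ and $s$ are likewise placeholders.

```python
import math

def prior_margins(class_counts, max_delta=0.2):
    """Hypothetical prior-based margins: rarer classes get larger Delta_j.
    The n_j^(-1/4) form is an assumption, not taken from the cited work."""
    n_min = min(class_counts)
    return [max_delta * (n_min / n) ** 0.25 for n in class_counts]

def dual_margin_loss(cos_logits, y, deltas, m=0.35, s=30.0):
    """-log softmax over s * (z_k - m_k), with m_y = m + Delta_y and m_k = Delta_k."""
    shifted = []
    for k, z in enumerate(cos_logits):
        margin = deltas[k] + (m if k == y else 0.0)
        shifted.append(s * (z - margin))
    # Numerically stable log-sum-exp.
    top = max(shifted)
    log_sum = top + math.log(sum(math.exp(v - top) for v in shifted))
    return log_sum - shifted[y]

# Made-up cosine logits for a three-class problem; class 0 is the true label.
logits = [0.8, 0.1, -0.2]
deltas = prior_margins([1000, 100, 10])
loss = dual_margin_loss(logits, y=0, deltas=deltas)
```

Setting `m=0` and all `deltas` to zero reduces the expression to ordinary scaled softmax cross-entropy, which is a convenient sanity check.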

Vision-Language and Multimodal Alignment

Bi-encoders (e.g., CLIP, TaxaBind, GeoTreeCLIP) align images, text (taxon descriptions), audio, location, and environmental variables in a shared space using contrastive InfoNCE or supervised contrastive loss (Sastry et al., 2024, Mu et al., 18 May 2025).
Multimodal patching and “locked” tuning propagate species-specific information across modalities while retaining generalization (Sastry et al., 2024).
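A contrastive alignment objective of the InfoNCE type can be sketched for one batch of paired embeddings. The symmetric (image-to-text plus text-to-image) form and the temperature value below follow common CLIP-style practice and are assumptions; the cited models may differ in detail, and the function names and toy vectors are hypothetical.

```python
import math

def _unit(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct column is i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)
        log_sum = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_sum - row[i]
    return total / len(logits)

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: pair i (image_emb[i], text_emb[i]) is the positive,
    every other pairing in the batch acts as a negative."""
    imgs = [_unit(v) for v in image_emb]
    txts = [_unit(v) for v in text_emb]
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    sims_t = [list(col) for col in zip(*sims)]
    return 0.5 * (_diag_cross_entropy(sims) + _diag_cross_entropy(sims_t))

# Toy 2-D embeddings: matched pairs point in similar directions.
images = [[1.0, 0.1], [0.0, 1.0]]
texts = [[0.9, 0.0], [0.1, 1.0]]
aligned_loss = info_nce(images, texts)
shuffled_loss = info_nce(images, texts[::-1])
```

The loss is small when each image is closest to its own taxon description and grows when the pairing is shuffled.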

Hyperbolic and Hierarchy-Preserving Embeddings

Hyperbolic neural networks (Poincaré ball or Lorentz model) leverage a geometry that matches the exponential growth of taxonomic trees. An entailment-cone loss explicitly enforces parent–child containment relationships, and stacking these constraints across ranks yields the stacked entailment loss:

$L_{\mathrm{SEL-intra}} = \frac{1}{\sum_{r=2}^R \mathbbm{1}_r} \sum_{r=2}^R \mathbbm{1}_r\, \mathrm{ent}(T_r, T_{r-1})$

(Gong et al., 22 Aug 2025).
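The masked average in the stacked entailment loss can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text above: rank 1 is taken to be the coarsest, $\mathrm{ent}(T_r, T_{r-1})$ is taken to penalize the rank-$r$ embedding for lying outside the cone of its rank-$(r-1)$ parent, the indicator marks ranks whose label is available, and the energy is the Poincaré-ball entailment-cone energy commonly used in the hyperbolic embedding literature with aperture constant `K`.

```python
import math

def cone_energy(parent, child, K=0.1, eps=1e-9):
    """Poincare-ball entailment-cone energy: zero when `child` lies inside the
    cone rooted at `parent`, positive otherwise (assumed form of ent)."""
    dot = sum(p * c for p, c in zip(parent, child))
    p2 = sum(p * p for p in parent)
    c2 = sum(c * c for c in child)
    diff = math.sqrt(sum((c - p) ** 2 for p, c in zip(parent, child)))
    num = dot * (1.0 + p2) - p2 * (1.0 + c2)
    den = math.sqrt(p2) * diff * math.sqrt(max(1.0 + p2 * c2 - 2.0 * dot, 0.0))
    cos_xi = max(-1.0, min(1.0, num / max(den, eps)))
    xi = math.acos(cos_xi)  # angle between the cone axis and the child
    psi = math.asin(min(1.0, K * (1.0 - p2) / max(math.sqrt(p2), eps)))  # half-aperture
    return max(0.0, xi - psi)

def stacked_entailment(rank_embs, mask, K=0.1):
    """Average cone energy over consecutive ranks r = 2..R where mask[r] is set.
    rank_embs[0] is the coarsest rank; mask[0] is unused."""
    total, count = 0.0, 0
    for r in range(1, len(rank_embs)):
        if mask[r]:
            total += cone_energy(rank_embs[r - 1], rank_embs[r], K)
            count += 1
    return total / count if count else 0.0

# Toy 2-D embeddings: finer ranks sit farther from the origin along one ray.
nested = [(0.3, 0.0), (0.6, 0.0), (0.8, 0.0)]
nested_loss = stacked_entailment(nested, [1, 1, 1])
reversed_loss = stacked_entailment(nested[::-1], [1, 1, 1])
```

Embeddings that move outward along a common ray satisfy every containment constraint and incur zero loss, while reversing the order violates each constraint.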

Retrieval-Augmented Generation (RAG) and Explanation Pipelines

Dense image captioning converts $x$ to “biocaptions” with explicit morphological descriptors.
Retrieval augments LLM-based classifiers by pulling passages from large, filtered biodiversity text corpora (Wikipedia, Wikispecies). The LLM then reasons over the retrieved context together with the visual caption, producing (1) a classification with the option to abstain and (2) evidence attribution (Lesperance et al., 13 Mar 2025).
Confidence thresholds at each rank dynamically route uncertain cases to RAG modules, reducing overconfidence on the long tail.

3. Datasets, Benchmarks, and Evaluation Protocols

Static and Dynamic Benchmarks

GlobalGeoTree: 6.26M Sentinel-2 time-series samples, labels over 275 families, 2,734 genera, 21,001 species; pretraining/evaluation splits with systematically stratified rarity (Mu et al., 18 May 2025).
EcoWikiRS: ~91K tiled high-resolution images with species co-occurrence (GBIF) and >19M habitat sentences from Wikipedia; evaluates zero-shot EUNIS habitat classification (Zermatten et al., 28 Apr 2025).
TerraIncognita: Dynamic, multi-release benchmark combining 237 “novel” insect taxa (field-collected) and 200 “known” taxa (iNaturalist), with quarterly expansion—explicitly tracking open-world, OOD detection, and hierarchical accuracy (Chiranjeevi et al., 29 May 2025).
LifeCLEF 2016: 110K+ plant images, 1,000 known classes, large-scale open-set evaluation (“mAP-open,” “mAP-open-invasive”), with distractor “unknown” samples (Goeau et al., 25 Sep 2025).
BioSCAN-1M: Used for hyperbolic multimodal taxonomy embedding (Gong et al., 22 Aug 2025).

Evaluation Metrics

Metric	Definition	Application
Rank-1 Accuracy	Correct assignments on known classes	Species/genus/family
Macro Recall	Average recall over classes (head/tail-weighted equally)	Long-tailed domains
Open-set AUROC/TNR	Discriminative power for “unknown” inputs	Open-world detection
Discovery Accuracy	Fraction of true unknowns correctly labeled “Unknown”	(Chiranjeevi et al., 29 May 2025)
mAP-open	Mean Average Precision on known, penalizing false positives on unknowns	(Goeau et al., 25 Sep 2025)
Abstention Rate	$\text{Abst}_\ell = \frac{1}{N}\sum_{i=1}^N \mathbf{1}[\hat{y}_i^{\ell} = \text{Unknown}]$	(Chiranjeevi et al., 29 May 2025)

In addition, composite hierarchical F1, cross-domain transfer, and explanation-alignment scores are reported where available (Low et al., 22 Dec 2025, Chiranjeevi et al., 29 May 2025).
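Macro recall and the abstention rate from the table can be computed as in the sketch below. The function names and the toy labels are hypothetical; the example is constructed to show how macro recall exposes failure on a tail class that plain accuracy hides.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so head and tail classes count equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def abstention_rate(y_pred, unknown_label="Unknown"):
    """Fraction of predictions at a given rank that are the Unknown label."""
    return sum(p == unknown_label for p in y_pred) / len(y_pred)

# Made-up long-tailed example: 8 samples of head class "a", 2 of tail class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "Unknown"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = macro_recall(y_true, y_pred)
abst = abstention_rate(y_pred)
```

Here accuracy is 0.8 while macro recall is only 0.5, because the tail class is never recovered; the abstention rate is 0.1.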

4. Open-World Discovery, Unknown Detection, and Abstention

Open-world frameworks reject or abstain on unfamiliar taxa using several mechanisms:

Thresholding: Confidence $\max_j p_j(x)$ is compared to a threshold $\tau$ optimized for TPR on knowns; inputs below $\tau$ are labeled “Unknown” (Low et al., 22 Dec 2025, Goeau et al., 25 Sep 2025).
Partial Reject/Set-Valued Bayes: Classifiers return a set of plausible classes whose posteriors exceed a relative cutoff; the empty set signals outlier (“novel taxon”) (Karlsson et al., 2019). Posterior predictive tail probabilities serve as Bayesian $p$-values for additional outlier protection.
Retrieval-Augmented Abstention: When a closed-set VLM’s confidence falls below learned thresholds (e.g., $\tau_{vlm}(\text{Family})$), the RAG module produces “Unknown” at any rank where evidence or confidence falls short (Lesperance et al., 13 Mar 2025).
Clustering for Class Discovery: Rejected examples are grouped via learned pairwise metrics (PCN), typically hierarchical complete-linkage or k-means, using validation to set the merge cutoff $\theta$ (Shu et al., 2018).
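The clustering step for rejected examples can be sketched with complete-linkage agglomeration stopped at a merge cutoff $\theta$. Plain Euclidean distance stands in for the learned pairwise metric, which is a substitution made for illustration; the function name and the toy points are hypothetical.

```python
import math

def complete_linkage(points, theta):
    """Greedy agglomerative clustering with complete linkage.
    Merging stops once the closest pair of clusters is farther apart than theta.
    Returns clusters as lists of indices into `points`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > theta:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up 2-D embeddings of rejected specimens: two tight pairs and one isolate.
rejected = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
groups = sorted(sorted(c) for c in complete_linkage(rejected, theta=1.0))
```

Each resulting group is a candidate novel taxon; the isolated point remains a singleton. In practice $\theta$ would be chosen on validation data, as noted above.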

Key tradeoffs include calibration of rejection (to minimize over/under-abstention), robustness to class imbalance, and interpretability—flagging which explicit traits drive “unknown” assignments.

5. Multimodal, Hierarchy-aware, and Foundation Model Approaches

Recent advances include:

Multimodal Foundation Models: TaxaBind aligns ground images, satellite, audio, location, and environmental vectors in a 512-D space, unlocking zero-shot retrieval, spatial queries, and cross-modal downstream tasks. Multimodal patching ensures that each modality preserves unique species-relevant information (Sastry et al., 2024).
Hyperbolic Architectures: Imposing hyperbolic geometry enables nesting and separation matching tree-structured taxonomies, outperforming Euclidean baselines in unseen species retrieval, especially via DNA barcodes (Gong et al., 22 Aug 2025).
Open-vocabulary object detectors: OpenWildlife generalizes Grounding-DINO for open-vocabulary detection, using multilingual BERT and region–token contrastive losses to seamlessly handle user-specified species names, including those unseen during training (Patel et al., 24 Jun 2025).
Limitations of Generalist VLMs/LLMs: Foundation models such as GPT-4o, Gemini, and BioCLIP display high accuracy at coarse taxonomic levels (order) but collapse at species granularity (<2% F1), failing to generalize to fine-grained, rare, or entirely novel taxa, a pattern demonstrated in dynamic benchmarks such as TerraIncognita (Chiranjeevi et al., 29 May 2025, Low et al., 22 Dec 2025).

6. Practical Applications, Tradeoffs, and Future Directions

Open-world ecological taxonomy classification supports automated biodiversity monitoring, in situ discovery of novel species, large-scale habitat assessment, and conservation prioritization. Notable findings and design choices include:

Simple softmax-threshold rejection is competitive at moderate open-set ratios; specialized (EVT/OpenMax) recalibrators are needed for higher novelty rates (Goeau et al., 25 Sep 2025).
Simultaneous handling of taxonomic hierarchy, long-tailed distributions, and domain shift is essential; dual-margin penalization and norm-guided sampling achieve state-of-the-art recall for rare taxa (Low et al., 22 Dec 2025).
Active learning, dynamic benchmark expansion (Chiranjeevi et al., 29 May 2025), and incorporation of additional modalities (e.g., LiDAR, SAR, DNA barcoding) will further enhance species discovery and confidence in real-world applications.
Frameworks that jointly optimize hierarchical accuracy, OOD detection, and explanation alignment—while minimizing overcommitment—set the direction for next-generation AI-guided taxonomy.

Future research directions emphasize hierarchy-aware losses (e.g., tree-distance penalties), uncertainty quantification, and foundation models explicitly tuned to ecological ontologies and fine-grained morphological expertise (Low et al., 22 Dec 2025, Gong et al., 22 Aug 2025, Mu et al., 18 May 2025, Chiranjeevi et al., 29 May 2025).

Major references: (Low et al., 22 Dec 2025, Sastry et al., 2024, Mu et al., 18 May 2025, Zermatten et al., 28 Apr 2025, Lesperance et al., 13 Mar 2025, Gong et al., 22 Aug 2025, Chiranjeevi et al., 29 May 2025, Goeau et al., 25 Sep 2025, Karlsson et al., 2019, Shu et al., 2018, Patel et al., 24 Jun 2025)