
iNaturalist-2021: Species Benchmark Dataset

Updated 28 January 2026
  • The iNaturalist-2021 Dataset is a large-scale, fine-grained image collection with about 2.7 million images across 10,000 species and rich taxonomic metadata.
  • It features community-curated, research-grade annotations with an estimated 85–95% accuracy, providing largely reliable labels despite inherent citizen-science noise.
  • The dataset benchmarks various transfer learning approaches—including supervised, self-supervised, and vision–language models—addressing long-tail and fine-grained classification challenges.

The iNaturalist-2021 (iNat2021) dataset is a large-scale, fine-grained, image-based classification collection curated to benchmark species recognition tasks in computer vision and machine learning. Derived from the citizen science iNaturalist platform, iNat2021 comprises approximately 2.7 million images encompassing 10,000 species-level classes, with a focus on macro-organisms spanning Animalia, Plantae, Fungi, and other major taxonomic kingdoms. Its primary role is as a benchmark for transfer learning—including supervised, self-supervised, and vision–language approaches—in the context of ecologically and agriculturally relevant species detection (Nakkab et al., 2023, Horn et al., 2021).

1. Dataset Scope, Composition, and Structure

iNat2021 constitutes the largest fine-grained species dataset to date, offering the following key attributes:

  • Scale: ~2,700,000 images, 10,000 classes (species-level granularity).
  • Taxonomic hierarchy: Each class represents a unique species, with accompanying metadata capturing scientific (binomial) names, common names, and higher-level taxonomy (order, family, genus).
  • Kingdoms covered: Animalia, Plantae, Fungi, and other macro-organisms.
  • Data splits: The dataset adheres to a time-based split protocol, allocating images to training, validation, and test sets by observation date:
    • Training: 2,686,843 images
    • Validation: 100,000 images
    • Test: 500,000 images
  • Per-class annotation statistics: The full split maintains a minimum of 152 and a maximum of 300 training images per class (average ≈267), mitigating the extreme long-tail bias in the broader iNaturalist archive (Horn et al., 2021). A “mini” variant samples 50 images per class.
| Property | Value |
|---|---|
| Total images | ~2,700,000 |
| Number of species | 10,000 |
| Average images/class | ~270 (min 152, max 300) |
| Taxonomic scope | Animalia, Fungi, Plantae |
| Metadata per class | Common/scientific name, higher taxonomy |

Table: iNaturalist-2021 summary (Nakkab et al., 2023, Horn et al., 2021)

All images derive from the iNaturalist platform's "research-grade" observations, having reached community consensus on species identification (Horn et al., 2021). Metadata fields for each image include geolocation (latitude, longitude), timestamp, observer identifier, and standardized taxonomic paths (Horn et al., 2021).
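The split statistics above can be checked programmatically. As a minimal sketch, assuming the annotations ship in a COCO-style JSON layout (a dict with an `"annotations"` list whose records carry a `category_id`; the field names here are assumptions, and the toy input below is not real iNat2021 data):

```python
from collections import Counter

def per_class_counts(annotations):
    """Count images per category from a COCO-style annotation dict
    (assumed layout: each annotation record has a "category_id")."""
    return Counter(a["category_id"] for a in annotations["annotations"])

def summarize(counts):
    """Return (min, max, mean) number of images per class."""
    values = list(counts.values())
    return min(values), max(values), sum(values) / len(values)

# Toy example with three classes; the real train split has 10,000
# classes with between 152 and 300 images each (mean ~267).
toy = {"annotations": [{"image_id": i, "category_id": i % 3} for i in range(9)]}
print(summarize(per_class_counts(toy)))  # (3, 3, 3.0)
```

On the real training annotations, `summarize` should report a minimum of 152, a maximum of 300, and a mean near 267.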

2. Annotation Quality and Metadata

Annotation in iNat2021 leverages the intrinsic curation processes of the iNaturalist platform:

  • Labeling process: Species assignments are crowd-sourced and undergo peer review, with "research-grade" denoting consensus accuracy. Observations flagged as uncertain or ambiguous are either removed or explicitly marked.
  • Quality control: No additional human verification is performed for iNat2021. Pilot studies estimate the accuracy of "research-grade" assignments at ~85–95%, with some residual label noise remaining (Horn et al., 2021).
  • Metadata: Rich per-image information supports auxiliary research, including geospatial modeling, temporal patterns, observer biases, and federated learning paradigms (Horn et al., 2021).
  • Taxonomic treatment: Subspecies and varieties are merged to the species level to maximize inter-class discriminability and consistency (Horn et al., 2021).
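The species-level merge described above can be illustrated as a name normalization that collapses trinomial (subspecies/variety) names to the genus + specific epithet. This is only an illustration of the idea; the actual dataset merge operates on platform taxon identifiers, not raw name strings:

```python
def to_species_level(scientific_name: str) -> str:
    """Collapse a trinomial name to its species-level binomial by
    keeping only the first two tokens (genus + specific epithet).
    Illustrative sketch only, not the dataset's actual pipeline."""
    parts = scientific_name.split()
    return " ".join(parts[:2])

print(to_species_level("Canis lupus familiaris"))  # Canis lupus
print(to_species_level("Quercus robur"))           # Quercus robur
```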

A plausible implication is that label noise and systematic imperfections in citizen science observations introduce real-world complexity valuable for training robust models.

3. Data Distribution and Challenges

The iNat2021 dataset reflects the long-tailed structure of natural-world observations:

  • Class imbalance: While the main split limits extremes (min 152, max 300 images/class), prior versions display a heavy long-tail (many classes <100 images, few >10,000) (Horn et al., 2021, Nakkab et al., 2023).
  • Sampling: No explicit rebalancing is performed in the distributed dataset; models must account for class frequency disparities natively.
  • Variability: The dataset represents a spectrum of image qualities—from high-resolution DSLR captures to low-resolution smartphone outputs. Environmental conditions, occlusions, and intra/interclass visual similarity pose substantial classification and generalization challenges.
  • Fine-grainedness: Many species are visually differentiated by subtle morphological cues (color patterns, shapes, structures), increasing the difficulty for representation learning and standard softmax classification (Nakkab et al., 2023).

This long-tailed, fine-grained regime motivates research emphasizing transfer learning, robust contrastive methods, label smoothing, cost-sensitive losses, and advanced data augmentation.
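As one illustration of the cost-sensitive losses mentioned above, per-class loss weights can be derived from training-set frequencies, for example via an "effective number of samples" reweighting. The class counts below are hypothetical, and this is a generic sketch rather than a method prescribed by the dataset:

```python
def effective_number_weights(class_counts, beta=0.999):
    """Class-balanced weights w_c proportional to
    (1 - beta) / (1 - beta**n_c), normalized so the weights
    sum to the number of classes. Rare classes get larger weights."""
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in class_counts]
    raw = [1.0 / e for e in eff]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Hypothetical per-class counts spanning the iNat2021 full-split range.
weights = effective_number_weights([152, 267, 300])
print([round(w, 3) for w in weights])  # smallest class gets the largest weight
```

These weights can then be passed to a weighted cross-entropy loss, so that errors on rare species contribute more to the objective.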

4. Dataset Adaptations and Vision–Language Integration

Unlike many image collections, iNat2021 lacks native per-image natural-language captions. To support contrastive vision–language models, deterministic synthetic captions are generated automatically from class-level metadata:

  • Caption generation protocol (Nakkab et al., 2023):

    1. Aggregate species metadata (common/scientific name, taxonomic ranks).
    2. Select the subset of fields maximizing class discriminability.
    3. For each image of class c, emit a template caption: “A photo of a ⟨CONCAT_METADATA(c)⟩.”
  • Examples: “A photo of the Buff-tailed Coronet Birds Boissonneaua flavescens”; “A photo of the Common Blue Crab Animalia Callinectes sapidus.”

  • Vision–language packaging: Image-caption pairs are stored in WebDataset format, with captions tokenized using a CLIP-style encoder. The resulting representations are order-invariant and optimized for discriminative learning (Nakkab et al., 2023).
  • Modeling implications: The LiT (“Locked-image Text” tuning) methodology freezes the pretrained vision encoder while learning text alignment, reducing over-fitting and computational cost on long-tailed data.
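The caption protocol above amounts to a deterministic template fill over class-level metadata. A minimal sketch, assuming metadata fields like those in the examples (the concrete field names and their order are assumptions; the actual field selection in Nakkab et al. (2023) is chosen to maximize class discriminability):

```python
def make_caption(metadata: dict) -> str:
    """Emit a deterministic template caption from class-level metadata,
    mirroring "A photo of the <CONCAT_METADATA(c)>."
    The field names and their order here are assumptions."""
    fields = [metadata.get(k) for k in ("common_name", "group", "scientific_name")]
    return "A photo of the " + " ".join(f for f in fields if f) + "."

caption = make_caption({
    "common_name": "Common Blue Crab",
    "group": "Animalia",
    "scientific_name": "Callinectes sapidus",
})
print(caption)  # A photo of the Common Blue Crab Animalia Callinectes sapidus.
```

Because the mapping from class to caption is deterministic, every image of a given species receives the same caption string.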

This pipeline makes iNat2021 well-suited for benchmarking zero-shot classification, vision–language transfer, and domain-adapted representation learning (Nakkab et al., 2023).

5. Evaluation Protocols and Baseline Metrics

Species classification performance on iNat2021 is evaluated with canonical metrics:

  • Top-K accuracy: For a held-out set of $N$ images with ground-truth labels $y_i$ and top-$k$ predicted classes $\mathrm{Top}_k(x_i)$, the metrics are:
    • Top-1: $A_1 = \frac{1}{N}\sum_i \mathbb{1}[y_i \in \mathrm{Top}_1(x_i)]$
    • Top-5: $A_5 = \frac{1}{N}\sum_i \mathbb{1}[y_i \in \mathrm{Top}_5(x_i)]$
  • Per-group mean accuracy: Accuracy computed across “iconic groups” (e.g., Insects, Fungi, Plants).
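These metrics can be computed directly from per-class prediction scores. A minimal NumPy sketch with toy scores (not actual model outputs):

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k
    highest-scoring classes (ties broken arbitrarily)."""
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best classes
    hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy batch: 3 samples, 4 classes.
scores = np.array([
    [0.1, 0.6, 0.2, 0.1],  # true class 1 ranked first  -> top-1 hit
    [0.5, 0.1, 0.3, 0.1],  # true class 2 ranked second -> top-2 hit only
    [0.2, 0.2, 0.1, 0.5],  # true class 3 ranked first  -> top-1 hit
])
labels = [1, 2, 3]
print(top_k_accuracy(scores, labels, 1))  # 0.666...
print(top_k_accuracy(scores, labels, 2))  # 1.0
```

Per-group mean accuracy follows the same pattern, restricting the average to the samples belonging to each iconic group before taking the mean over groups.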

Baselines reported on iNat2021 (Horn et al., 2021) (ResNet-50, ImageNet-pretrained):

| Split | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| Full iNat2021 | 0.760 | 0.914 |
| Mini (50/class) | 0.654 | 0.851 |

Self-supervised and transfer learning results indicate standard supervised features currently outperform recent self-supervised approaches such as SimCLR, although continuous improvement from new methods is anticipated (Horn et al., 2021).

6. Relation to Other iNaturalist Releases

iNat2021 builds upon and far exceeds earlier iNaturalist releases in both scale and representational depth:

  • iNat2017: 859,000 images, 5,089 species; displays extreme long-tail and lower overall image quality (Horn et al., 2017).
  • iNat2021 vs. iNat2017: iNat2021 is ≈5–6× larger, defines splits based on observation dates, offers improved geographic and taxonomic coverage, and more balanced per-class sample regimes (Horn et al., 2021).
  • Semi-iNat 2021: A derived dataset for semi-supervised learning at scale—810 labeled species (L_in), 1,629 “out-of-class” unlabeled species (U_out), totaling ≈330,000 images; incorporates domain shifts and coarse taxonomic supervision to model open-set and semi-supervised regimes (Su et al., 2021).
  • NeWT suite: Designed for downstream transfer learning tasks (behavior, health, context) using iNat2021 pretraining (Horn et al., 2021).

The dataset continues to expand, with iNaturalist platform growth allowing annual refreshes by extending the date threshold of splits; a plausible implication is continued utility for future benchmarking as open-world recognition challenges evolve (Horn et al., 2021, Nakkab et al., 2023).

7. Applications and Limitations

iNat2021 supports a range of research and applied domains:

  • Benchmarking: Provides a challenging testbed for large-scale, fine-grained classification and for transfer and representation learning with supervised, self-supervised, and vision–language models.
  • Ecological and agricultural research: Enables robust assessment of automated species detection methods in real-world, long-tailed application settings (Nakkab et al., 2023).
  • Geo-contextual and federated modeling: Rich metadata encourages geospatial experiments and analysis using federated multi-user scenarios (Horn et al., 2021).
  • Caveats: Residual label noise (~85–95% accuracy for “research-grade” labels), absence of per-image natural-language captions, lack of multi-label annotations for co-occurring species, and natural variability in the label distribution must be considered when designing learning protocols (Horn et al., 2021).

The dataset is publicly accessible for academic research, subject to contributor-specified Creative Commons licensing via the iNaturalist platform (Horn et al., 2021).
