The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification

Published 19 Nov 2025 in cs.CV and cs.AI | (2511.15622v1)

Abstract: Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity - leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-LLMs for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multianimal tracking in the wild. The dataset is available at $\href{https://www.conservationxlabs.com/sa-fari}{\text{conservationxlabs.com/SA-FARI}}$.

Abstract PDF Upgrade to Chat

Summary

The paper introduces the SA-FARI dataset, providing large-scale, dense, and manually verified segmentation masks for multi-animal tracking in wild settings.
The paper demonstrates rigorous benchmarking using modern vision-language segmentation models, yielding significant improvements in metrics like pHOTA and ℓ1.
The paper underlines the dataset's broad taxonomic diversity and ecological realism, pushing the frontier for robust, real-world conservation monitoring.

The SA-FARI Dataset: A New Benchmark for Multi-Animal Tracking in the Wild

Introduction

The SA-FARI dataset represents a significant milestone in the development of resources for multi-animal tracking (MAT) in unconstrained wildlife settings. Addressing a recurring limitation in current benchmarks—namely, their restricted scale, taxonomic breadth, ecological diversity, or annotation quality—SA-FARI provides large-scale, dense, and manually verified spatio-temporal segmentations for 99 animal species categories across 11,609 camera trap videos, sourced from 741 independent sampling sites on four continents over a ten-year period. This essay provides an in-depth analysis of the dataset's structure, methodological contributions, benchmarking results, and the broader implications for computer vision-based biodiversity monitoring.

Figure 1: Overview of the SA-FARI dataset, including global site distribution, species coverage, and dense manual annotation workflow.

Dataset Composition and Annotation Protocol

SA-FARI aggregates video data from various ecological projects, yielding over 46 hours of annotated footage capturing highly heterogeneous real-world scenarios: multi-animal scenes, frequent occlusions, dynamic behavior, diurnal and nocturnal conditions, a range of animal sizes, and partial/camouflaged views.

Each video is labeled exhaustively with species identities, date, time, and site metadata. An expert-driven protocol—initiated by automatic segmentation using SAM 3 and followed by multi-stage human verification and correction—produces high-fidelity segmentation masks (masklets) for every individual animal. The resulting 16,224 masklets are organized into train/test splits that maximize both taxonomic and site diversity with strict non-overlap to ensure robust benchmarking.

SA-FARI’s taxonomic breadth surpasses all prior MAT datasets, spanning most orders of Mammalia, substantial Aves representation, and notable inclusion of Reptilia and Amphibia. Its distribution exhibits marked long-tailed properties characteristic of natural settings: a small number of species dominate the sample count, while many rare species or open-set categories are only present in the test partition.

Figure 2: Visualization of taxonomic and geographic diversity within SA-FARI, highlighting abundance across major animal lineages and continents.

Figure 3: Species category distributions reveal the long-tailed structure typical of real-world eco-acoustic data, crucial for evaluating open-set model generalization.

The train/test split design further partitions the test set into challenging, night, multi-masklet, large-masklet, and small-masklet subsets, supporting granular model diagnostics across ecological and perceptual axes.

Comparative Dataset Features

SA-FARI is unique among animal tracking datasets for its combination of:

Scale: 11,609 videos, 741 sites, and nearly one million frame-level mask annotations.
Diversity: 99 annotated species categories, covering broad taxonomic groups and ecological zones.
Annotation Quality: All segmentation masks are manually reviewed, far exceeding earlier datasets that rely on noisy automated masks or bounding boxes alone.
Ecological Validity: Videos are sourced from in-situ camera traps, not online media or captive settings, maximizing real-world relevance and generalizability.

Compared to existing datasets such as AnimalTrack, GMOT, and specialized UAV collections, SA-FARI is at least 5× larger in total duration, covers over 2× as many animal categories, and uniquely combines multi-region, multi-habitat, and multi-species properties with exhaustive mask-based identity tracking.

Benchmarking and Model Evaluation

SA-FARI's utility for advancing MAT systems is demonstrated through extensive benchmarking. Evaluated models fall into two primary categories: modern vision-language promptable segmentation systems (e.g., SAM 3, GLEE, LLMDet) and traditional vision-only detection/tracking pipelines (e.g., MegaDetector + ByteTrack/OCSort/BoostSort++).

Species-Specific Prompting: When evaluated under the open-vocabulary, species-specific prompt setting using the Phrase-based HOTA (pHOTA), TETA, and classification-gated F1 (ℓ1) metrics, SAM 3—especially when fine-tuned on SA-FARI—substantially outperforms competitors, showing a +32.9 improvement on ℓ1, +19.6 on TETA, and +19.1 on pHOTA relative to its standard baseline. Absolute scores show high mask and association accuracy and sharply improved generalization to challenging, rare, and occluded scenarios.

Species-Agnostic Detection: Using generic "animal" prompts, SAM 3 (SA-FARI) again dominates, with a +23.9 IDF1 and +18.9 HOTA margin over its closest vision-only competitors.

Analytical Insights from Subset Evaluations

Analysis across labeled subsets reveals several diagnostic insights:

Small-object tracking remains the most challenging case for all models; large-masklet videos yield pHOTA scores +29.2 higher than small-masklet samples.
Occlusion and movement further degrade association metrics; the challenging subset shows a −9.3 pHOTA-Ass drop relative to average.
Multi-animal scenes are moderately more difficult, but the effect is mitigated by their correlation with larger, more salient masklets.
Nighttime sequences induce a modest performance decline, but high annotation quality and exhaustive test set labeling prevent significant metric collapse.
Figure 4: Distribution of key masklet-level metrics (size, IoU, occlusion) reveals the broad difficulty spectrum and supports detailed model evaluation.

Figure 5: Representative samples illustrate annotation density and scene complexity, including occlusions, re-identification, large groups, and heavily camouflaged instances.

Implications, Limitations, and Future Prospects

SA-FARI provides a pivotal resource for the training and rigorous evaluation of next-generation MAT systems, especially those seeking to operate in open-world, open-taxonomy ecological monitoring contexts. The demonstrated gains in state-of-the-art models confirm that current approaches underperform substantially on unconstrained data when deprived of diverse, high-quality supervision—SA-FARI directly enables robust improvements and realistic OOD performance measurement.

Nonetheless, the dataset is constrained by its current ecoregional coverage—additional biomes and underrepresented taxa (especially non-mammalian vertebrates and invertebrates) remain relevant expansion targets. While the annotation pipeline is exhaustive for visual data, accompanying audio channels, as well as animal pose, depth, and behavioral descriptors, remain untapped, presenting opportunities for future multi-modal model development.

Figure 6: Video and species distribution across SA-FARI ecoregions emphasizes sampling breadth and site redundancy prevention.

Figure 7: Ranked visualization of video contributions per sampling location, indicating strong geographic dispersion and split integrity.

Conclusion

SA-FARI establishes a new standard in open-source MAT datasets for wildlife computer vision, setting benchmarks with unprecedented scale, annotation rigor, taxonomic breadth, and ecological realism. It underlines the critical need for diverse, expertly validated training data to close the gap between laboratory metrics and real-world deployment in conservation technology. As the field advances, further integration of underrepresented regions, multi-modal signals, and fine-grained behavioral annotation—alongside synthetic and semi-automatic augmentation strategies—will be instrumental in constructing truly generalizable AI for large-scale biodiversity monitoring.

Markdown Report Issue