Animal Re-ID Benchmarks Overview
- Animal re-ID benchmarks are rigorously curated datasets featuring diverse species, standardized splits, and tailored evaluation metrics for fine-grained identification.
- They employ advanced methodologies such as deep metric learning, pose-aware models, and foundation models to overcome challenges in variable field conditions.
- These benchmarks enable reliable performance comparisons, open-set evaluations, and actionable insights for conservation, livestock management, and biodiversity research.
Animal re-identification (re-ID) benchmarks structure and catalyze progress in automated wildlife monitoring by providing rigorously curated datasets, standardized evaluation protocols, and reproducible baselines for algorithmic comparison. These benchmarks are characterized by species and acquisition diversity, carefully formulated performance metrics, and an evolving suite of domain-tailored methodologies that reflect the unique difficulties of fine-grained animal identification under variable field conditions.
1. Scope and Composition of Animal Re-ID Benchmarks
Benchmarks for animal re-identification encompass a wide taxonomic, ecological, and technical range. Datasets vary from highly controlled, small-scale collections for a single species—e.g., the 2,080-image, 57-individual SealID set of endangered Saimaa ringed seals (Nepovinnykh et al., 2022)—to massive multi-species compendia spanning tens of thousands of individuals. Recent efforts prioritize taxonomic scope, cross-domain images, and annotation quality:
| Benchmark | Individuals | Images | Species Count | Notable Features |
|---|---|---|---|---|
| SealID | 57 | 2,080 | 1 | Fine-grained pelage patterns |
| ATRW | 92 (182*) | 3,649 | 1 | Bounding boxes, pose, ID |
| PetFace | 257,484 | 1,012,934 | 13 families | Extensive breed/taxonomy labels |
| WildlifeReID-10k | 10,344 | 214,262 | ~33 | 30 curated source sets, open/closed |
| MiewID (Wildbook) | 37,138 | 225,374 | 49 | Cross-curated, production-proven |
Editor's note: the asterisk in ATRW's individual count marks "entities": the left and right flanks of each tiger are treated as distinct IDs, yielding 182 entities from 92 individuals.
Key dataset characteristics include: (i) number of identities and samples per identity; (ii) acquisition context (camera-traps, video, crowdsourcing); (iii) sample diversity (pose, illumination, environment); (iv) manual and automated annotation of identities, regions, and viewpoints; and (v) explicit protocols to avoid annotation leakage and background bias.
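For illustration, characteristics (i)–(v) can be collected into a single metadata record per benchmark. The field names below are hypothetical, not drawn from any benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReIDDatasetCard:
    """Hypothetical metadata record for an animal re-ID benchmark."""
    name: str
    num_individuals: int          # (i) identities
    num_images: int               # (i) samples
    species_count: int
    acquisition: list[str] = field(default_factory=list)   # (ii) camera-trap, video, ...
    annotations: list[str] = field(default_factory=list)   # (iv) identity, bbox, viewpoint
    split_protocol: str = "disjoint-id"                    # (v) leakage-prevention strategy

# Example populated from the SealID row of the table above
sealid = ReIDDatasetCard(
    name="SealID", num_individuals=57, num_images=2080, species_count=1,
    acquisition=["camera-trap"], annotations=["identity", "segmentation mask"],
)
```

Such a card captures the comparison axes of the table in a machine-readable form.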
2. Evaluation Protocols and Metrics
Re-ID benchmarks define standardized data splits and performance metrics that reflect ecological realities and prevent overfitting to trivial cues:
- Splits: Gallery vs. Query and closed-set vs. open-set. For closed-set protocols, query individuals always appear in the gallery (e.g., SealID, ATRW); in open-set, queries may include novel individuals (WildlifeReID-10k, PetFace).
- Leakage prevention: Similarity-aware (DBSCAN in DINOv2 space), time-aware (chronological), and disjoint-ID splits are preferred over naive random assignment to prevent temporal or visual near-duplicates straddling train/test and artificially inflating scores (Adam et al., 2024, Čermák et al., 2023).
- Metrics:
- Cumulative Matching Characteristic (CMC, R@k): Proportion of queries for which the correct identity is retrieved among top-k matches.
- Mean Average Precision (mAP): Mean over all queries of the per-query average precision, mAP = (1/|Q|) Σ_{q∈Q} AP(q), where AP(q) averages the precision at each rank at which a correct match for query q is retrieved.
- Balanced Accuracy (BAKS/BAUS/NBA): Per-identity balanced recall for known and unknown individuals in open-set protocols (Adam et al., 2024).
- AUC (Area Under ROC): For verification tasks, especially open-set or unseen-identity cases (PetFace).
- Precision@K and Top-k: For retrieval on large galleries.
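The ranking metrics above can be computed directly from a query-by-gallery similarity matrix. A minimal NumPy sketch for the closed-set case follows; the function name and the omission of per-camera filtering are simplifications, not any benchmark's official implementation:

```python
import numpy as np

def cmc_and_map(sim, query_ids, gallery_ids, ks=(1, 5)):
    """CMC (R@k) and mAP from a query-by-gallery similarity matrix.

    Closed-set sketch: every query identity is assumed to appear in the
    gallery; the per-camera/self-match filtering used by some benchmarks
    is omitted for brevity."""
    order = np.argsort(-sim, axis=1)                   # best match first
    matches = gallery_ids[order] == query_ids[:, None]

    # R@k: fraction of queries with a correct identity in the top k
    cmc = {k: float(matches[:, :k].any(axis=1).mean()) for k in ks}

    # AP per query: average of precision at each rank holding a correct match
    aps = []
    for row in matches:
        hit_ranks = np.flatnonzero(row)                # 0-based hit positions
        precisions = (np.arange(len(hit_ranks)) + 1) / (hit_ranks + 1)
        aps.append(precisions.mean())
    return cmc, float(np.mean(aps))
```

For example, a query whose correct match appears at rank 1 contributes AP = 1, while one matched only at rank 3 of a 3-item gallery contributes AP = 1/3.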
Protocols may additionally standardize segmentation IoU for masked re-ID (SealID), impose camera or viewpoint splits (ATRW), or control for background and environmental confounds (Yu et al., 2024).
3. Representative Datasets and Their Roles
Several animal re-ID datasets serve as canonical testbeds for different methodological and ecological challenges:
SealID (Nepovinnykh et al., 2022): Focuses on subtle, deformable, low-contrast ring patterns in freshwater seals. A fixed gallery/query split, segmentation masks, and patch datasets support both retrieval and segmentation tasks. Demonstrates that pre-processing (segmentation and tone mapping) can yield substantial improvements (a gain of more than 28pp for the NORPPA algorithm).
ATRW (Li et al., 2019): Large-scale, pose-rich dataset with full annotation for detection, pose, and re-ID. Used for multimodal benchmarking (ResNet, metric learning, pose-part models), with cross-camera and single-camera splits; high variety in field conditions.
PetFace (Shinoda et al., 2024): Over 1 million aligned animal faces, 257k individuals in 13 families, rich breed and attribute labels. Enables both closed-set re-ID and open-set verification benchmarks.
WildlifeReID-10k (Adam et al., 2024): Aggregates 30 datasets for 10,344 animals (20+ species), imposing robust splits (time-/similarity-aware), supporting closed/open-set, enabling balanced-accuracy measures and rigorous generalization tests.
MiewID (Wildbook; Otarashvili et al., 2024): Community-curated, 49 species, with direct joint training and zero-shot evaluation of multi-species deep re-ID. Demonstrates large improvements (+12.5pp Top-1 vs. single-species training, +19.2pp vs. MegaDescriptor).
These datasets supply differentiated testbeds for background bias, pose invariance, scale, taxonomic breadth, and challenging “tail” identities with few samples.
4. Baseline and State-of-the-Art Methods
Benchmarks anchor the development and fair evaluation of re-ID models. Widely used baselines span classical local descriptors, pretrained/foundation models, and modern deep metric learning:
Local Feature Pipelines: SIFT, Superpoint, RootSIFT with spatial verification (HotSpotter for SealID); can be competitive on trivial or highly distinctive datasets, but generally outperformed by learned descriptors (Čermák et al., 2023).
Metric-Learning Backbones: Triplet loss (batch hard, semi-hard negative mining), ArcFace (angular margin), Circle loss; typically ResNet, Swin, or EfficientNetV2 backbones. ArcFace + Swin outperforms other configurations in both median/variance Rank-1 (Čermák et al., 2023).
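As one concrete instance of the losses listed above, the batch-hard variant of the triplet loss can be sketched in NumPy. This is a forward pass only, with an illustrative margin value; a real implementation would run in an autodiff framework on backbone embeddings:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Batch-hard triplet loss: for each anchor, take the hardest
    (farthest) positive and the hardest (closest) negative in the batch.
    Forward-pass sketch only; training requires gradients."""
    # Pairwise Euclidean distances between all embeddings in the batch
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)

    same = labels[:, None] == labels[None, :]
    pos = np.where(same, dist, -np.inf)
    np.fill_diagonal(pos, -np.inf)          # an anchor is not its own positive
    neg = np.where(~same, dist, np.inf)

    hardest_pos = pos.max(axis=1)           # farthest same-identity sample
    hardest_neg = neg.min(axis=1)           # closest different-identity sample
    return np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean()
```

When every identity's samples are already closer to each other than to other identities by more than the margin, the loss is zero, which is the training objective's fixed point.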
Foundation Models: MegaDescriptor (multi-dataset pre-trained Swin transformers), MiewID (EfficientNetV2 with ArcFace), DINOv2, and CLIP ViT. They differ sharply in cross-dataset generalization, with MegaDescriptor and MiewID outperforming DINOv2/CLIP (e.g., by +20–70pp Top-1 on challenging datasets).
Part-based and Pose-Aware Models: Body-part fusion (PPbM for tigers (Li et al., 2019), DVE for background/part invariance (Yu et al., 2024)); performance gains of 4–5pp mAP on cross-camera settings and increased robustness to pose/deformation.
Vision-Language and Semantic Augmentation: IndivAID leverages cross-modal CLIP and text conditioning for identity-driven discrimination, consistently outperforming CLIP-ReID on eight benchmarks (largest gain +14pp mAP on LionData) (Wu et al., 2024).
Edge/Resource-Constrained Models: Compressed MobileNetV2 for animal re-ID on microcontrollers (INT8, 64×64 input, <100KB parameters), maintaining retrieval within 2–3pp of baseline teacher models (Chen et al., 9 Dec 2025).
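The INT8 compression used in such edge deployments can be illustrated with standard symmetric max-abs weight quantization; this is a minimal sketch of the general technique, not the specific scheme of the cited work:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: real-valued w ~ scale * q.
    Assumes w contains at least one nonzero value."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover an approximate float tensor from the INT8 codes."""
    return q.astype(np.float32) * scale
```

Storing one byte per weight plus a single scale factor is what brings sub-100KB parameter budgets within reach on microcontrollers.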
Strong baselines (MegaDescriptor, ARBase, IndivAID) further clarify methodological progress and highlight unresolved robustness issues in open-set, cross-domain, and sample-poor regimes.
5. Protocol Design Choices and Their Implications
Key consensus and open questions in benchmark design include:
Splitting methodology: Random splits are prone to performance inflation via temporal or spatial leakage (near-duplicate or co-encounter images in train/test). Time- or similarity-aware splits are favored for ecological realism (Adam et al., 2024).
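A similarity-aware split can be sketched as follows: cluster images by embedding similarity, then assign entire clusters to train or test so that near-duplicates never straddle the boundary. The greedy single-linkage clustering below is a simplified stand-in for the DBSCAN-on-DINOv2-features procedure used by the benchmarks:

```python
import numpy as np

def similarity_aware_split(embeddings, test_frac=0.3, thresh=0.9, seed=0):
    """Assign whole similarity clusters to train or test.

    Simplified stand-in: greedy single-linkage clustering on cosine
    similarity instead of DBSCAN on foundation-model features."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    n = len(x)
    cluster = -np.ones(n, dtype=int)
    for i in range(n):
        if cluster[i] == -1:
            cluster[i] = i
        # pull every still-unassigned, sufficiently similar image into i's cluster
        cluster[(sim[i] >= thresh) & (cluster == -1)] = cluster[i]

    # sample whole clusters, never individual images, into the test side
    rng = np.random.default_rng(seed)
    ids = np.unique(cluster)
    rng.shuffle(ids)
    test_clusters = set(ids[: max(1, int(test_frac * len(ids)))])
    is_test = np.isin(cluster, list(test_clusters))
    return ~is_test, is_test
```

Because assignment happens at cluster granularity, two near-duplicate frames of the same encounter always land on the same side of the split.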
Species composition: Multi-species benchmarks (WildlifeReID-10k, MiewID) drive the development of universal and more generalizable embedding models, particularly benefiting rare/under-sampled taxa (+12.5pp Top-1 over species-specific training; Otarashvili et al., 2024).
Evaluation regime: Closed-set benchmarks assess retrieval among known IDs; open-set protocols explicitly test identification of unseen individuals and anomaly rejection.
Metrics: Balanced accuracy (BAKS/NBA) is critical for large, imbalanced benchmarks with many “tail” identities; mAP and R@k remain standard for pure retrieval settings.
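For the known-identity side, per-identity balanced accuracy can be sketched as the mean of per-identity recall, so that image-rich individuals cannot dominate the score; this is a BAKS-style simplification, not the exact benchmark definition:

```python
import numpy as np

def balanced_accuracy_known(true_ids, pred_ids):
    """Mean per-identity recall over known individuals (BAKS-style sketch).

    Every identity contributes equally, so a few image-rich 'head'
    identities cannot dominate a benchmark full of single-image 'tail'
    identities."""
    true_ids, pred_ids = np.asarray(true_ids), np.asarray(pred_ids)
    per_id = [(pred_ids[true_ids == i] == i).mean() for i in np.unique(true_ids)]
    return float(np.mean(per_id))
```

With four correct predictions on one identity and a single miss on another, plain accuracy is 0.8 but the balanced score is 0.5, which is exactly the imbalance this metric is designed to expose.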
A plausible implication is that standardized, robust split protocols and per-identity balanced evaluation are necessary for progress toward realistic, scalable, and transferable animal re-ID.
6. Practical Impact, Limitations, and Future Directions
Animal re-ID benchmarks ground advances in biodiversity monitoring, conservation, and livestock management. Notable impacts and limitations are as follows:
Impact: Accelerated field deployment (SealID for endangered seals, MiewID for >60 species in production), algorithmic benchmarks for active learning/annotation efficiency (AAS/NP3 improves mAP by +4–11% with <0.05% label budget (Sani et al., 10 Nov 2025)), and hardware-constrained re-ID illustrated for on-collar use (Chen et al., 9 Dec 2025).
Limitations:
- Class/sample imbalance: Even large benchmarks (WildlifeReID-10k) remain dominated by single-image individuals.
- Viewpoint and domain coverage: Some species or mark types (e.g., dorsal vs. fluke views) yield zero identification accuracy in cross-view matching, limiting current model generalizability.
- Semantic annotation and meta-data: Progress needed in integrating sex, age, behavior, and environmental context.
Recommended directions include:
- Embracing multi-species, multi-context training;
- Rigorous open-set and time-aware evaluation;
- Attribute-rich annotation and semantic/temporal context modeling;
- Targeted active learning for annotation efficiency in the open world.
Animal re-ID benchmarks thus provide essential infrastructure for method development, comparison, deployment, and ecological insight, and will continue to shape the field as modeling scenarios and technical requirements evolve (Nepovinnykh et al., 2022, Adam et al., 2024, Čermák et al., 2023, Shinoda et al., 2024, Otarashvili et al., 2024, Wu et al., 2024, Chen et al., 9 Dec 2025, Sani et al., 10 Nov 2025, Yu et al., 2024, Li et al., 2019, Schneider et al., 2019).