ViTD2PC24All: Fine-Tuned Vision Transformer Model
- ViTD2PC24All is a fine-tuned Vision Transformer tailored for zero-shot, multi-species plant identification using a 24-layer ViT backbone and 14×14 patch embeddings.
- The model employs a tile-based inference strategy with patch partitioning and Bayesian reweighting based on unsupervised visual clustering to enhance prediction accuracy.
- Achieving a macro-F1 score of 0.348 in PlantCLEF 2025, the pipeline demonstrates robust domain-prior adaptation without relying on labeled fine-tuning of the evaluation split.
ViTD2PC24All is a fine-tuned variant of the Vision Transformer architecture, specifically adapted for zero-shot, multi-species plant identification in high-resolution quadrat imagery. Employed by DS@GT in the PlantCLEF 2025 challenge, the pipeline integrates a 24-layer ViT backbone, a tile-based patch inference mechanism, and domain-prior adaptation via unsupervised visual clustering. This model and its inference strategy achieved second place in the PlantCLEF 2025 challenge with a macro-averaged F1 of 0.348 on the private leaderboard, relying entirely on the pre-trained network and Bayesian adaptation from test-set statistics, without any labeled fine-tuning on the evaluation split (Gustineli et al., 8 Jul 2025).
1. Model Architecture and Fine-Tuning Protocol
The ViTD2PC24All pipeline is built upon the “timm/vit_large_patch14_dinov2.lvd142m” Vision Transformer, comprising 24 transformer encoder layers (depth = 24), a hidden dimension $d = 1024$, and $16$ attention heads (per-head dimension 64). The feed-forward block, or MLP, has size $4d = 4096$. The image is partitioned into $14 \times 14$ pixel patches, yielding $37 \times 37 = 1369$ patches per $518 \times 518$ pixel input.
Patch embeddings use a learnable projection matrix $\mathbf{E}$, and a learnable [CLS] token is prepended, with 1D position embeddings $\mathbf{E}_{\text{pos}}$ added. The resulting token matrix is processed by the transformer stack, and the output [CLS] vector is dispatched to the classifier head.
Under the “All” fine-tuning regime, both the backbone and the classifier head are tuned on 1.31 million images covering 7,806 species from the PlantCLEF 2024 dataset. The classifier is a linear layer followed by a softmax. Training uses cross-entropy loss, AdamW optimizer with cosine LR decay and linear warmup, and standard augmentations (rand-augment, random cropping, dropout, and weight decay). Validation and test splits each contain around 50,000 images.
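The shape arithmetic implied by this architecture can be checked with a short sketch (pure Python; only the architecture constants stated above are used, and the variable names are ours):

```python
# Architecture constants for vit_large_patch14_dinov2.lvd142m (ViT-Large).
PATCH = 14             # patch side length in pixels
IMG = 518              # network input resolution (518 x 518)
DEPTH = 24             # transformer encoder layers
D_MODEL = 1024         # hidden dimension d
N_HEADS = 16           # attention heads
MLP_DIM = 4 * D_MODEL  # feed-forward width, 4d = 4096

patches_per_side = IMG // PATCH    # 518 / 14 = 37
n_patches = patches_per_side ** 2  # 37 * 37 = 1369 patch tokens
seq_len = n_patches + 1            # +1 for the [CLS] token
head_dim = D_MODEL // N_HEADS      # 1024 / 16 = 64 per head

print(patches_per_side, n_patches, seq_len, head_dim, MLP_DIM)
```

Running this confirms the 1369-patch sequence length (1370 with [CLS]) that the tile-based inference below is designed to preserve.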
2. Tile-Based Inference and Patch Handling
The test images, typically ~2000×2000 pixels and containing multiple plant species per field, are decomposed into a 4×4 grid of non-overlapping tiles. Each tile is roughly $500 \times 500$ pixels and is resized to $518 \times 518$ for compatibility with the network’s input. This alignment preserves the original patch architecture and receptive field, ensuring no loss of discriminative spatial detail.
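The 4×4 tiling step can be sketched as follows (a minimal pure-Python version; the helper name and integer box arithmetic are ours, while the grid size and the 518-pixel resize target follow the description above):

```python
def tile_grid(width, height, rows=4, cols=4):
    """Return (left, top, right, bottom) boxes for a rows x cols tiling."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes

# A typical ~2000 x 2000 quadrat image yields 16 tiles of ~500 x 500 px,
# each of which is then resized to 518 x 518 before the ViT forward pass.
boxes = tile_grid(2000, 2000)
print(len(boxes), boxes[0], boxes[-1])
```

Each box would then be cropped and resized (e.g., with PIL) before being fed to the network.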
For a given tile $t$, the sequence of patch-embedded tokens is:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}};\, \mathbf{x}_1\mathbf{E};\, \dots;\, \mathbf{x}_N\mathbf{E}] + \mathbf{E}_{\text{pos}}$$

This is processed through the 24 transformer layers:

$$\mathbf{z}_\ell = \mathrm{TransformerLayer}_\ell(\mathbf{z}_{\ell-1}), \quad \ell = 1, \dots, 24$$

The final [CLS] token is classified via:

$$p_t(y) = \mathrm{softmax}\big(\mathbf{W}\,\mathbf{z}_{24}^{[\text{CLS}]} + \mathbf{b}\big)_y$$
For each tile, the top-K predicted classes under $p_t$ (e.g., $K = 9$ in the best-performing configuration) are retained as candidate species.
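A minimal sketch of the per-tile top-K step (pure Python; the logits here are toy stand-ins for the classifier head's output, and the function names are ours):

```python
import math
from heapq import nlargest

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_species(logits, k=9):
    """Return the k class indices with the highest softmax probability."""
    probs = softmax(logits)
    return nlargest(k, range(len(probs)), key=probs.__getitem__)

# Toy example with 5 classes instead of 7,806.
print(topk_species([2.0, 0.5, 3.1, -1.0, 1.2], k=3))  # -> [2, 0, 4]
```

In the real pipeline the same selection runs over 7,806-way softmax outputs, once per tile.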
3. Visual-Cluster Priors and Bayesian Inference
To address label imbalance and exploit spatial/geographic relationships in the dataset, the pipeline estimates unsupervised priors for reweighting predictions.
- PaCMAP + K-Means Clustering: Each full quadrat’s [CLS] embedding is reduced to 2D using PaCMAP and grouped into clusters via K-Means. Each “region” in the test set is assigned to the cluster containing the majority of its images.
- Empirical Cluster Priors: For each cluster $c$, a prior probability distribution over species is computed:

$$\pi_c(y) = \frac{1}{|I_c|} \sum_{i \in I_c} \bar{p}_i(y)$$

where $i$ indexes images in cluster $c$, $I_c$ is the set of such images, and $\bar{p}_i$ is the average prediction across that image’s tiles.
- Tile Prediction Reweighting: For tile $t$ in an image assigned to cluster $c$, the adjusted per-class posterior is given by:

$$\tilde{p}_t(y) \propto p_t(y)\, \pi_c(y)$$

Renormalization over $y$ ensures a valid probability distribution.
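The prior-estimation and reweighting steps can be sketched in pure Python (the cluster contents and tile predictions are toy stand-ins, and the function names are ours):

```python
def cluster_prior(image_preds):
    """Average the per-image mean predictions in a cluster into a prior pi_c."""
    n_classes = len(image_preds[0])
    n = len(image_preds)
    return [sum(p[y] for p in image_preds) / n for y in range(n_classes)]

def reweight(tile_pred, prior):
    """Multiply the tile posterior by the cluster prior, then renormalize."""
    raw = [p * q for p, q in zip(tile_pred, prior)]
    z = sum(raw)
    return [r / z for r in raw]

# Toy cluster: two images' tile-averaged predictions over 3 species.
prior = cluster_prior([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
adjusted = reweight([0.3, 0.3, 0.4], prior)
print(prior, round(sum(adjusted), 6))
```

The renormalization in `reweight` is what restores a valid distribution after the element-wise product with the prior.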
4. Aggregation and Multi-Species Decision Logic
Species predictions are aggregated across tiles to produce final image-level decisions:
- For each tile $t$, the top-K species under $\tilde{p}_t$ are selected as “votes.”
- Across all $16$ tiles, voted species counts are tabulated.
- The final prediction for the entire quadrat consists of the $n$ most-frequently voted species, with $n$ set by optimizing macro-F1 on the validation set.
This hierarchical, majority-vote mechanism allows fine-grained spatial distinctions within large images and incorporates location-dependent priors.
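The tile-level majority vote can be sketched with `collections.Counter` (the vote lists are toy species IDs, and the cutoff value passed as `n` is an illustrative assumption, not the tuned value):

```python
from collections import Counter

def aggregate_votes(tile_topk_lists, n=2):
    """Tally per-tile top-K votes and keep the n most frequent species."""
    counts = Counter(s for topk in tile_topk_lists for s in topk)
    return [species for species, _ in counts.most_common(n)]

# Four toy tiles (instead of 16), each voting for its top-2 species IDs.
votes = [[101, 205], [101, 330], [205, 101], [412, 101]]
print(aggregate_votes(votes, n=2))  # -> [101, 205]
```

In the full pipeline, the same tally runs over all 16 tiles' reweighted top-K lists per quadrat.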
5. Evaluation, Ablations, and Key Insights
Ablation results reported on the PlantCLEF 2025 private leaderboard demonstrate the impact of each pipeline component. Macro-F1 scores are summarized in the following table:
| Method | Top-K | Tiles | Private Macro-F1 | Public Macro-F1 |
|---|---|---|---|---|
| ViT-only | 20 | — | 0.0063 | 0.0116 |
| ViT + 4×4 tiling | 20 | 4×4 | 0.2631 | 0.2524 |
| ViT + 4×4 tiling | 9 | 4×4 | 0.3442 | 0.3081 |
| ViT + GeoFilter | 10 | 4×4 | 0.3449 | 0.3160 |
| ViT + ClusterPriors | 9 | 4×4 | 0.3483 | 0.2929 |
Key findings:
- Full-image inference (single ViT pass) yields near-random performance.
- Matching tile size to the ViT’s receptive field is essential; tiling alone dramatically increases F1.
- Clustering-based Bayesian priors yield further improvements at negligible computational cost.
- Geolocation filtering provides a comparable alternative to cluster priors, but the two are not combined in the reported ablations.
6. Accessibility and Reproducibility
All source code, configuration files, and reproducibility scripts for the ViTD2PC24All pipeline are open-sourced and available at github.com/dsgt-arc/plantclef-2025. The implementation adheres to the official PlantCLEF 2024 training recipe and employs standard Vision Transformer training and inference frameworks. No additional supervised training is required for the domain-adaptation components, which rely solely on clustering and reweighting derived from the unlabeled test set (Gustineli et al., 8 Jul 2025).