ViTD2PC24All: Fine-Tuned Vision Transformer Model
- ViTD2PC24All is a fine-tuned Vision Transformer tailored for zero-shot, multi-species plant identification using a 24-layer ViT backbone and 14×14 patch embeddings.
- The model employs a tile-based inference strategy with patch partitioning and Bayesian reweighting based on unsupervised visual clustering to enhance prediction accuracy.
- Achieving a macro-F1 score of 0.348 in PlantCLEF 2025, the pipeline demonstrates robust domain-prior adaptation without relying on labeled fine-tuning of the evaluation split.
ViTD2PC24All is a fine-tuned variant of the Vision Transformer architecture, specifically adapted for zero-shot, multi-species plant identification in high-resolution quadrat imagery. Employed by DS@GT in the PlantCLEF 2025 challenge, the pipeline integrates a 24-layer ViT backbone, a tile-based patch inference mechanism, and domain-prior adaptation via unsupervised visual clustering. This model and its inference strategy achieved second place in the PlantCLEF 2025 challenge with a macro-averaged F1 of 0.348 on the private leaderboard, relying entirely on the pre-trained network and Bayesian adaptation from test-set statistics, without any labeled fine-tuning on the evaluation split (Gustineli et al., 8 Jul 2025).
1. Model Architecture and Fine-Tuning Protocol
The ViTD2PC24All pipeline is built upon the “timm/vit_large_patch14_dinov2.lvd142m” Vision Transformer, comprising 24 transformer encoder layers (depth = 24), a hidden dimension $d = 1024$, and $16$ attention heads (per-head dimension 64). The feed-forward block, or MLP, has size $4d = 4096$. The image is partitioned into $14 \times 14$ pixel patches, yielding $37 \times 37 = 1369$ patches per $518 \times 518$ pixel input.
Patch embeddings use a learnable projection matrix $\mathbf{E}$, and a learnable [CLS] token is prepended, with 1D position embeddings $\mathbf{E}_{\text{pos}}$ added. The resulting token matrix is processed by the transformer stack, and the output [CLS] vector is dispatched to the classifier head.
Under the “All” fine-tuning regime, both the backbone and the classifier head are tuned on 1.31 million images covering 7,806 species from the PlantCLEF 2024 dataset. The classifier is a linear layer followed by a softmax. Training uses cross-entropy loss, AdamW optimizer with cosine LR decay and linear warmup, and standard augmentations (rand-augment, random cropping, dropout, and weight decay). Validation and test splits each contain around 50,000 images.
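The shape arithmetic implied by this architecture can be checked with a short sketch (pure Python; only the architecture constants stated above are used, and the variable names are ours):

```python
# Architecture constants for vit_large_patch14_dinov2.lvd142m (ViT-Large).
PATCH = 14             # patch side length in pixels
IMG = 518              # network input resolution (518 x 518)
DEPTH = 24             # transformer encoder layers
D_MODEL = 1024         # hidden dimension d
N_HEADS = 16           # attention heads
MLP_DIM = 4 * D_MODEL  # feed-forward width, 4d = 4096

patches_per_side = IMG // PATCH    # 518 / 14 = 37
n_patches = patches_per_side ** 2  # 37 * 37 = 1369 patch tokens
seq_len = n_patches + 1            # +1 for the [CLS] token
head_dim = D_MODEL // N_HEADS      # 1024 / 16 = 64 per head

print(patches_per_side, n_patches, seq_len, head_dim, MLP_DIM)
```

Running this confirms the 1369-patch sequence length (1370 with [CLS]) that the tile-based inference below is designed to preserve.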
2. Tile-Based Inference and Patch Handling
The test images, typically ~2000×2000 pixels and containing multiple plant species per field, are decomposed into a 4×4 grid of non-overlapping tiles. Each tile is roughly $500 \times 500$ pixels and is resized to $518 \times 518$ for compatibility with the network’s input. This alignment preserves the original patch architecture and receptive field, ensuring no loss of discriminative spatial detail.
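The 4×4 tiling step can be sketched as follows (a minimal pure-Python version; the helper name and integer box arithmetic are ours, while the grid size and the 518-pixel resize target follow the description above):

```python
def tile_grid(width, height, rows=4, cols=4):
    """Return (left, top, right, bottom) boxes for a rows x cols tiling."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes

# A typical ~2000 x 2000 quadrat image yields 16 tiles of ~500 x 500 px,
# each of which is then resized to 518 x 518 before the ViT forward pass.
boxes = tile_grid(2000, 2000)
print(len(boxes), boxes[0], boxes[-1])
```

Each box would then be cropped and resized (e.g., with PIL) before being fed to the network.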
For a given tile $t$, the sequence of patch-embedded tokens is:

$$\mathbf{z}_0 = [\mathbf{x}_{\text{cls}};\, \mathbf{x}_1\mathbf{E};\, \dots;\, \mathbf{x}_N\mathbf{E}] + \mathbf{E}_{\text{pos}}$$

This is processed through the 24 transformer layers:

$$\mathbf{z}_\ell = \mathrm{TransformerLayer}_\ell(\mathbf{z}_{\ell-1}), \quad \ell = 1, \dots, 24$$

The final [CLS] token is classified via:

$$p_t(y) = \mathrm{softmax}\big(\mathbf{W}\,\mathbf{z}_{24}^{[\text{CLS}]} + \mathbf{b}\big)_y$$
For each tile, the top-K predicted classes under $p_t$ (e.g., $K = 9$ in the best-performing configuration) are retained as candidate species.
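A minimal sketch of the per-tile top-K step (pure Python; the logits here are toy stand-ins for the classifier head's output, and the function names are ours):

```python
import math
from heapq import nlargest

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_species(logits, k=9):
    """Return the k class indices with the highest softmax probability."""
    probs = softmax(logits)
    return nlargest(k, range(len(probs)), key=probs.__getitem__)

# Toy example with 5 classes instead of 7,806.
print(topk_species([2.0, 0.5, 3.1, -1.0, 1.2], k=3))  # -> [2, 0, 4]
```

In the real pipeline the same selection runs over 7,806-way softmax outputs, once per tile.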
3. Visual-Cluster Priors and Bayesian Inference
To address label imbalance and exploit spatial/geographic relationships in the dataset, the pipeline estimates unsupervised priors for reweighting predictions.
- PaCMAP + K-Means Clustering: Each full quadrat’s [CLS] embedding is reduced to 2D using PaCMAP and grouped into clusters via K-Means. Each “region” in the test set is assigned to the cluster containing the majority of its images.
- Empirical Cluster Priors: For each cluster $c$, a prior probability distribution over species is computed:

$$\pi_c(y) = \frac{1}{|I_c|} \sum_{i \in I_c} \bar{p}_i(y)$$

where $i$ indexes images in cluster $c$, $I_c$ is the set of such images, and $\bar{p}_i$ is the average prediction across that image’s tiles.
- Tile Prediction Reweighting: For tile $t$ in an image assigned to cluster $c$, the adjusted per-class posterior is given by:

$$\tilde{p}_t(y) \propto p_t(y)\, \pi_c(y)$$

Renormalization over $y$ ensures a valid probability distribution.
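The prior-estimation and reweighting steps can be sketched in pure Python (the cluster contents and tile predictions are toy stand-ins, and the function names are ours):

```python
def cluster_prior(image_preds):
    """Average the per-image mean predictions in a cluster into a prior pi_c."""
    n_classes = len(image_preds[0])
    n = len(image_preds)
    return [sum(p[y] for p in image_preds) / n for y in range(n_classes)]

def reweight(tile_pred, prior):
    """Multiply the tile posterior by the cluster prior, then renormalize."""
    raw = [p * q for p, q in zip(tile_pred, prior)]
    z = sum(raw)
    return [r / z for r in raw]

# Toy cluster: two images' tile-averaged predictions over 3 species.
prior = cluster_prior([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
adjusted = reweight([0.3, 0.3, 0.4], prior)
print(prior, round(sum(adjusted), 6))
```

The renormalization in `reweight` is what restores a valid distribution after the element-wise product with the prior.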
4. Aggregation and Multi-Species Decision Logic
Species predictions are aggregated across tiles to produce final image-level decisions:
- For each tile $t$, the top-K species under $\tilde{p}_t$ are selected as “votes.”
- Across all $16$ tiles, voted species counts are tabulated.
- The final prediction for the entire quadrat consists of the $n$ most-frequently voted species, with $n$ set by optimizing macro-F1 on the validation set.
This hierarchical, majority-vote mechanism allows fine-grained spatial distinctions within large images and incorporates location-dependent priors.
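The tile-level majority vote can be sketched with `collections.Counter` (the vote lists are toy species IDs, and the cutoff value passed as `n` is an illustrative assumption, not the tuned value):

```python
from collections import Counter

def aggregate_votes(tile_topk_lists, n=2):
    """Tally per-tile top-K votes and keep the n most frequent species."""
    counts = Counter(s for topk in tile_topk_lists for s in topk)
    return [species for species, _ in counts.most_common(n)]

# Four toy tiles (instead of 16), each voting for its top-2 species IDs.
votes = [[101, 205], [101, 330], [205, 101], [412, 101]]
print(aggregate_votes(votes, n=2))  # -> [101, 205]
```

In the full pipeline, the same tally runs over all 16 tiles' reweighted top-K lists per quadrat.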
5. Evaluation, Ablations, and Key Insights
Ablation results reported on the PlantCLEF 2025 private leaderboard demonstrate the impact of each pipeline component. Macro-F1 scores are summarized in the following table:
| Method | Top-K | Tiles | Private Macro-F1 | Public Macro-F1 |
|---|---|---|---|---|
| ViT-only | 20 | — | 0.0063 | 0.0116 |
| ViT + 4×4 tiling | 20 | 4×4 | 0.2631 | 0.2524 |
| ViT + 4×4 tiling | 9 | 4×4 | 0.3442 | 0.3081 |
| ViT + GeoFilter | 10 | 4×4 | 0.3449 | 0.3160 |
| ViT + ClusterPriors | 9 | 4×4 | 0.3483 | 0.2929 |
Key findings:
- Full-image inference (single ViT pass) yields near-random performance.
- Matching tile size to the ViT’s receptive field is essential; tiling alone dramatically increases F1.
- Clustering-based Bayesian priors yield further improvements at negligible computational cost.
- Geolocation filtering provides a comparable alternative to cluster priors, but the two are not combined in the reported ablations.
6. Accessibility and Reproducibility
All source code, configuration files, and reproducibility scripts for the ViTD2PC24All pipeline are open-sourced and available at github.com/dsgt-arc/plantclef-2025. The implementation adheres to the official PlantCLEF 2024 training recipe and employs standard Vision Transformer training and inference frameworks. No additional supervised training is required for the domain-adaptation components, which rely solely on clustering and reweighting derived from the unlabeled test set (Gustineli et al., 8 Jul 2025).