
PLUTO: Vision Transformers in Digital Pathology

Updated 18 February 2026
  • The paper introduces PLUTO, a universal backbone unifying multi-scale image embeddings for diverse pathology tasks with high sample efficiency.
  • It employs a flexible ViT architecture with variable patch sizes and combined self-supervised losses (DINO, iBOT, MAE, Fourier) to achieve robust performance.
  • The framework enables rapid adaptation across slide-level, tile-level, and segmentation tasks while ensuring computational efficiency for clinical deployment.

The PathoLogy Universal TransfOrmer (PLUTO) family comprises a suite of domain-specific vision transformer (ViT) foundation models designed for digital pathology image analysis. PLUTO models are architected for efficient extraction of multi-scale representations from gigapixel whole slide images (WSIs), which are characterized by vastly heterogeneous spatial content, diverse tissue classes, and multiple imaging modalities. The PLUTO series aims to unify the embedding space for subcellular, cellular, tissue, and slide-level downstream tasks, using large-scale self-supervised pretraining on diverse datasets, and supporting high-throughput, adaptable clinical deployment (Juyal et al., 2024, Padigela et al., 4 Nov 2025).

1. Model Architecture and Self-Supervised Pretraining

Backbone Architecture

The original PLUTO employs a FlexiViT-S backbone:

  • 12 transformer encoder layers (L = 12)
  • Hidden dimension d = 384
  • Attention heads h = 6
  • MLP dimension d_ff = 1536
  • Total parameters ≈ 22 million

The FlexiViT extension enables PLUTO to accept variable patch sizes (p ∈ {8, 16, 32} pixels), trading off throughput and spatial granularity per use case. PLUTO-4 extends this to two paradigms:

  • PLUTO-4S (same as FlexiViT-S, 22M params, 12 layers, 6 heads)
  • PLUTO-4G (ViT-G/14: 40 layers, 1408 hidden dim, 16 heads, 1.1B params, single patch size p = 14)
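The throughput/granularity trade-off behind variable patch sizes is easy to quantify: the token count grows quadratically as the patch shrinks. A minimal sketch (the 224×224 crop size is an illustrative assumption, not a value from the papers):

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square crop."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

# FlexiViT-style patch sizes supported by PLUTO, on an illustrative 224 px crop
for p in (8, 16, 32):
    print(f"p={p}: {num_tokens(224, p)} tokens")
# p=8: 784 tokens; p=16: 196 tokens; p=32: 49 tokens
```

Since self-attention cost is quadratic in sequence length, running at p = 32 instead of p = 8 cuts attention FLOPs by roughly (784/49)² ≈ 256×, which is why the larger patch sizes suit high-throughput triage.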

PLUTO-4 employs 2D Rotary Positional Embeddings (2D-RoPE) to enhance extrapolation across large spatial contexts, applying independent planar rotations as a function of token position for x and y dimensions.
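A minimal NumPy sketch of that idea follows; the split of the feature dimension into an x-half and a y-half, and the frequency base, are illustrative assumptions rather than the paper's exact parameterization:

```python
import numpy as np

def rope_2d(tokens, xy, base=10000.0):
    """Apply 2D rotary position embeddings: the first half of each token's
    feature vector is rotated by angles derived from its x grid position,
    the second half by its y position. tokens: (n, d) with d divisible by 4;
    xy: (n, 2) integer grid coordinates."""
    n, d = tokens.shape
    assert d % 4 == 0
    half = d // 2
    out = tokens.astype(float).copy()
    freqs = base ** (-np.arange(half // 2) / (half // 2))   # per-pair frequency
    for axis in (0, 1):                                     # 0 -> x, 1 -> y
        seg = out[:, axis * half:(axis + 1) * half].reshape(n, half // 2, 2).copy()
        ang = xy[:, axis:axis + 1] * freqs                  # (n, half // 2) angles
        cos, sin = np.cos(ang), np.sin(ang)
        a, b = seg[..., 0].copy(), seg[..., 1].copy()
        seg[..., 0] = a * cos - b * sin                     # planar rotation
        seg[..., 1] = a * sin + b * cos
        out[:, axis * half:(axis + 1) * half] = seg.reshape(n, half)
    return out
```

Each 2×2 rotation is orthogonal, so token norms are preserved, and the dot product between two rotated tokens depends only on their relative (Δx, Δy) offset; that relative-position property is what supports extrapolation to larger spatial grids.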

Pretraining Datasets

PLUTO:

  • 195M image tiles from 158,852 WSIs (>50 sites, 16 tissue groups, 28 disease categories, 11 scanner models, H&E/IHC/special stains)
  • Multi-resolution sampling: 40× (0.25 mpp), 20× (0.5 mpp), 10× (1 mpp), 5× (2 mpp)
  • Background/artifact exclusion via ArtifactDetect CNN

PLUTO-4:

  • 551,164 WSIs, 137,144 patients, >50 institutions, >60 disease types, >100 stains, >10 scanner models
  • Generated ~640M image tiles via multi-scale cropping
  • Comparable artifact and region selection pipeline

Self-Supervised Objectives

PLUTO utilizes a combined loss:

\mathcal{L} = \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + \mathcal{L}_{\text{MAE}} + \mathcal{L}_{\text{Fourier}}

  • DINO/iBOT: self-distillation with no labels, student-teacher ViT setup
  • MAE: pixel reconstruction loss on masked patches
  • Fourier-band loss: penalizes low- and high-frequency discrepancies via the DFT:

    \mathcal{L}_{\text{Fourier}}(\hat y, y) = \lambda_1 \|M \cdot \mathcal{F}(\hat y) - M \cdot \mathcal{F}(y)\|_2^2 + \lambda_2 \|(1-M) \cdot \mathcal{F}(\hat y) - (1-M) \cdot \mathcal{F}(y)\|_2^2

    where M is a frequency-domain mask, \mathcal{F} the DFT, \lambda_1 = 5, and \lambda_2 = 1
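The band-weighted loss above maps directly onto NumPy's FFT; a sketch follows, where the mask construction and the sum-over-pixels reduction are illustrative choices not fixed by the formula:

```python
import numpy as np

def fourier_band_loss(pred, target, mask, lam1=5.0, lam2=1.0):
    """lam1-weighted squared error on the masked frequency band M plus
    lam2-weighted error on its complement, per the formula above.
    pred, target: (H, W) images; mask: (H, W) binary frequency-domain mask."""
    diff = np.fft.fft2(pred) - np.fft.fft2(target)    # F(y_hat) - F(y)
    in_band = np.sum(np.abs(mask * diff) ** 2)        # ||M . diff||_2^2
    out_band = np.sum(np.abs((1 - mask) * diff) ** 2) # ||(1-M) . diff||_2^2
    return lam1 * in_band + lam2 * out_band
```

With lam1 > lam2 (the paper's 5 vs 1), errors inside the masked band, e.g. low frequencies carrying tissue morphology, are penalized more heavily than the complement. A PyTorch training version would mirror this with `torch.fft.fft2`.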

PLUTO-4 is pretrained exclusively with a stabilized DINOv2 loss using a momentum teacher, centering, and temperature annealing for embedding stability.

Training Protocols

  • AdamW optimizer; linear warmup then cosine LR decay
  • Data augmentations: random local/global crops, color jitter, resolution sampling
  • Distributed pretraining (A40 or H200 GPU clusters), mixed precision, gradient clipping
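The warmup-then-cosine schedule named above is a standard recipe and can be sketched in a few lines; the specific step counts and rates in the example are placeholders, not values from the papers:

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup from 0 to base_lr over warmup_steps, then cosine
    decay from base_lr down to min_lr at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# Illustrative placeholder values: 10k steps, 100-step warmup, peak LR 1e-3
for step in (0, 99, 5000, 10000):
    print(step, lr_at(step, 10000, 100, 1e-3))
```

The warmup avoids unstable early updates while the teacher/student statistics settle; the cosine tail anneals the learning rate smoothly to its floor.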

2. Task-Specific Adaptation Mechanisms

PLUTO models are designed to support a spectrum of downstream pathology tasks by coupling the frozen backbone with lightweight adaptation heads:

Slide-Level Prediction

  • Multiple-Instance Learning (MIL) with attention pooling: Each tile embedding x_i is aggregated via learned attention coefficients a_i, yielding a bag-level feature z = \sum_i a_i x_i
  • Output: Softmax classifier on top, cross-entropy loss at the slide level
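The attention-pooled aggregation can be sketched as follows. The text specifies only learned coefficients a_i and the weighted sum z; the tanh-based scoring form and the hidden size below are illustrative assumptions in the style of standard attention-MIL:

```python
import numpy as np

def attention_mil_pool(X, V, w):
    """Aggregate tile embeddings into a bag-level slide feature.
    X: (n_tiles, d) frozen-backbone tile embeddings.
    V: (h, d) and w: (h,) are the learned attention parameters.
    Returns the bag feature z = sum_i a_i x_i and the coefficients a."""
    s = np.tanh(X @ V.T) @ w          # unnormalized per-tile attention scores
    e = np.exp(s - s.max())           # numerically stable softmax
    a = e / e.sum()                   # attention coefficients a_i, sum to 1
    z = a @ X                         # weighted sum: bag-level feature (d,)
    return z, a
```

A softmax classifier on z, trained with slide-level cross-entropy, completes the head; only V, w, and the classifier are updated, since the backbone stays frozen.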

Tile-Level Classification

  • Heads: Single linear or 2-layer MLP applied to [CLS] token (optionally concatenated with mean/all patch token pooling)
  • Output: Tile-level softmax prediction, cross-entropy supervision
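A minimal sketch of such a head on frozen embeddings; the concatenation of the [CLS] token with mean patch pooling is one of the options mentioned above, and the dimensions are illustrative:

```python
import numpy as np

def tile_logits(cls_tok, patch_toks, W, b):
    """Linear tile classifier on [CLS] concatenated with mean patch pooling.
    cls_tok: (d,); patch_toks: (n_patches, d); W: (n_classes, 2d); b: (n_classes,)."""
    feat = np.concatenate([cls_tok, patch_toks.mean(axis=0)])  # (2d,) tile feature
    return W @ feat + b                                        # class logits
```

Softmax over the returned logits plus cross-entropy supervision gives the tile-level prediction; swapping the single linear map for a 2-layer MLP is the other variant the text mentions.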

Instance Segmentation

  • Adapters for Mask R-CNN or Mask2Former frameworks: Adapter transformer bridges ViT outputs to a feature pyramid network
  • Outputs: Per-query segmentation masks and class logits
  • Optimization: Standard detection and segmentation losses

A salient property is that all adaptation heads are trained with the frozen PLUTO encoder, enabling high sample efficiency and rapid transfer across domains and tasks.
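Training a head against a frozen encoder reduces to fitting a small model on precomputed features. A sketch of a linear probe (random features stand in for frozen PLUTO embeddings; shapes, rates, and step counts are illustrative):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.1, steps=200):
    """Fit a softmax linear head on frozen features by gradient descent.
    feats: (n, d) precomputed embeddings; labels: (n,) integer class ids.
    The encoder is never updated -- only W and b, mirroring the
    frozen-backbone recipe described above."""
    n, d = feats.shape
    W = np.zeros((n_classes, d)); b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W.T + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        g = (p - onehot) / n                     # softmax cross-entropy gradient
        W -= lr * (g.T @ feats)
        b -= lr * g.sum(axis=0)
    return W, b
```

Because only (n_classes × d) + n_classes parameters are learned, a few hundred labeled tiles can suffice, which is the sample-efficiency argument made above.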

3. Benchmarking, Performance, and Scaling

Public Benchmarks

PLUTO and PLUTO-4 are evaluated on a range of tasks, including slide-level classification (e.g., NSCLC subtyping, HER2 scoring), tile/tissue/cell classification (CRC-100K, Camelyon17), and instance segmentation (GlaS, PanNuke, MoNuSAC, CoNSeP), employing metrics such as macro-F1, AUROC, Dice, IoU, bPQ, and mPQ.

| Benchmark | Adaptation | PLUTO/PLUTO-4 | SOTA (public) |
|---|---|---|---|
| NSCLC (slide) | MIL | 90.2 F1 | 88.6 (Meta-DINOv2 ViT-S) |
| CRC-100K (tile) | Linear | 96.6 Acc | 94.7 (ResNet50) |
| Camelyon17-WILDS | Linear | 96.2 Acc | 70.3 (DenseNet121) |
| GlaS (gland seg.) | Mask2Former | 91.2 Dice | 85.5 (U-Net) |
| PanNuke (nuclei) | HoverNet | 67.1 bPQ | 55.3 (ResNet50 + Mask R-CNN) |

Scaling and Efficiency

  • PLUTO achieves state-of-the-art or near-parity performance despite being 10–100× smaller in parameter count or pretraining dataset size than competing foundation models.
  • PLUTO-4G (~1.1B params) establishes new performance frontiers (e.g., an 11% improvement in dermatopathology diagnosis macro-F1 over the prior PLUTO-3 series), while PLUTO-4S maintains rapid inference and flexibility, beneficial for real-time or high-throughput applications.

4. Deployment, Specialization, and Integration

PLUTO models support multi-scale clinical workflows and research pipelines:

  • PLUTO-4S: 22M params; supports variable patch sizes for context/throughput trade-off; suited for slide triage, region proposals, cell segmentation in resource-constrained or fast-inference environments
  • PLUTO-4G: 1.1B params; optimized for maximum accuracy on challenging tasks, such as biomarker quantification (PD-L1, HER2), large-scale slide triage, and spatial omics correlation

Integration into PathAI software products exemplifies real-world deployment in biomarker discovery (PathExplore, IHCExplore), workflow triage (TumorDetect, PathAssist Derm), and automated quantification (AIM-PD-L1, AIM-HER2, AIM-TumorCellularity) (Padigela et al., 4 Nov 2025).

Computationally, PLUTO-S (ViT-S) achieves 2.5–15× faster tile throughput than larger ViTs, with memory footprints suitable for deployment on 16–24 GB GPUs at standard crop sizes. Using larger patch sizes (e.g., p = 32) further increases throughput with minimal accuracy degradation.

5. Limitations, Robustness, and Future Directions

PLUTO and PLUTO-4 delineate several considerations for continued foundation model development in pathology:

  • Adaptation Needs: Both PLUTO and PLUTO-4 serve solely as universal backbones; effective deployment requires well-designed, robust, and interpretable task adapters.
  • Trade-offs: PLUTO-4G provides maximal performance at the cost of increased computational resources; PLUTO-4S offers speed and multi-scale deployment but may lag slightly in accuracy for the most demanding tasks.
  • Scalability: Future directions include exploring Fully Sharded Data Parallelism (FSDP) for larger models, expansion to multi-modal pretraining (integrating WSI with genomics/transcriptomics), and scaling laws in diverse pathology settings.
  • Interpretability and Uncertainty: Research is motivated on interpretable representations and quantification of uncertainty, crucial for diagnostic reliability.
  • Stain and Modality Expansion: Incorporating stain translation and multiplex fluorescence data as self-supervision targets is identified as a promising avenue.

PLUTO demonstrates that a single frozen, multi-scale transformer backbone can underpin a broad landscape of pathology image tasks with strong sample efficiency, robustness to site and domain shifts, and operational flexibility. The PLUTO-4 generation further suggests that ever-larger, more diverse corpora and advanced ViT architectures can push performance frontiers across varied spatial and biological tasks (Juyal et al., 2024, Padigela et al., 4 Nov 2025).

6. Summary Table: PLUTO and PLUTO-4 Model Comparison

| Model | Params | Layers | Patch Sizes | Pretraining Tiles / WSIs | Data Diversity | Notable Features |
|---|---|---|---|---|---|---|
| PLUTO | 22M | 12 | 8, 16, 32 (FlexiViT) | 195M / 158,852 | 28 diseases, 100+ stains | Hybrid DINO/iBOT/MAE/Fourier loss |
| PLUTO-4S | 22M | 12 | 8, 16, 32 (FlexiViT) | 640M / 551,164 | 60+ diseases, 100+ stains | 2D-RoPE, DINOv2, clinical throughput |
| PLUTO-4G | 1.1B | 40 | 14 (fixed) | 640M / 551,164 | 60+ diseases, 100+ stains | Maximum accuracy, register tokens, 2D-RoPE |

*Editor's term: "PLUTO series" denotes all models described above.

7. Significance and Field Impact

PLUTO and PLUTO-4 exemplify domain-adapted foundation models, matching or exceeding state-of-the-art performance in both public and proprietary digital pathology benchmarks, often with considerably reduced data and parameter requirements. High robustness to domain shift is attributed to dataset diversity, not just model capacity, indicating the crucial importance of multi-institutional, multi-stain pretraining. The frozen backbone paradigm facilitates rapid adaptation to varied tasks, with evidence of high sample and compute efficiency. The architecture and methodological insights from PLUTO are shaping research and deployment strategies at scale for both academic and industrial digital pathology (Juyal et al., 2024, Padigela et al., 4 Nov 2025).
