CAPRMIL: Context-Aware Patch Representations for Multiple Instance Learning

Published 16 Dec 2025 in cs.CV and cs.AI | (2512.14540v1)

Abstract: In computational pathology, weak supervision has become the standard for deep learning due to the gigapixel scale of WSIs and the scarcity of pixel-level annotations, with Multiple Instance Learning (MIL) established as the principal framework for slide-level model training. In this paper, we introduce a novel setting for MIL methods, inspired by proceedings in Neural Partial Differential Equation (PDE) Solvers. Instead of relying on complex attention-based aggregation, we propose an efficient, aggregator-agnostic framework that removes the complexity of correlation learning from the MIL aggregator. CAPRMIL produces rich context-aware patch embeddings that promote effective correlation learning on downstream tasks. By projecting patch features -- extracted using a frozen patch encoder -- into a small set of global context/morphology-aware tokens and utilizing multi-head self-attention, CAPRMIL injects global context with linear computational complexity with respect to the bag size. Paired with a simple Mean MIL aggregator, CAPRMIL matches state-of-the-art slide-level performance across multiple public pathology benchmarks, while reducing the total number of trainable parameters by 48%-92.8% versus SOTA MILs, lowering FLOPs during inference by 52%-99%, and ranking among the best models on GPU memory efficiency and training time. Our results indicate that learning rich, context-aware instance representations before aggregation is an effective and scalable alternative to complex pooling for whole-slide analysis. Our code is available at https://github.com/mandlos/CAPRMIL


Summary

The paper introduces CAPRMIL, which shifts context modeling to the patch embedding stage, reducing reliance on heavy MIL aggregators.
It achieves linear complexity, cutting trainable parameters by up to 92.8% and inference FLOPs by up to 99% while maintaining competitive accuracy.
Visualizations demonstrate that CAPRMIL clusters reflect morphologically coherent regions, enhancing interpretability for clinical applications.

CAPRMIL: Context-Aware Patch Representations for Efficient and Scalable Multiple Instance Learning

Introduction and Motivation

The digitization of pathology has necessitated robust, scalable frameworks for the analysis of gigapixel-scale Whole Slide Images (WSIs), where only slide-level supervisory signals are available. Standard embedding-based and attention-based MIL architectures, such as ABMIL, CLAM, TransMIL, and recent probabilistic and prototype-based MILs, have demonstrated success but are variously constrained by the quadratic complexity of self-attention, susceptibility to overfitting, and heavy dependence on sophisticated aggregators. This work introduces CAPRMIL, a context-aware MIL paradigm that shifts the locus of correlation modeling from the MIL aggregator to the patch embedding stage, drawing inspiration from operator learning advances in neural PDE solvers.

CAPRMIL Framework

CAPRMIL decouples the challenges of context modeling and bag-level aggregation, ensuring that each patch embedding is enriched with global, morphology-aware context before entering a lightweight MIL aggregation module. The pipeline is as follows:

Patch Extraction and Projection: WSIs are tessellated and encoded into feature embeddings by a frozen backbone (e.g., a large-scale foundation model), then linearly projected into a compact latent space for computational efficiency.
Context-Aware Patch Encoding: A stack of CAPRMIL Blocks augments each patch with global context using multi-head self-attention not over the native sequence of patches, but over a compact set of context/morphology tokens derived via soft clustering of patch representations. This mechanism is mathematically and algorithmically analogous to efficient transformer solvers for PDEs such as Transolver and Transolver++.
Figure 1: The CAPRMIL framework, from patch sampling to slide-level prediction via context-aware patch encoding and pooling.

Within each CAPRMIL Block, soft clustering maps patches to $M \ll N$ clusters, which are aggregated into tokens; self-attention is performed in this low-dimensional space, and global context is broadcast back to patches using the soft assignment matrix.

Figure 2: (a) Transformer-based CAPRMIL Block design. (b) Per-head clustering, token aggregation, attention, and context broadcast.
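The four steps inside a block (soft clustering, token aggregation, token self-attention, context broadcast) can be sketched in NumPy as follows. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the function name `context_block`, the weight names, the normalized token aggregation, and the residual connection are all hypothetical choices, and layer normalization, multiple heads, and the MLP sub-layer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_block(X, W_assign, W_q, W_k, W_v):
    """Map (N, d) patch embeddings to (N, d) context-enriched embeddings.

    All parameter names are illustrative assumptions.
    """
    # 1) Soft clustering: each patch gets a distribution over M clusters.
    A = softmax(X @ W_assign, axis=1)                      # (N, M)
    # 2) Token aggregation: assignment-weighted mean of patches per cluster.
    T = (A.T @ X) / (A.sum(axis=0)[:, None] + 1e-8)        # (M, d)
    # 3) Self-attention among the M tokens only (M x M scores, not N x N).
    Q, K, V = T @ W_q, T @ W_k, T @ W_v
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=1)  # (M, M)
    T_ctx = scores @ V                                     # (M, d)
    # 4) Broadcast: each patch receives token context weighted by its
    #    soft assignment; the residual connection is an assumption.
    return X + A @ T_ctx                                   # (N, d)

rng = np.random.default_rng(0)
N, d, M = 1000, 64, 4                  # bag size, latent dim, clusters
X = rng.standard_normal((N, d))
W_assign = rng.standard_normal((d, M)) * 0.1
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Z = context_block(X, W_assign, W_q, W_k, W_v)
```

Note that no intermediate array has an $N \times N$ shape: the largest objects are the $N \times M$ assignment matrix and the $N \times d$ embeddings, which is where the linear scaling in the bag size comes from.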

MIL Aggregation and Classification: The context-enriched patch embeddings are pooled (mean, attention, or gated attention), and a simple classifier is trained on the resulting slide-level vector.
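The three pooling options named above can be sketched as follows. The attention and gated-attention forms follow the standard ABMIL formulation; the weight names, dimensions, and the linear classifier are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def softmax_1d(s):
    # Stable softmax over a vector of per-patch scores.
    e = np.exp(s - s.max())
    return e / e.sum()

def mean_pool(H):
    # H: (N, d) context-enriched patch embeddings -> (d,) slide vector.
    return H.mean(axis=0)

def attention_pool(H, V, w):
    # ABMIL-style weights: a_n proportional to exp(w^T tanh(V^T h_n)).
    a = softmax_1d(np.tanh(H @ V) @ w)            # (N,)
    return a @ H                                   # (d,)

def gated_attention_pool(H, V, U, w):
    # Gated variant: tanh branch multiplied elementwise by a sigmoid gate.
    gate = 1.0 / (1.0 + np.exp(-(H @ U)))
    a = softmax_1d((np.tanh(H @ V) * gate) @ w)    # (N,)
    return a @ H                                   # (d,)

rng = np.random.default_rng(0)
N, d, k, n_classes = 500, 64, 32, 2               # hypothetical sizes
H = rng.standard_normal((N, d))
V = rng.standard_normal((d, k)) * 0.1
U = rng.standard_normal((d, k)) * 0.1
w = rng.standard_normal(k) * 0.1

z_mean = mean_pool(H)
z_attn = attention_pool(H, V, w)
z_gated = gated_attention_pool(H, V, U, w)

# A simple linear classifier on the slide-level vector (assumed form).
W_c = rng.standard_normal((d, n_classes)) * 0.1
logits = z_mean @ W_c
```

Mean pooling has no trainable parameters, so in that configuration all learned capacity sits in the projection, the context-aware blocks, and the classifier.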

This approach results in overall linear complexity with respect to the number of input patches, a significant reduction from the $O(N^2)$ complexity of classic transformer-based aggregators.
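A rough cost comparison makes the scaling concrete. The accounting below is a back-of-the-envelope sketch that assumes latent dimension $d$, a fixed number of tokens $M$, and ignores constant factors, the number of heads, and the number of stacked blocks:

$$
\underbrace{O(N^2 d)}_{\text{full self-attention}}
\quad \text{vs.} \quad
\underbrace{O(N M d)}_{\text{assignment and aggregation}}
+ \underbrace{O(M^2 d)}_{\text{token attention}}
+ \underbrace{O(N M d)}_{\text{broadcast}}
= O(N M d + M^2 d)
$$

For a fixed $M \ll N$ the right-hand side grows linearly in $N$. As an illustration, with $N = 20{,}000$ patches (the upper bag size quoted for CAMELYON16) and $M = 4$, full attention forms $N^2 = 4 \times 10^8$ pairwise scores, whereas the token route needs $N M = 8 \times 10^4$ assignment entries and $M^2 = 16$ token-to-token scores.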

Experimental Evaluation

Datasets and Protocols

CAPRMIL is extensively evaluated on four major computational pathology benchmarks:

CAMELYON16 (tumor detection, binary classification, bags up to 20,000 patches)
TCGA-NSCLC (lung cancer, binary classification)
PANDA (prostate ISUP grading, 6-class classification)
BRACS (breast lesion coarse classification, 3-class problem with atypical, benign, malignant categories)

For all datasets, patches are extracted at canonical resolutions and featurized using the frozen UNIv1 encoder.

Discriminative Power and Efficiency

CAPRMIL, even when coupled with vanilla mean-pooling aggregation, achieves competitive slide-level classification performance—matching or closely trailing SOTA methods—with dramatic gains in efficiency. Notable highlights include:

Parameter and Compute Efficiency: CAPRMIL reduces trainable parameters by 48–92.8% versus leading MILs. Inference FLOPs are reduced by 52–99%, with improvements in both GPU memory footprint and wall-clock training times.
Accuracy-Complexity Trade-off: On large-bag tasks (e.g., CAMELYON16, BRACS), a naive mean baseline underperforms by up to 44%, whereas CAPRMIL with mean aggregation nearly closes the performance gap with much heavier models reliant on full attention. On multiclass problems (PANDA, BRACS), attention-based aggregation gains only a marginal advantage over mean pooling, indicating that CAPRMIL’s context-aware representations already retain most of the relevant discriminative structure.
Token Morphology Specificity: Visualizations of token–patch assignment heatmaps demonstrate that CAPRMIL clusters align with interpretable histological motifs—e.g., adipose tissue, malignant epithelia, stroma—ensuring that the learned representations are coherent and potentially more robust to irrelevant context and noise.
Figure 3: Token–patch assignments on CAMELYON16; each context-aware token aggregates morphologically coherent regions.

Resource Utilization

When practical deployment considerations are analyzed (GPU memory, training time, accuracy), CAPRMIL stands out for its low hardware demand while maintaining top-tier accuracy.

Figure 4: (a) GPU memory footprint and training time. (b) Memory efficiency versus classification accuracy across models.

Robustness and Modularity

Ablation studies on core architectural hyperparameters (clusters $M$, heads $H$, MLP expansion) reveal broad insensitivity, with the best results typically reached at small values (e.g., $M=4$, $H=8$). Further, swapping the final aggregation mechanism (mean, attention, gated) shows that the gains are predominantly secured during the context-aware encoding phase rather than during aggregation, which underlines CAPRMIL's modularity.

Implications and Future Directions

CAPRMIL’s design leads to several direct theoretical and practical implications for MIL in computational pathology:

Separation of Correlation Learning from Aggregator: By leveraging context-aware patch encodings, later aggregation steps can be greatly simplified, creating avenues for highly efficient, interpretable, and easily modifiable MIL pipelines.
Translational Potential: Linear scaling is crucial for clinical-scale WSI pipelines, particularly as slide sizes and cohort numbers grow. CAPRMIL, decoupled from expensive per-slide optimization of attention heads, is readily extensible to multi-modal or multi-resolution MIL.
Link to Operator Learning: The architectural parallel to neural PDE solvers suggests a methodological bridge between computational physics and digital pathology, and invites further innovation around context modeling.
Interpretability: Morphologically faithful clustering points toward more reliable per-region interpretation, helping address concerns with lack of trust in soft attention scores and boosting clinical acceptability.
Figure 5: Additional token–patch assignment heatmaps indicating robust, interpretable clustering across test slides.

Conclusion

CAPRMIL demonstrates that efficient, context-aware patch tokenization advances MIL for WSI analysis: context modeling at the embedding level removes dependence on high-parameter, computation-heavy aggregators, yielding robust, modular, and efficient pipelines. Future developments could leverage this framework for multimodal integration, uncertainty quantification, and scaling to even higher-dimensional biomedical datasets.