- The paper introduces a masked autoencoder framework with vision transformers to extract precise biological concepts from microscopy data.
- It employs layer-wise linear probing and a perturbation consistency benchmark to quantify representation quality across model layers.
- Comprehensive genome-level and RxRx1 task evaluations reveal that scaling model size enhances latent feature separability and biological relationship recall.
This paper presents an in-depth exploration of training self-supervised foundation models, specifically vision transformers (ViTs), for biological representation learning in microscopy. Leveraging masked autoencoders (MAEs) with ViT backbones, the authors propose and evaluate a new suite of models termed Phenom, including Phenom-Beta, Phenom-1 variants, and the gigantic Phenom-G/8, emphasizing the significance of scaling for performance across downstream tasks in phenomics.
Core Contributions
The paper makes several important contributions to the fields of computational biology and computer vision:
- Layer-wise Biological Linear Probing Analyses: The authors introduce a suite of analytical techniques to explore the biological representation learning capacity of ViTs applied to microscopy. Using linear probing across model layers, they demonstrate that earlier layers can sometimes yield more useful representations than the final layer for specific tasks.
- Perturbation Consistency Benchmark: A new benchmark, perturbation consistency, is introduced to enrich the assessment of precision in biological representation learning, with particular utility in drug discovery.
- Dataset Curation and Model Comparisons: The creation of Phenoprints-16M, a dataset curated for statistically significant positive samples, enables improved training of MAEs. The paper also contrasts the proposed models with existing ones, notably comparing the new state-of-the-art Phenom-G/8, trained with extensive compute, against a baseline vision transformer trained on natural images.
- Full-genome Biological Benchmarking: The research includes comprehensive genome-level benchmarking, with an emphasis on perturbation consistency over traditional biological relationship recall metrics.
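The layer-wise linear probing analysis described in the contributions can be sketched as follows. This is a minimal illustration on synthetic features, not the authors' pipeline: the layer names and toy embeddings are placeholders standing in for frozen per-layer ViT features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(layer_features, labels):
    """Fit a linear probe on the frozen embeddings of each layer and
    return per-layer held-out accuracy."""
    scores = {}
    for name, feats in layer_features.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            feats, labels, test_size=0.25, random_state=0, stratify=labels)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[name] = clf.score(X_te, y_te)
    return scores

# Toy stand-in: synthetic "embeddings" from three hypothetical ViT layers,
# with class signal growing by depth to mimic increasingly separable features.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
layer_features = {
    f"layer_{i}": rng.normal(size=(200, 64)) + labels[:, None] * (0.05 * i)
    for i in (4, 8, 12)
}
print(probe_layers(layer_features, labels))
```

Comparing the per-layer scores directly reveals whether intermediate layers linearly separate the task better than the final one, which is the crux of the analysis.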
Experimental Results and Implications
The authors' experiments emphasize the role of scaling in transformer models for improved latent-space separability, reporting that their largest models outperform alternative approaches in linearly separating complex biological features. Noteworthy are the findings that biological relationship recall and perturbation consistency are key indicators of improved representation learning, particularly when training on large datasets such as Phenoprints-16M.
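The summary does not spell out how biological relationship recall is computed. One plausible form, sketched here under that assumption, ranks gene pairs by the cosine similarity of their aggregated perturbation embeddings and measures recall of known related pairs among the top-ranked fraction; the gene names and `known_pairs` set below are hypothetical.

```python
import numpy as np

def relationship_recall(gene_emb, known_pairs, top_fraction=0.05):
    """Recall of known gene-gene relationships among the most similar
    embedding pairs (by cosine similarity); a hedged stand-in metric."""
    genes = sorted(gene_emb)
    X = np.stack([gene_emb[g] for g in genes]).astype(float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    # All unordered gene pairs with their similarity, highest first.
    pairs = [(sim[i, j], genes[i], genes[j])
             for i in range(len(genes)) for j in range(i + 1, len(genes))]
    pairs.sort(reverse=True)
    k = max(1, int(top_fraction * len(pairs)))
    top = {frozenset((a, b)) for _, a, b in pairs[:k]}
    known = {frozenset(p) for p in known_pairs}
    return len(top & known) / len(known)

# Toy example: geneA and geneB have near-identical embeddings.
gene_emb = {
    "geneA": np.array([1.0, 0.0]),
    "geneB": np.array([0.9, 0.1]),
    "geneC": np.array([0.0, 1.0]),
    "geneD": np.array([-1.0, 0.2]),
}
known_pairs = [("geneA", "geneB")]
print(relationship_recall(gene_emb, known_pairs, top_fraction=0.2))  # → 1.0
```

In practice the embeddings would be per-gene aggregates of perturbation wells, and the known-pairs set would come from a curated relationship database.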
By demonstrating superior performance on benchmarks such as RxRx1 classification and the novel perturbation consistency analysis, this work significantly contributes to our understanding of how self-supervised models generalize to biology-driven domains distinct from natural image datasets.
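The exact perturbation-consistency metric is not detailed in this summary. A hedged sketch of one reasonable variant compares the mean within-perturbation cosine similarity of replicate embeddings against a label-permutation null; the data below are synthetic and the function is an illustration, not the authors' definition.

```python
import numpy as np

def perturbation_consistency(emb, pert_ids, n_perm=200, seed=0):
    """Mean within-perturbation cosine similarity of replicate embeddings,
    with an empirical p-value against a label-permutation null."""
    rng = np.random.default_rng(seed)
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    ids = np.asarray(pert_ids)

    def mean_within(labels):
        sims = []
        for p in np.unique(labels):
            grp = X[labels == p]
            if len(grp) < 2:
                continue
            s = grp @ grp.T
            iu = np.triu_indices(len(grp), k=1)
            sims.append(s[iu].mean())
        return float(np.mean(sims))

    observed = mean_within(ids)
    null = [mean_within(rng.permutation(ids)) for _ in range(n_perm)]
    p_value = (1 + sum(n >= observed for n in null)) / (1 + n_perm)
    return observed, p_value

# Toy data: two perturbations, five noisy replicates each.
rng = np.random.default_rng(1)
centers = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
emb = np.concatenate([c + 0.05 * rng.normal(size=(5, 3)) for c in centers])
pert_ids = ["pertA"] * 5 + ["pertB"] * 5
print(perturbation_consistency(emb, pert_ids))
```

A high observed consistency with a small p-value indicates that replicates of the same perturbation cluster more tightly than chance, which is the property the benchmark is meant to reward.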
Discussion and Future Directions
The implications of these methods extend beyond their immediate application in microscopy and signal important shifts in how foundation models could transform related fields. Researchers working on high-content screening assays, known for their role in drug discovery, stand to benefit greatly from the proposed perturbation consistency metric, which could enable the discovery of novel relationships in massive, unannotated datasets, a crucial development in untangling complex biological phenomena.
Future work could extend these findings by iteratively refining the dataset-creation methodology, expanding applications to other microscopy modalities, or pushing computational boundaries with ever-larger models to assess whether returns diminish at extreme scales, a question only partially probed with Phenom-G/8.
In summary, this paper represents a substantial step toward applying self-supervised vision transformers in biological settings, presenting a compelling case for scaling and tailored dataset curation as the path to leveraging AI in transformative biological research.