Molecule-Morphology Contrastive Pretraining for Transferable Molecular Representation
Abstract: Image-based profiling techniques have become increasingly popular over the past decade for their applications in target identification, mechanism-of-action inference, and assay development. These techniques have generated large datasets of cellular morphologies, which are typically used to investigate the effects of small-molecule perturbagens. In this work, we extend the impact of such datasets to improving quantitative structure-activity relationship (QSAR) models by introducing Molecule-Morphology Contrastive Pretraining (MoCoP), a framework for learning multi-modal representations of molecular graphs and cellular morphologies. We scale MoCoP to approximately 100K molecules and 600K morphological profiles using data from the JUMP-CP Consortium and show that MoCoP consistently improves the performance of graph neural networks (GNNs) on molecular property prediction tasks in ChEMBL20 across all dataset sizes. The pretrained GNNs are also evaluated on internal GSK pharmacokinetic data and show an average improvement in AUPRC of 2.6% in the full-data regime and 6.3% in the low-data regime. Our findings suggest that integrating cellular morphologies with molecular graphs using MoCoP can significantly improve the performance of QSAR models, ultimately expanding the deep learning toolbox available for QSAR applications.
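The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of what a molecule-morphology contrastive loss could look like, assuming a CLIP-style symmetric InfoNCE objective. The function names, the temperature value, and the toy vectors are all hypothetical; in practice the embeddings would come from a GNN over molecular graphs and an encoder over morphological profiles, which are replaced here by plain lists of floats.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(mol_emb, morph_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of mol_emb and row i of morph_emb are assumed to describe the
    same compound; every other pairing in the batch acts as a negative.
    The temperature of 0.1 is an illustrative choice, not a value from the paper.
    """
    mol = [l2_normalize(v) for v in mol_emb]
    morph = [l2_normalize(v) for v in morph_emb]
    # logits[i][j]: scaled cosine similarity of molecule i and morphology j
    logits = [
        [sum(a * b for a, b in zip(m, p)) / temperature for p in morph]
        for m in mol
    ]
    logits_t = [list(col) for col in zip(*logits)]  # morphology-to-molecule view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two "molecule" embeddings and their matching "morphology" embeddings.
mol = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]  # deliberately mismatched pairs

loss_aligned = symmetric_contrastive_loss(mol, aligned)
loss_shuffled = symmetric_contrastive_loss(mol, shuffled)
print(loss_aligned < loss_shuffled)
```

Correctly paired embeddings give a lower loss than mismatched ones, which is the signal that would pull a molecule's representation toward the morphology it induces during pretraining.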