Extend MolBert to protein representations and improve its learning strategies

Develop methods that apply MolBert, a BERT-based language model for molecular representation learning, to other biological entities, specifically proteins, and design improved pre-training and auxiliary-task strategies that extend its applicability beyond small-molecule SMILES inputs.

Background

MolBert combines the BERT transformer architecture with domain-relevant auxiliary tasks (masked language modeling, SMILES equivalence classification, and physicochemical descriptor prediction) to learn molecular embeddings from SMILES strings. It achieves state-of-the-art performance on virtual screening and QSAR benchmarks, demonstrating that the choice of auxiliary tasks during pre-training matters.
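
To make the auxiliary-task setup concrete, the following is a minimal PyTorch sketch of how the three objectives could share one encoder. The class names, dimensions, and equal loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Sketch of MolBert-style multi-task pre-training with a tiny stand-in encoder.
# All names, sizes, and the equal loss weighting are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 42        # assumed SMILES token vocabulary size
HIDDEN = 256           # assumed encoder width
NUM_DESCRIPTORS = 200  # assumed number of physicochemical descriptors

class TinyEncoder(nn.Module):
    """Stand-in for the BERT encoder: token embeddings + transformer layers."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))       # (batch, seq, HIDDEN)

class MultiTaskHeads(nn.Module):
    """Three auxiliary heads sharing one encoder, mirroring MolBert's task setup."""
    def __init__(self):
        super().__init__()
        self.backbone = TinyEncoder()
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)        # masked language modeling
        self.equiv_head = nn.Linear(HIDDEN, 2)               # SMILES equivalence (binary)
        self.desc_head = nn.Linear(HIDDEN, NUM_DESCRIPTORS)  # descriptor regression

    def forward(self, token_ids):
        h = self.backbone(token_ids)
        pooled = h[:, 0]                                     # [CLS]-style pooling
        return self.mlm_head(h), self.equiv_head(pooled), self.desc_head(pooled)

def pretraining_loss(model, token_ids, mlm_targets, equiv_targets, desc_targets):
    """Sum the three auxiliary losses (equal weighting is an assumption)."""
    mlm_logits, equiv_logits, desc_preds = model(token_ids)
    l_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_targets.flatten(),
                            ignore_index=-100)               # -100 marks unmasked tokens
    l_equiv = F.cross_entropy(equiv_logits, equiv_targets)
    l_desc = F.mse_loss(desc_preds, desc_targets)
    return l_mlm + l_equiv + l_desc

# Smoke test on random data.
model = MultiTaskHeads()
ids = torch.randint(0, VOCAB_SIZE, (4, 32))
loss = pretraining_loss(model, ids,
                        mlm_targets=torch.randint(0, VOCAB_SIZE, (4, 32)),
                        equiv_targets=torch.randint(0, 2, (4,)),
                        desc_targets=torch.randn(4, NUM_DESCRIPTORS))
loss.backward()
```

In this framing, "improved learning strategies" amounts to changing which heads are attached and how their losses are weighted or scheduled during pre-training.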

While the study focuses on small molecules represented as SMILES, the authors explicitly flag two future directions: adapting MolBert to represent other entities such as proteins, and advancing its training strategies. This leaves open how to generalize MolBert beyond small molecules and how to refine its pre-training framework.
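
One concrete reading of the protein direction is to keep the encoder and masked-language-modeling objective unchanged and swap the SMILES vocabulary for amino-acid tokens. The tokenizer below is a hypothetical sketch of that vocabulary swap, not part of MolBert; protein-appropriate auxiliary tasks (analogues of descriptor prediction) would still need to be designed.

```python
# Hypothetical character-level tokenizer for protein sequences, illustrating
# the vocabulary swap needed to reuse a BERT-style encoder on proteins.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard residues
SPECIALS = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def tokenize_protein(sequence: str, max_len: int = 512) -> list[int]:
    """Map a protein sequence to token ids with [CLS]/[SEP] framing."""
    ids = [VOCAB["[CLS]"]]
    for residue in sequence[: max_len - 2]:
        ids.append(VOCAB.get(residue, VOCAB["[UNK]"]))
    ids.append(VOCAB["[SEP]"])
    return ids

print(tokenize_protein("MKTAYIAKQR"))  # [1, 15, 13, 21, 5, 24, 12, 5, 13, 18, 19, 2]
```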

References

"We leave to future work the exploration of how to use MolBert for learning representations of other entities such as proteins~\citep{simonovsky, Alley:2019fe, Kim2020deeppcm}, along with further developments in our learning strategies~\citep{pretraingraphs2019pande}."