
PepDoRA: A Unified Peptide Language Model via Weight-Decomposed Low-Rank Adaptation

Published 28 Oct 2024 in q-bio.BM | (2410.20667v1)

Abstract: Peptide therapeutics, including macrocycles, peptide inhibitors, and bioactive linear peptides, play a crucial role in therapeutic development due to their unique physicochemical properties. However, predicting these properties remains challenging. While structure-based models primarily focus on local interactions, language models are capable of capturing global therapeutic properties of both modified and linear peptides. Protein language models like ESM-2, though effective for natural peptides, cannot, however, encode chemical modifications. Conversely, pre-trained chemical language models excel in representing small molecule properties but are not optimized for peptides. To bridge this gap, we introduce PepDoRA, a unified peptide representation model. Leveraging Weight-Decomposed Low-Rank Adaptation (DoRA), PepDoRA efficiently fine-tunes ChemBERTa-77M-MLM on a masked language modeling objective to generate optimized embeddings for downstream property prediction tasks involving both modified and unmodified peptides. By tuning on a diverse and experimentally validated set of 100,000 modified, bioactive, and binding peptides, we show that PepDoRA embeddings capture functional properties of input peptides, enabling the accurate prediction of membrane permeability, non-fouling and hemolysis propensity, and, via contrastive learning, target protein-specific binding. Overall, by providing a unified representation for chemically and biologically diverse peptides, PepDoRA serves as a versatile tool for function and activity prediction, facilitating the development of peptide therapeutics across a broad spectrum of applications.
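The DoRA technique the abstract leverages decomposes each pretrained weight matrix into a magnitude vector and a direction matrix, then updates only the direction with a low-rank (LoRA-style) term while training the magnitude separately. A minimal NumPy sketch of the merged weight computation, using illustrative dimensions rather than ChemBERTa-77M-MLM's actual layer shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2  # toy dimensions for illustration only

W0 = rng.normal(size=(d_out, d_in))    # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # B starts at zero, so the update is zero initially
m = np.linalg.norm(W0, axis=0)         # trainable magnitude, initialized to W0's column norms

# DoRA merged weight: magnitude times the column-normalized direction (W0 + B @ A)
V = W0 + B @ A
W_merged = m * (V / np.linalg.norm(V, axis=0))

# At initialization (B = 0) the merged weight reproduces W0 exactly,
# so fine-tuning starts from the pretrained model's behavior.
assert np.allclose(W_merged, W0)
```

In practice this update is applied inside attention projection matrices during fine-tuning (e.g. via a PEFT-style adapter) rather than computed by hand; the sketch only shows why decomposing magnitude from direction leaves the pretrained weights intact at step zero.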
