One protein is all you need
Abstract: Generalization beyond training data remains a central challenge in machine learning for biology. A common way to enhance generalization is self-supervised pre-training on large datasets. However, aiming to perform well on all possible proteins can limit a model's capacity to excel on any specific one, whereas experimentalists typically need accurate predictions for the individual proteins they study, which are often not covered in training data. To address this limitation, we propose a method that enables self-supervised customization of protein language models to one target protein at a time, on the fly, and without assuming any additional data. We show that our Protein Test-Time Training (ProteinTTT) method consistently enhances generalization across different models, model sizes, and datasets. ProteinTTT improves structure prediction for challenging targets, achieves new state-of-the-art results on protein fitness prediction, and enhances function prediction on two tasks. Through two challenging case studies, we also show that customization via ProteinTTT achieves more accurate antibody-antigen loop modeling and improves 19% of structures in the Big Fantastic Virus Database, delivering better predictions where general-purpose AlphaFold2 and ESMFold struggle.
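The core idea — repeatedly masking random residues of the single target sequence and taking gradient steps on the masked-language-modeling loss before making a prediction — can be illustrated with a toy, self-contained sketch. This is not the paper's implementation: the real ProteinTTT fine-tunes a pre-trained protein language model, whereas here a simple softmax regression over flanking-residue context stands in for the model, and all names (`ttt_customize`, `mask_frac`, the window size) are illustrative choices.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {a: i for i, a in enumerate(AMINO_ACIDS)}
V = len(AMINO_ACIDS)  # vocabulary size (20 canonical amino acids)

def one_hot_context(seq_idx, pos, window=2):
    """Concatenated one-hot encoding of the residues flanking `pos`.

    The residue at `pos` itself is excluded, so it is effectively masked:
    the model must recover it from its sequence context alone.
    """
    feats = []
    for off in range(-window, window + 1):
        if off == 0:
            continue
        v = np.zeros(V)
        j = pos + off
        if 0 <= j < len(seq_idx):
            v[seq_idx[j]] = 1.0  # out-of-range neighbors stay all-zero
        feats.append(v)
    return np.concatenate(feats)

def ttt_customize(sequence, steps=300, mask_frac=0.15, lr=0.3, seed=0):
    """Toy test-time training loop on one target protein.

    Each step masks a random subset of positions and takes one gradient
    step on the masked cross-entropy, specializing the (here: linear)
    model to this single sequence. Returns the weights and the per-step
    average masked loss.
    """
    rng = np.random.default_rng(seed)
    seq_idx = np.array([AA_TO_IDX[a] for a in sequence])
    d = 4 * V  # window=2 neighbors on each side
    W = np.zeros((d, V))
    losses = []
    for _ in range(steps):
        n_mask = max(1, int(mask_frac * len(seq_idx)))
        masked = rng.choice(len(seq_idx), size=n_mask, replace=False)
        grad = np.zeros_like(W)
        loss = 0.0
        for pos in masked:
            x = one_hot_context(seq_idx, pos)
            logits = x @ W
            p = np.exp(logits - logits.max())
            p /= p.sum()
            loss -= np.log(p[seq_idx[pos]] + 1e-12)
            p[seq_idx[pos]] -= 1.0          # d(cross-entropy)/d(logits)
            grad += np.outer(x, p)
        W -= lr * grad / n_mask
        losses.append(loss / n_mask)
    return W, losses
```

On a periodic sequence the context fully determines each residue, so the masked loss falls from its uniform-prediction value of log 20 toward zero — the single-protein analogue of the specialization the abstract describes. In the actual method one would instead fine-tune a pre-trained PLM (optionally with parameter-efficient adapters) using the same single-sequence masking objective.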