Revealing data leakage in protein interaction benchmarks

Published 16 Apr 2024 in cs.LG | (2404.10457v1)

Abstract: In recent years, there has been remarkable progress in machine learning for protein-protein interactions. However, prior work has predominantly focused on improving learning algorithms, with less attention paid to evaluation strategies and data preparation. Here, we demonstrate that further development of machine learning methods may be hindered by the quality of existing train-test splits. Specifically, we find that commonly used splitting strategies for protein complexes, based on protein sequence or metadata similarity, introduce major data leakage. This may result in overoptimistic evaluation of generalization, as well as unfair benchmarking of the models, biased towards assessing their overfitting capacity rather than practical utility. To overcome the data leakage, we recommend constructing data splits based on 3D structural similarity of protein-protein interfaces and suggest corresponding algorithms. We believe that addressing the data leakage problem is critical for further progress in this research area.
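To make the recommendation concrete, here is a minimal sketch of a similarity-aware split. It is not the authors' algorithm: it assumes that pairs of structurally similar interfaces have already been computed by some external comparison tool with a chosen threshold, and all names (`cluster_by_similarity`, `split_by_cluster`, the toy IDs) are hypothetical. The idea illustrated is only that similar items are grouped into clusters and each cluster is assigned to one side of the split as a whole, so no similar pair straddles train and test.

```python
from collections import defaultdict


def cluster_by_similarity(ids, similar_pairs):
    """Group items into connected components of the similarity graph (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return list(clusters.values())


def split_by_cluster(clusters, test_fraction=0.2):
    """Assign whole clusters to train or test; a cluster is never divided."""
    total = sum(len(c) for c in clusters)
    train, test = [], []
    # Greedy heuristic: fill the test set with the smallest clusters first.
    for cluster in sorted(clusters, key=len):
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Toy example: hypothetical interface IDs and precomputed similar pairs.
interface_ids = ["A", "B", "C", "D", "E"]
similar_pairs = [("A", "B"), ("B", "C")]

clusters = cluster_by_similarity(interface_ids, similar_pairs)
train, test = split_by_cluster(clusters, test_fraction=0.2)
```

Connected components are a deliberately conservative choice here: two interfaces linked only through a chain of similar neighbours still end up on the same side. A random split over the same IDs gives no such guarantee, which is the kind of leakage the abstract describes.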
