
Large Language Models are In-Context Molecule Learners

Published 7 Mar 2024 in cs.CL and cs.AI | arXiv:2403.04197v4

Abstract: LLMs have demonstrated exceptional performance in biochemical tasks, especially the molecule-caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods for adapting LLMs to molecule-caption translation either required extra domain-specific pre-training stages, suffered from weak alignment between the molecular and textual spaces, or imposed stringent demands on the scale of the LLMs. To resolve these challenges, we propose In-Context Molecule Adaptation (ICMA), a new paradigm that allows LLMs to learn molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-Context Molecule Tuning. First, Hybrid Context Retrieval uses BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar, informative context examples. Next, Post-retrieval Re-ranking applies Sequence Reversal and Random Walk selection to further improve the quality of the retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of the LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA empowers LLMs to achieve state-of-the-art or comparable performance without extra training corpora or intricate structures, showing that LLMs are inherently in-context molecule learners.
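The BM25 retrieval and Sequence Reversal steps of the pipeline can be sketched roughly as follows. This is an illustrative assumption, not the authors' code: the function names, the toy caption corpus, and the pure-Python Okapi BM25 implementation are all made up for the sketch, and the paper's other components (Molecule Graph Retrieval and Random Walk selection) are omitted.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of every caption in `corpus` against `query`."""
    docs = [c.lower().split() for c in corpus]
    terms = query.lower().split()
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def retrieve_and_rerank(query, corpus, k=2):
    """Top-k BM25 hits, then Sequence Reversal: the most similar example
    is placed last, i.e. closest to the query in the assembled prompt."""
    scores = bm25_scores(query, corpus)
    top = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)[:k]
    return top[::-1]  # least- to most-similar order

corpus = [
    "aspirin inhibits cyclooxygenase",
    "caffeine is a stimulant alkaloid",
    "ibuprofen inhibits cyclooxygenase enzymes",
]
print(retrieve_and_rerank("inhibits cyclooxygenase", corpus))  # [2, 0]
```

The reversal reflects the intuition that in-context learning weights later examples more heavily, so placing the strongest match adjacent to the query should help the model most.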
