PromptLink: Leveraging Large Language Models for Cross-Source Biomedical Concept Linking

Published 13 May 2024 in cs.IR, cs.AI, and cs.CL (arXiv:2405.07500v1)

Abstract: Linking (aligning) biomedical concepts across diverse data sources enables various integrative analyses, but it is challenging due to discrepancies in concept naming conventions. Various strategies have been developed to overcome this challenge, including string-matching rules, manually crafted thesauri, and machine learning models. However, these methods are constrained by limited prior biomedical knowledge and can hardly generalize beyond their limited sets of rules, thesauri, or training samples. Recently, LLMs have exhibited impressive results on diverse biomedical NLP tasks thanks to their unprecedentedly rich prior knowledge and strong zero-shot prediction abilities. However, LLMs suffer from issues including high cost, limited context length, and unreliable predictions. In this research, we propose PromptLink, a novel biomedical concept linking framework that leverages LLMs. It first employs a biomedical-specialized pre-trained language model to generate candidate concepts that can fit within the LLM context window. It then uses an LLM to link concepts through two-stage prompts: the first-stage prompt elicits the LLM's biomedical prior knowledge for the concept linking task, and the second-stage prompt asks the LLM to reflect on its own predictions to further enhance their reliability. Empirical results on the concept linking task between two EHR datasets and an external biomedical KG demonstrate the effectiveness of PromptLink. Furthermore, PromptLink is a generic framework that does not rely on additional prior knowledge, context, or training data, making it well-suited for concept linking across various types of data sources. The source code is available at https://github.com/constantjxyz/PromptLink.
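The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names, prompt wordings, and toy embeddings are assumptions for exposition. In PromptLink itself, the candidate embeddings would come from a biomedical-specialized pre-trained language model, and the two prompts would be sent to an LLM.

```python
import numpy as np

def top_k_candidates(query_vec, concept_vecs, concept_names, k=2):
    """Rank KG concepts by cosine similarity to an EHR concept embedding,
    keeping only the top-k so the candidate list fits in the LLM context."""
    q = query_vec / np.linalg.norm(query_vec)
    m = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity to every concept
    order = np.argsort(-scores)[:k]     # indices of the k best matches
    return [concept_names[i] for i in order]

def first_stage_prompt(ehr_concept, candidates):
    """Stage 1 (illustrative wording): elicit the LLM's biomedical prior
    knowledge by asking it to pick the matching concept from the candidates."""
    return (
        f"Which of the following biomedical concepts refers to the same "
        f"entity as '{ehr_concept}'? Candidates: {', '.join(candidates)}. "
        f"Answer with exactly one candidate, or 'none'."
    )

def second_stage_prompt(ehr_concept, prediction):
    """Stage 2 (illustrative wording): ask the LLM to reflect on its own
    prediction to improve reliability."""
    return (
        f"You previously predicted that '{ehr_concept}' links to "
        f"'{prediction}'. Re-examine this prediction: do the two names "
        f"denote the same biomedical entity? Answer 'yes' or 'no'."
    )

# Toy usage with 2-d stand-in embeddings (real ones would be PLM vectors):
names = ["myocardial infarction", "migraine", "diabetes mellitus"]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
cands = top_k_candidates(np.array([0.9, 0.1]), vecs, names, k=2)
prompt1 = first_stage_prompt("heart attack", cands)
prompt2 = second_stage_prompt("heart attack", cands[0])
```

The candidate-generation step is what keeps LLM cost and context length under control: the LLM never sees the full KG, only a short shortlist per EHR concept, and the second prompt trades one extra call for a reliability check on each link.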
