
Extracting Polymer Nanocomposite Samples from Full-Length Documents

Published 1 Mar 2024 in cs.CL (arXiv:2403.00260v1)

Abstract: This paper investigates the use of LLMs to extract sample lists of polymer nanocomposites (PNCs) from full-length materials science research papers. The task is challenging because PNC samples have numerous attributes scattered throughout the text. Moreover, the complexity of annotating detailed PNC information limits the availability of labeled data, making conventional document-level relation extraction techniques impractical: creating comprehensive named entity span annotations is prohibitively difficult. To address this, we introduce a new benchmark and an evaluation technique for the task, and we explore different prompting strategies in a zero-shot setting. We also incorporate self-consistency to improve performance. Our findings show that even advanced LLMs struggle to extract all of the samples from an article. Finally, we analyze the errors encountered in this process, categorize them into three main challenges, and discuss potential strategies for future research to overcome them.
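The abstract mentions using self-consistency to improve extraction. A minimal sketch of one way such aggregation could work, assuming samples are normalized to hashable attribute tuples and keeping only samples that appear in a majority of independent LLM runs (the function name, tuple schema, and vote threshold here are illustrative assumptions, not the paper's actual method):

```python
from collections import Counter

def self_consistent_samples(runs, min_votes=2):
    """Aggregate PNC sample lists from multiple LLM extraction runs.

    runs: list of extracted sample lists, one per LLM call; each sample
    is a hashable tuple, e.g. (matrix_polymer, filler, filler_fraction).
    Returns the samples extracted in at least `min_votes` runs.
    """
    # Count each distinct sample once per run, then keep majority samples.
    votes = Counter(s for run in runs for s in set(run))
    return sorted(s for s, count in votes.items() if count >= min_votes)

# Three hypothetical zero-shot runs over the same article:
runs = [
    [("PMMA", "SiO2", "2 wt%"), ("PMMA", "SiO2", "5 wt%")],
    [("PMMA", "SiO2", "2 wt%")],
    [("PMMA", "SiO2", "2 wt%"), ("PMMA", "TiO2", "1 wt%")],
]
print(self_consistent_samples(runs))
# → [('PMMA', 'SiO2', '2 wt%')]
```

Voting over set-valued outputs like this filters out samples hallucinated in a single run, at the cost of dropping rare samples that only one run recovers.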

