
Quokka: An Open-source Large Language Model ChatBot for Material Science

Published 2 Jan 2024 in cs.CL, cs.AI, and cs.CE (arXiv:2401.01089v1)

Abstract: This paper presents the development of a specialized chatbot for materials science, leveraging the Llama-2 LLM with continued pre-training on an expansive corpus of materials science research articles from the S2ORC dataset. The methodology involves an initial pre-training phase on over one million domain-specific papers, followed by an instruction-tuning process to refine the chatbot's capabilities. The chatbot is designed to assist researchers, educators, and students by providing instant, context-aware responses to queries in the field of materials science. We make the four trained checkpoints (7B and 13B, with or without chat ability) freely available to the research community at https://github.com/Xianjun-Yang/Quokka.


Summary

  • The paper introduces Quokka, a domain-specific chatbot built by continued pre-training on over one million materials science articles, followed by instruction tuning.
  • The methodology combines continued pre-training of Llama-2 with FlashAttention and Fully Sharded Data Parallel training for efficient hardware utilization.
  • Results show reduced training loss and robust zero-shot capabilities, establishing Quokka as a valuable tool for research and education in materials science.

Introduction

The development of specialized artificial intelligence tools marks a significant stride in computational research, and materials science is no exception. One persistent challenge in this area is the lack of domain-specific resources that harness the power of LLMs. To fill this void, a new specialized chatbot has been developed for materials science, based on the Llama-2 LLM and pre-trained on over one million research articles in the domain. The resulting bot, named Quokka, serves the materials science community by providing immediate, relevant responses to inquiries and by facilitating research and education.

Methodology and Tools

Informed by prior LLMs such as GPT-2 and GPT-3, by related work demonstrating the effectiveness of instruction tuning, and leveraging existing open-source models such as Llama-2, the chatbot underwent continued pre-training on an extensive corpus of materials science papers to reinforce domain-specific knowledge. The training process comprised two critical phases: an initial pre-training stage to assimilate foundational scientific knowledge from the S2ORC dataset, and a subsequent instruction-tuning phase using tailored instructions to refine the model's responses. To ensure accessibility, the research team has released the trained model checkpoints, namely Quokka-7B and Quokka-13B, along with their chatbot counterparts, to the public.
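The instruction-tuning phase turns (instruction, response) pairs into single training strings. The summary does not specify the exact prompt template Quokka uses, so the helper below is a hedged sketch using the widely adopted Alpaca-style layout; the function name and template text are assumptions, not the authors' implementation.

```python
def format_instruction(instruction: str, response: str, context: str = "") -> str:
    """Render one instruction-tuning example as a single training string.

    The Alpaca-style template used here is an assumption; the paper's
    actual prompt format may differ.
    """
    header = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    parts = [header, f"### Instruction:\n{instruction}\n\n"]
    if context:
        # Optional grounding text, e.g. a paper abstract to summarize.
        parts.append(f"### Input:\n{context}\n\n")
    parts.append(f"### Response:\n{response}")
    return "".join(parts)
```

Each formatted string is then tokenized and used as a standard causal language-modeling target during fine-tuning.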

Experiment Design and Implementation

The pre-training employed a robust set of texts from the S2ORC dataset, complemented with texts from a general dataset to prevent forgetting of general language knowledge. Training ran efficiently on A100 GPUs using FlashAttention and Fully Sharded Data Parallel. Instruction tuning incorporated a curated set of instructions from various sources, ensuring that the trained models could handle both general and materials-science-specific dialogs effectively. Careful attention to hyperparameters such as learning rate, batch size, and gradient accumulation kept training time and computational cost in check.
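Mixing a general corpus into the domain corpus to mitigate catastrophic forgetting can be sketched as below. The mixing ratio and function names are assumptions for illustration; the paper does not report its exact proportion of general-domain text.

```python
import random

def mix_corpora(domain_docs, general_docs, general_fraction=0.1, seed=0):
    """Interleave domain-specific and general documents.

    `general_fraction` is the share of general text in the final mix;
    the 10% default is an assumed value, not the paper's setting.
    """
    rng = random.Random(seed)
    # Number of general docs needed so they make up `general_fraction`
    # of the combined corpus: n_g / (n_d + n_g) = general_fraction.
    n_general = int(len(domain_docs) * general_fraction / (1 - general_fraction))
    sample = rng.sample(general_docs, min(n_general, len(general_docs)))
    mixed = list(domain_docs) + sample
    rng.shuffle(mixed)  # avoid long runs of a single source
    return mixed
```

In practice the same idea is usually applied at the token or shard level by the data loader rather than on whole documents, but the proportion logic is identical.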

Results and Case Studies

The trained models demonstrated a significant reduction in training loss, with the larger 13B model reaching a lower final perplexity, signifying a robust understanding of materials science concepts. The instruction-tuning phase further refined the model's ability to answer materials science inquiries accurately and in context. The chatbot showed strong zero-shot performance, handling a broad range of questions from general material properties to summarizing research articles, while also maintaining ethical safeguards such as refusing to engage with unsafe queries.
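The perplexity figures reported for language models are simply the exponential of the mean token-level cross-entropy loss, so a lower final loss translates directly into lower perplexity. A minimal illustration of the standard relationship:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats per token)."""
    return math.exp(mean_ce_loss)

# A loss of 0 would mean the model predicts every token with certainty
# (perplexity 1); a loss of ln(10) corresponds to perplexity 10.
```

This is the generic definition, not a formula specific to this paper, but it is why loss curves and perplexity comparisons carry the same information.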

Conclusion

Quokka stands as a noteworthy contribution to computational materials science, offering researchers, educators, and industry professionals a valuable tool to navigate the rich landscape of academic literature and domain-specific queries. This paper not only presents a novel resource but also sets the stage for future enhancements, including more nuanced instruction tuning and expansion into multimodality to merge language understanding with visual data interpretation. The research acknowledges the support from the UCSB NSF Quantum Foundry and the National Science Foundation, indicating a collaborative effort in the advancement of scientific AI tools.
