
Quokka: An Open-source Large Language Model ChatBot for Material Science

Published 2 Jan 2024 in cs.CL, cs.AI, and cs.CE (arXiv:2401.01089v1)

Abstract: This paper presents the development of a specialized chatbot for materials science, leveraging the Llama-2 LLM with continued pre-training on an expansive corpus of materials science research articles from the S2ORC dataset. The methodology involves an initial pre-training phase on over one million domain-specific papers, followed by an instruction-tuning process to refine the chatbot's capabilities. The chatbot is designed to assist researchers, educators, and students by providing instant, context-aware responses to queries in the field of materials science. We make the four trained checkpoints (7B and 13B, with or without chat ability) freely available to the research community at https://github.com/Xianjun-Yang/Quokka.


Summary

  • The paper introduces Quokka, a domain-specific chatbot built by continued pre-training on over one million materials science articles, followed by instruction tuning.
  • The methodology combines continued pre-training of Llama-2 with FlashAttention and Fully Sharded Data Parallel training for efficient hardware utilization.
  • Results show reduced training loss and robust zero-shot capabilities, establishing Quokka as a valuable tool for research and education in materials science.

Introduction

The development of specialized artificial intelligence tools marks a significant stride in computational research, and materials science is no exception. One persistent challenge in this area is the lack of domain-specific resources that harness the power of LLMs. To fill this void, a new specialized chatbot has been developed for materials science, based on the Llama-2 LLM and pre-trained on over one million research articles in the domain. The resulting bot, named Quokka, serves the materials science community by providing immediate, relevant responses to inquiries and by facilitating research and education.

Methodology and Tools

Informed by prior LLMs such as GPT-2 and GPT-3, by related work demonstrating the effectiveness of instruction tuning, and leveraging existing open-source models such as Llama-2, the chatbot underwent continued pre-training on an extensive corpus of materials science papers to reinforce domain-specific knowledge. The training process comprised two critical phases: an initial pre-training stage to assimilate foundational scientific knowledge from the S2ORC dataset, and a subsequent instruction-tuning phase using tailored instructions to refine the model's responses. To ensure accessibility, the research team has released the trained model checkpoints, namely Quokka-7B and Quokka-13B, along with their chatbot counterparts, to the public.
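The instruction-tuning phase turns (instruction, response) pairs into single training strings. The summary does not specify the exact prompt template Quokka uses, so the helper below is a hedged sketch using the widely adopted Alpaca-style layout; the function name and template text are assumptions, not the authors' implementation.

```python
def format_instruction(instruction: str, response: str, context: str = "") -> str:
    """Render one instruction-tuning example as a single training string.

    The Alpaca-style template used here is an assumption; the paper's
    actual prompt format may differ.
    """
    header = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    parts = [header, f"### Instruction:\n{instruction}\n\n"]
    if context:
        # Optional grounding text, e.g. a paper abstract to summarize.
        parts.append(f"### Input:\n{context}\n\n")
    parts.append(f"### Response:\n{response}")
    return "".join(parts)
```

Each formatted string is then tokenized and used as a standard causal language-modeling target during fine-tuning.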

Experiment Design and Implementation

The pre-training employed a robust set of texts from the S2ORC dataset, complemented with texts from a general dataset to prevent forgetting of general language knowledge. Training ran efficiently on A100 GPUs using FlashAttention and Fully Sharded Data Parallel. Instruction tuning incorporated a curated set of instructions from various sources, ensuring that the trained models could handle both general and materials-science-specific dialogs effectively. Careful attention to hyperparameters such as learning rate, batch size, and gradient accumulation kept training time and computational cost in check.
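Mixing a general corpus into the domain corpus to mitigate catastrophic forgetting can be sketched as below. The mixing ratio and function names are assumptions for illustration; the paper does not report its exact proportion of general-domain text.

```python
import random

def mix_corpora(domain_docs, general_docs, general_fraction=0.1, seed=0):
    """Interleave domain-specific and general documents.

    `general_fraction` is the share of general text in the final mix;
    the 10% default is an assumed value, not the paper's setting.
    """
    rng = random.Random(seed)
    # Number of general docs needed so they make up `general_fraction`
    # of the combined corpus: n_g / (n_d + n_g) = general_fraction.
    n_general = int(len(domain_docs) * general_fraction / (1 - general_fraction))
    sample = rng.sample(general_docs, min(n_general, len(general_docs)))
    mixed = list(domain_docs) + sample
    rng.shuffle(mixed)  # avoid long runs of a single source
    return mixed
```

In practice the same idea is usually applied at the token or shard level by the data loader rather than on whole documents, but the proportion logic is identical.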

Results and Case Studies

The trained models demonstrated a significant reduction in training loss, with the larger 13B model reaching a lower final perplexity, signifying a robust understanding of materials science concepts. The instruction-tuning phase further refined the model's ability to answer materials science inquiries accurately and in context. The chatbot showed strong zero-shot performance, handling a broad range of questions from general material properties to summarizing research articles, while also maintaining ethical safeguards such as refusing to engage with unsafe queries.
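The perplexity figures reported for language models are simply the exponential of the mean token-level cross-entropy loss, so a lower final loss translates directly into lower perplexity. A minimal illustration of the standard relationship:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp(mean cross-entropy loss in nats per token)."""
    return math.exp(mean_ce_loss)

# A loss of 0 would mean the model predicts every token with certainty
# (perplexity 1); a loss of ln(10) corresponds to perplexity 10.
```

This is the generic definition, not a formula specific to this paper, but it is why loss curves and perplexity comparisons carry the same information.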

Conclusion

Quokka stands as a noteworthy contribution to computational materials science, offering researchers, educators, and industry professionals a valuable tool to navigate the rich landscape of academic literature and domain-specific queries. This paper not only presents a novel resource but also sets the stage for future enhancements, including more nuanced instruction tuning and expansion into multimodality to merge language understanding with visual data interpretation. The research acknowledges the support from the UCSB NSF Quantum Foundry and the National Science Foundation, indicating a collaborative effort in the advancement of scientific AI tools.
