Quokka: An Open-source Large Language Model ChatBot for Material Science
Abstract: This paper presents the development of a specialized chatbot for materials science, built on the Llama-2 LLM with continued pre-training on research articles from the materials science portion of the S2ORC corpus. The methodology involves an initial continued-pretraining phase on over one million domain-specific papers, followed by instruction tuning to refine the chatbot's conversational abilities. The chatbot is designed to assist researchers, educators, and students by providing instant, context-aware responses to queries in materials science. We make the four trained checkpoints (7B and 13B, each with and without chat ability) freely available to the research community at https://github.com/Xianjun-Yang/Quokka.
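The instruction-tuning stage described above relies on pairing domain questions with reference answers in the chat format the base model expects. As a minimal sketch, the snippet below formats one (instruction, response) pair into the standard Llama-2 chat prompt template; the system prompt, helper name, and example data are illustrative assumptions, not Quokka's actual training code.

```python
# Illustrative sketch: wrapping a materials-science instruction example
# in the Llama-2 chat prompt template ([INST] / <<SYS>> delimiters).
# SYSTEM_PROMPT and format_example are hypothetical, chosen for this example.

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

SYSTEM_PROMPT = "You are a helpful assistant specialized in materials science."

def format_example(instruction: str, response: str) -> str:
    """Wrap one (instruction, response) pair in the Llama-2 chat format."""
    prompt = f"{B_INST} {B_SYS}{SYSTEM_PROMPT}{E_SYS}{instruction} {E_INST}"
    return f"{prompt} {response}"

sample = format_example(
    "What crystal structure does NaCl adopt at room temperature?",
    "NaCl adopts the rock-salt (face-centered cubic) structure.",
)
```

Strings formatted this way would then be tokenized and fed to the model as supervised fine-tuning targets, with the loss typically computed only on the response portion.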