Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding
Abstract: In this paper, we present significant advancements in the pretraining of Mistral 7B, a large-scale LLM, on a 32.6 GB dataset equivalent to 1.1 billion tokens. We explore the impact of extending the context length, releasing models with context lengths of 4096 and 32768 tokens, and further refine performance with a specialized instruction-tuned model with a 16384-token context length, which we call Malaysian Mistral. Our experiments demonstrate the efficacy of continued pretraining and the influence of extended context lengths on Mistral 7B's language understanding capabilities. Additionally, we release a model tuned specifically on 16384-token-context instructions, showcasing its potential for capturing nuanced language intricacies. Furthermore, our research contributes to the benchmarking of Malaysian Mistral against prominent LLMs, including ChatGPT-3.5 and Claude 2. We present compelling results indicating Malaysian Mistral's superior performance on a Tatabahasa (Malay grammar) test set, particularly when fine-tuned with instructions. All models are released at https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c
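To illustrate how the released checkpoints might be used, below is a minimal Python sketch that loads one of the models from the linked Hugging Face collection with the transformers library and generates a response to a Malay grammar prompt. The model ID shown is an assumption inferred from the collection name, not confirmed by the paper; consult the collection page for the exact identifiers and any recommended generation settings.

```python
# Minimal sketch: loading a Malaysian Mistral checkpoint and running inference.
# The model ID below is a hypothetical example based on the linked collection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mesolitica/malaysian-mistral-7b-32k-instructions"  # assumed ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision so a 7B model fits on one GPU
    device_map="auto",
)

# A Malay prompt exercising the instruction-tuned model on Tatabahasa-style usage.
prompt = "Terangkan perbezaan antara 'pada' dan 'kepada' dalam tatabahasa Melayu."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"`, accelerate places the weights across available devices automatically; on CPU-only machines, expect substantially slower generation for a 7B-parameter model.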