Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning
Abstract: Existing research has demonstrated that fine-tuning LLMs on machine-generated instruction-following data enables impressive zero-shot capabilities on novel tasks, without requiring human-authored instructions. In this paper, we systematically investigate, preprocess, and integrate three Chinese instruction-following datasets with the aim of enhancing the Chinese conversational capabilities of the Mixtral-8x7B sparse Mixture-of-Experts model. Through instruction fine-tuning on this carefully processed data, we construct an instruction-tuned Mixtral-8x7B model named "Aurora." To assess Aurora's performance, we use three widely recognized benchmarks: C-Eval, MMLU, and CMMLU. Empirical results validate the effectiveness of instruction fine-tuning applied to the Mixtral-8x7B sparse Mixture-of-Experts model. To our knowledge, this work is the first to apply instruction fine-tuning to a sparse Mixture-of-Experts model, marking a significant step in enhancing the capabilities of this architecture. Our code, data, and model are publicly available at https://github.com/WangRongsheng/Aurora
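As a concrete illustration of the recipe the abstract describes — parameter-efficient instruction fine-tuning of a sparse Mixture-of-Experts model — the sketch below uses Hugging Face `transformers`, `peft`, and `datasets` to attach LoRA adapters to Mixtral-8x7B and train on an instruction-formatted corpus. This is a minimal sketch under stated assumptions: the dataset file `chinese_instructions.json`, the Alpaca-style prompt template, and all hyperparameters are illustrative, not the paper's actual configuration; the authors' real training code is in the linked repository.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "mistralai/Mixtral-8x7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Mixtral's tokenizer has no pad token by default

model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach LoRA adapters to the attention projections; the MoE expert and
# router weights stay frozen, so only a small fraction of parameters train.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

def to_features(example):
    # Alpaca-style template; assumes "instruction" and "output" columns
    # in the (hypothetical) merged Chinese instruction dataset.
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}{tokenizer.eos_token}"
    )
    return tokenizer(text, truncation=True, max_length=1024)

raw = load_dataset("json", data_files="chinese_instructions.json")["train"]
dataset = raw.map(to_features, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="aurora-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    # Causal-LM collator: pads each batch and copies input_ids to labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("aurora-lora")  # saves only the small adapter weights
```

Restricting the LoRA targets to the attention projections leaves the expert FFNs and routing networks untouched, which keeps memory requirements modest; whether the paper also adapts expert weights is not stated in the abstract.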