An Empirical Study of Instruction-tuning Large Language Models in Chinese

Published 11 Oct 2023 in cs.CL and cs.AI | (2310.07328v2)

Abstract: The success of ChatGPT validates the potential of LLMs in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community's interest in instruction-tuning, which is deemed to accelerate ChatGPT's replication process. However, research on instruction-tuning LLMs in Chinese, the world's most spoken language, is still in its early stages. Therefore, this paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook that provides valuable findings for effectively customizing LLMs that can better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, and instruction data types, which are the three most important elements for instruction-tuning. Besides, we also conduct experiments to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to the open Chinese version of ChatGPT. This paper will release a powerful Chinese LLM that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.

Citations (15)

Summary

  • The paper presents a comprehensive empirical analysis of instruction-tuning for Chinese LLMs, emphasizing key components like LLM bases, tuning methods, and dataset quality.
  • It reveals that parameter-efficient methods such as LoRA significantly boost performance while keeping the parameter footprint manageable.
  • Results indicate that using diverse instruction datasets and native language prompts greatly improves model adaptability and competitiveness.

An Empirical Study of Instruction-tuning LLMs in Chinese

The paper provides a comprehensive empirical analysis of instruction-tuning LLMs in Chinese, a research area that remains underexplored despite the prominence of Chinese as the most spoken language globally. This study is positioned within the context of burgeoning interest in LLMs following the success of models like ChatGPT and LLaMA, with a focus on creating a Chinese-centric analog to these technologies.

Core Components of Instruction-Tuning

The authors identify three critical components in the instruction-tuning process: LLM bases, parameter-efficient methods, and instruction datasets. By systematically examining these elements, the study aims to optimize the instruction-following capabilities of LLMs tailored for Chinese.

  1. LLM Bases: The paper evaluates various open LLMs, such as LLaMA, Bloom, and Moss, highlighting Bloom's balanced performance across benchmarks due to its multilingual nature. In contrast, models like Vicuna and ChatGLM, although initially robust, show mixed results when subjected to further tuning using Alpaca-GPT4.
  2. Parameter-Efficient Methods: The study assesses multiple methods, including LoRA, AdaLoRA, and prefix-tuning. LoRA emerges as a particularly effective approach, offering significant improvements with a manageable parameter footprint.
  3. Instruction Datasets: Diverse datasets, such as Alpaca-GPT4, Belle, and ShareGPT-zh, contribute varied strengths to the models. Belle's large dataset provides substantial gains, while ChatGPT-generated datasets improve broad instruction-following tasks.
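The parameter savings behind methods like LoRA can be sketched in a few lines. The shapes and hyperparameters below are illustrative, not the paper's exact configuration: LoRA freezes the pretrained weight W and learns only a low-rank delta B·A, scaled by alpha/r, so the trainable footprint is r·(d_in + d_out) instead of d_in·d_out.

```python
import numpy as np

# Hypothetical layer shapes for illustration (not the paper's exact setup).
d_in, d_out, r, alpha = 4096, 4096, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01      # trainable low-rank factor
B = np.zeros((d_out, r))                       # zero init: the model starts unchanged

def lora_forward(x, W, A, B, alpha, r):
    """Forward pass with the scaled low-rank delta added to the frozen weight."""
    return x @ (W + (alpha / r) * B @ A).T

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(f"trainable params: {lora_params} vs full fine-tuning: {full_params}")
```

With these shapes LoRA trains well under 1% of the full matrix's parameters, which is why the study can compare many base-model/dataset combinations at manageable cost.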

Additional Factors Influencing Model Performance

The study further explores several ancillary factors:

  • Chain-of-Thought (CoT) Data: Incorporating CoT data can enhance reasoning on complex tasks, though occasionally at the cost of broader instruction-following performance.
  • Vocabulary Expansion: Expanding Chinese vocabulary in models like LLaMA requires subsequent pre-training to be effective, underscoring the importance of tailored linguistic adaptation.
  • Prompt Language: Instruction-tuning benefits from using native language prompts for models less attuned to Chinese, while models with established multilingual capabilities, like Bloom, perform well with English prompts.
  • Human Value Alignment: The integration of human-value alignment data can lead to a minor drop in model performance, indicating a delicate balance between ethical considerations and technical efficacy.
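The prompt-language factor amounts to swapping the scaffold around the instruction while keeping the instruction itself in Chinese. The Alpaca-style templates below are illustrative wordings, not necessarily the exact strings used in the paper's experiments:

```python
# Illustrative English vs. Chinese instruction scaffolds (wording is an assumption).
EN_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

ZH_TEMPLATE = (
    "下面是一条描述任务的指令。请写出恰当完成该请求的回复。\n\n"
    "### 指令:\n{instruction}\n\n### 回复:\n"
)

def build_prompt(instruction: str, lang: str = "zh") -> str:
    """Wrap a (Chinese) instruction in either an English or a Chinese scaffold."""
    template = ZH_TEMPLATE if lang == "zh" else EN_TEMPLATE
    return template.format(instruction=instruction)

print(build_prompt("把下面的句子翻译成英文:你好,世界。", lang="en"))
```

Only the scaffold language changes between the two conditions, isolating the effect the study measures: whether a given base model follows Chinese instructions better under a native-language or an English prompt.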

Evaluation and Results

The models are evaluated using two benchmarks: Belle-eval for general instruction-following and MMCU for professional knowledge assessment. Results show that Chinese-centric instruction-tuning significantly enhances performance across tasks, with the newly released model rivaling existing models like ChatGLM, despite using fewer trainable parameters.
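For the multiple-choice side of such an evaluation, scoring reduces to extracting the model's chosen option and comparing it to the gold answer. The harness below is a minimal sketch in the spirit of MMCU-style scoring; the benchmark's real answer-extraction rules are an assumption here:

```python
import re

def extract_choice(output: str):
    """Take the first A-D letter in the model's output as its answer (a heuristic)."""
    match = re.search(r"[ABCD]", output)
    return match.group(0) if match else None

def accuracy(outputs, golds):
    """Fraction of outputs whose extracted choice matches the gold letter."""
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, golds))
    return correct / len(golds)

# Toy examples: two of the four extracted choices match the gold answers.
outputs = ["答案是 B。", "C", "我认为选 A", "无法确定"]
golds = ["B", "C", "D", "A"]
print(accuracy(outputs, golds))
```

Free-form benchmarks like Belle-eval instead require judging full responses (often with a scoring model), which is why the paper reports the two benchmarks separately.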

Implications and Future Directions

The findings from this study have both practical and theoretical implications, particularly in customizing LLMs for Chinese applications while maintaining efficiency through parameter-efficient tuning methods. The authors' approach provides a methodological framework for future research, encouraging further exploration into optimizing LLMs with high-quality, diverse datasets.

Moreover, as AI continues to evolve, the adaptability of LLMs to different linguistic and cultural contexts will remain a priority. This research sets a precedent for further development of LLMs in various languages, emphasizing the need for detailed empirical analysis and methodical execution in fine-tuning processes.

Conclusion

This paper contributes valuable insights into the instruction-tuning of LLMs in Chinese, demonstrating the importance of model selection, dataset quality, and efficient tuning methodologies. The release of a competitive Chinese LLM underscores the potential to advance AI capabilities in nuanced and multilingual environments. Future research can build on these findings to enhance model robustness and cultural adaptability in LLMs globally.
