
RakutenAI-7B: Extending Large Language Models for Japanese

Published 21 Mar 2024 in cs.CL and cs.LG | (2403.15484v1)

Abstract: We introduce RakutenAI-7B, a suite of Japanese-oriented LLMs that achieve the best performance on the Japanese LM Harness benchmarks among the open 7B models. Along with the foundation model, we release instruction- and chat-tuned models, RakutenAI-7B-instruct and RakutenAI-7B-chat respectively, under the Apache 2.0 license.


Summary

  • The paper introduces a specialized suite of Japanese LLMs that significantly enhance tokenization and benchmark performance.
  • It details an advanced methodology featuring an extended tokenizer with 16,000 additional tokens and training on 175 billion filtered tokens.
  • The models, including instruct and chat variants, are fine-tuned for natural language tasks while ensuring safety and high-quality responses in both Japanese and English.

Extensive Evaluation and Release of RakutenAI-7B: A Suite of Japanese-Oriented LLMs

Introduction to RakutenAI-7B

The development of LLMs has predominantly focused on the English language, leaving a significant gap in resources and technology for other languages, including Japanese. RakutenAI-7B represents a pivotal advancement in closing this gap by introducing a suite of Japanese-oriented LLMs. These models are designed to excel at understanding and generating Japanese text while maintaining competitive performance on English datasets. The suite comprises RakutenAI-7B, RakutenAI-7B-instruct, and RakutenAI-7B-chat, each serving a different specialized purpose. All three build on the Mistral model architecture, use an extended tokenizer for more efficient Japanese text processing, and have demonstrated strong performance across a range of benchmarks.

Enhancements in Tokenization and Pre-training

Tokenizer Extension

One of the critical advancements introduced by RakutenAI-7B is the extension of the Mistral tokenizer. The original Mistral tokenizer often struggled to encode Japanese characters efficiently, particularly kanji, due to its limited 32,000-token vocabulary. By adding 16,000 tokens specifically tailored for Japanese, the extended tokenizer encompasses 48,000 tokens, significantly improving the character-per-token rate for Japanese text. This enhancement eases the context-length and processing-efficiency limitations LLMs face when handling Japanese inputs.
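The character-per-token rate mentioned above can be made concrete with a small sketch. The token counts below are illustrative, not measured from the actual tokenizers; the point is only how the metric is computed and why a larger Japanese vocabulary raises it.

```python
# Sketch of the characters-per-token metric: how much text each token covers.
# A higher rate means the same context window fits more Japanese text.

def chars_per_token(text: str, token_count: int) -> float:
    """Average number of characters covered by each token."""
    return len(text) / token_count

sentence = "楽天グループは日本語に特化した大規模言語モデルを公開した。"

# Hypothetical counts: a base tokenizer with sparse kanji coverage falls back
# to byte-level pieces, while an extended vocabulary captures whole words.
base_tokens = 28
extended_tokens = 14

print(chars_per_token(sentence, base_tokens))      # base tokenizer rate
print(chars_per_token(sentence, extended_tokens))  # extended tokenizer rate
```

With real tokenizers the same comparison would be made by counting the output of each tokenizer's encode step on a held-out Japanese corpus.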

Foundation Model Training

During pre-training, RakutenAI-7B applied data filtering techniques to improve the quality of the internet-scale datasets used for both Japanese and English. The filtering procedure, which included normalization, deduplication, and classification stages, removed personally identifiable information and ensured the models were trained on high-quality data. Training used approximately 175 billion tokens of this filtered data.
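The three filtering stages can be sketched as a minimal pipeline. This is an illustrative simplification, not the paper's implementation: the real classification stage would use trained quality and PII classifiers, whereas here a trivial length heuristic stands in.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # NFKC normalization unifies full-width/half-width variants,
    # a common first step for Japanese web text.
    return unicodedata.normalize("NFKC", text).strip()

def deduplicate(docs: list[str]) -> list[str]:
    # Exact deduplication via content hashing; production pipelines
    # often add near-duplicate detection (e.g., MinHash) on top.
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def passes_quality_filter(doc: str) -> bool:
    # Stand-in for the classification stage: keep documents
    # above a minimal length. A real filter would be model-based.
    return len(doc) >= 10

corpus = [
    "ＨｅｌｌｏＷｏｒｌｄ! これはサンプル文書です。",  # full-width duplicate
    "short",
    "ＨｅｌｌｏＷｏｒｌｄ! これはサンプル文書です。",
]
cleaned = [normalize(d) for d in corpus]
filtered = [d for d in deduplicate(cleaned) if passes_quality_filter(d)]
print(len(filtered))  # duplicates and too-short documents removed
```

Normalizing before deduplication matters: the two full-width copies above only hash identically once NFKC has mapped them to the same form.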

Model Fine-Tuning and Evaluation

Fine-tuning for Specialized Uses

Building on the foundation model, Rakuten Group Inc. developed RakutenAI-7B-instruct and RakutenAI-7B-chat through fine-tuning with a mix of open-source and proprietary datasets. The principal aim was to refine RakutenAI-7B-instruct for improved adherence to instructions and prompts, and to enhance RakutenAI-7B-chat to generate more natural, conversational responses. An emphasis was also placed on safety, tuning both models on datasets designed to mitigate the production of content that could be deemed explicit, offensive, or biased.
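Instruction tuning of this kind typically relies on a fixed prompt template that wraps each instruction and optional input. The template below is a generic Alpaca-style sketch for illustration only; the actual format used by RakutenAI-7B-instruct is not specified in this summary.

```python
def build_instruct_prompt(instruction: str, user_input: str = "") -> str:
    """Hypothetical instruction-tuning template (Alpaca-style).
    The real RakutenAI-7B-instruct format may differ."""
    prompt = f"### Instruction:\n{instruction}\n"
    if user_input:
        prompt += f"### Input:\n{user_input}\n"
    prompt += "### Response:\n"
    return prompt

print(build_instruct_prompt("次の文を英語に翻訳してください。", "楽天は日本の企業です。"))
```

At fine-tuning time, each training example is rendered through such a template with the reference answer appended after the response marker; at inference time, generation begins right after the marker.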

Comprehensive Evaluation

The models underwent extensive evaluation across a suite of both Japanese and English NLP tasks, conducted with the Language Model Evaluation Harness (LM-Harness). RakutenAI-7B showed strong results, consistently outperforming other open 7B models in both languages across tests of common-sense reasoning, natural language inference, sentiment analysis, reading comprehension, and mathematical problem solving. Notably, on the Japanese tasks RakutenAI-7B achieved an average score of 62.83, a significant lead over other models.
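The reported 62.83 is an unweighted mean over per-task scores. The sketch below shows that aggregation; the individual task scores are hypothetical placeholders chosen so the mean matches the reported average, not numbers from the paper.

```python
def average_score(task_scores: dict[str, float]) -> float:
    """Unweighted mean over tasks, rounded to two decimals as in the report."""
    return round(sum(task_scores.values()) / len(task_scores), 2)

# Hypothetical per-task scores (placeholders, not the paper's results),
# contrived so the unweighted mean equals the reported 62.83.
japanese_scores = {
    "common_sense_reasoning": 65.0,
    "natural_language_inference": 60.0,
    "sentiment_analysis": 63.0,
    "reading_comprehension": 64.0,
    "math_problem_solving": 62.15,
}
print(average_score(japanese_scores))
```

An unweighted mean treats every task equally regardless of dataset size, which is the convention for the harness averages cited in this summary.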

Conclusion and Future Implications

The introduction and release of RakutenAI-7B, along with its specialized variants, marks a significant step forward in the development of Japanese-oriented LLMs. The attention to detail in tokenizer extension, meticulous foundation model training, and specialized fine-tuning underscore the advanced capability of these models in processing and generating Japanese text. The models' release under the Apache 2.0 License broadens their accessibility for further research, development, and application across various domains.

The implications of this work extend beyond the practical applications of these models. They highlight the potential for more inclusive language technology development, extending the benefits of LLM advancements to languages beyond English. Future research can build on this work to explore even more languages and dialects, further democratizing access to cutting-edge AI technology.

Acknowledgements

The acknowledgment section sheds light on the collaborative effort behind RakutenAI-7B, thanking Rakuten Group, Inc. for its support and the broader research community for its contributions to foundation model research. This collective effort underscores the importance of collaboration and an open-source ethos in driving the field of AI forward.
