Taiwan LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model
Abstract: In the realm of large language models (LLMs), the nuanced linguistic and cultural intricacies of Traditional Chinese, as spoken in Taiwan, have been largely overlooked. This paper introduces Taiwan LLM, a pioneering LLM that specifically caters to the Traditional Chinese language, with a focus on the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and instruction-finetuning datasets, we have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan. Taiwan LLM is the first of its kind: a model that is not only linguistically accurate but also culturally resonant with its user base. Our evaluations demonstrate that Taiwan LLM achieves superior performance in understanding and generating Traditional Chinese text, outperforming existing models trained predominantly on Simplified Chinese or English. The open-source release of Taiwan LLM invites collaboration and further innovation, ensuring that the linguistic diversity of Chinese speakers is embraced and well served. The model, datasets, and further resources are made publicly available to foster ongoing research and development in this field.
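Since the abstract notes that the model is released openly, a minimal sketch of loading and querying it with the Hugging Face `transformers` library is shown below. The repository ID is an assumption for illustration; check the authors' release page for the exact name.

```python
# Minimal sketch: load the open-source Taiwan LLM chat model and generate
# a Traditional Chinese response. Requires `transformers`, `torch`, and
# `accelerate` (for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "yentinglin/Taiwan-LLM-7B-v2.0-chat"  # assumed repo ID, not confirmed by the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit a single consumer GPU
    device_map="auto",          # place layers automatically across available devices
)

prompt = "台灣最高的山是哪一座？"  # "Which mountain is the tallest in Taiwan?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In practice, a chat-tuned checkpoint would typically be queried through its chat template (e.g., `tokenizer.apply_chat_template`) rather than a raw prompt, but the bare-prompt form above keeps the sketch self-contained.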