Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
In this paper, the authors present Babel, an open multilingual large language model (LLM) that addresses the lack of comprehensive language support in existing open-source LLMs. Babel distinguishes itself by covering the top 25 languages by number of speakers, thereby serving over 90% of the world's population. This coverage includes widely spoken but under-resourced languages that are often overlooked in the development of existing multilingual LLMs.
Key Contributions
The paper introduces two variants of Babel: Babel-9B and Babel-83B, designed with distinct objectives and capacities. Babel-9B is optimized for efficient inference and fine-tuning, while Babel-83B sets a new benchmark for open multilingual LLM performance, even competing with commercial models.
Layer Extension Technique: Because continued pretraining alone cannot raise a model's capacity ceiling, Babel employs a layer extension technique that systematically adds new layers to the existing model architecture. The expansion preserves the architectural integrity of the original model and specifically targets the latter half of the layer stack, which is less sensitive to structural changes. The authors explored various initialization strategies for the new layers, with Gaussian noise initialization offering the best balance between stability and training adaptability.
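The layer extension idea can be illustrated with a minimal sketch. Here each transformer block is stood in for by a single weight array, and new, Gaussian-noise-initialized layers are interleaved into the latter half of the stack. The function name `extend_model`, the insertion-point choice, and the noise scale are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def extend_model(layers, num_new, noise_std=0.02, rng=None):
    """Sketch of layer extension: insert num_new noise-initialized
    layers into the latter half of a layer stack.

    layers  : list of per-layer weight arrays (stand-ins for blocks)
    num_new : number of layers to add
    """
    rng = rng or np.random.default_rng(0)
    half = len(layers) // 2
    # Pick evenly spaced insertion points in the latter half only,
    # since those layers are less sensitive to structural changes.
    positions = np.linspace(half, len(layers), num_new,
                            endpoint=False).astype(int)
    extended = list(layers)
    for offset, pos in enumerate(sorted(positions)):
        # New layers start from small Gaussian noise (one of the
        # initialization strategies the paper compares).
        new_layer = rng.normal(0.0, noise_std, size=layers[0].shape)
        extended.insert(pos + offset, new_layer)
    return extended
```

In a real model the inserted blocks would then be trained during continued pretraining, while the untouched earlier layers anchor the model's existing knowledge.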
Data Optimization: Recognizing the uneven availability of high-quality data across languages, the authors developed an LLM-based quality classifier to enhance the data-cleaning pipeline, ensuring high-quality input data for under-resourced languages. The model's training data was sourced from a diverse set of corpora, including Wikipedia, news articles, and other curated datasets, further refined through deduplication and normalization processes.
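The deduplication and normalization steps of such a pipeline can be sketched with stdlib tools. This is a simplified illustration, not the paper's actual pipeline: the LLM-based quality classifier is out of scope here, and hash-based exact deduplication stands in for whatever matching the authors used:

```python
import hashlib
import unicodedata

def normalize(text):
    """Unicode-normalize (NFKC) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def deduplicate(docs):
    """Drop exact duplicates after normalization, keeping the first
    occurrence of each document."""
    seen, kept = set(), []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(norm)
    return kept
```

A quality filter, such as the paper's LLM-based classifier, would then score each surviving document before it enters the training corpus.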
Performance Evaluation
The paper provides detailed performance evaluations, comparing Babel models against other leading open and commercial LLMs across multilingual benchmarks such as MMMLU, M3Exam, and Flores-200. Babel-9B emerged as the best-performing open-source model in its parameter range on multilingual reasoning and understanding tasks, surpassing models such as GLM4-9B and Qwen2.5-7B. Similarly, Babel-83B matched or outperformed commercial models such as GPT-4o on several tasks, including XCOPA and XNLI, positioning itself as a leader among open multilingual LLMs.
Practical and Theoretical Implications
Babel's development advances the accessibility and functionality of LLMs for a global audience. The inclusion of major, under-resourced languages in a single model not only addresses current linguistic disparities but also sets a new standard for inclusivity in AI. This capability holds practical significance for real-world applications, such as multilingual communication tools, translation services, and cross-cultural content development.
Theoretically, the research uncovers new insights into model expansion techniques, particularly the efficacy of layer extension in scaling model architectures without compromising performance. This could inform future explorations into increasing model capacity efficiently, balancing parameters with computational resources.
Future Directions
The paper suggests potential future investigations into broadening the scope of languages and scripts supported by Babel. Additionally, enhancing the supervised fine-tuning process with a more diverse dataset pool and exploring advanced alignment strategies could further boost Babel's performance. These advancements are critical as AI continues to integrate more deeply into multilingual and multicultural settings.
Overall, Babel represents a significant contribution to the ongoing development of multilingual LLMs, striving to democratize NLP applications and improve accessibility for previously underserved linguistic communities.