Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers
In this paper, the authors present Babel, an open multilingual large language model (LLM) that addresses the lack of comprehensive language support in existing open-source LLMs. Babel distinguishes itself by covering the top 25 languages by number of speakers, thereby serving over 90% of the world's population. This coverage includes widely spoken but under-resourced languages that are often overlooked in the development of existing multilingual LLMs.
Key Contributions
The paper introduces two variants of Babel: Babel-9B and Babel-83B, designed with distinct objectives and capacities. Babel-9B is optimized for efficient inference and fine-tuning, while Babel-83B sets a new benchmark for open multilingual LLM performance, even competing with commercial models.
Layer Extension Technique: Because continued pretraining alone cannot raise a model's capacity ceiling, Babel employs a layer extension technique that systematically adds new layers to the existing model architecture. The expansion preserves the architectural integrity of the original model and specifically targets the latter half of the layer stack, which is less sensitive to structural changes. The authors explored various initialization strategies for the new layers, with Gaussian noise initialization offering the best balance between stability and training adaptability.
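The layer extension idea can be illustrated with a minimal sketch. Here each transformer block is stood in for by a single weight array, and new, Gaussian-noise-initialized layers are interleaved into the latter half of the stack. The function name `extend_model`, the insertion-point choice, and the noise scale are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def extend_model(layers, num_new, noise_std=0.02, rng=None):
    """Sketch of layer extension: insert num_new noise-initialized
    layers into the latter half of a layer stack.

    layers  : list of per-layer weight arrays (stand-ins for blocks)
    num_new : number of layers to add
    """
    rng = rng or np.random.default_rng(0)
    half = len(layers) // 2
    # Pick evenly spaced insertion points in the latter half only,
    # since those layers are less sensitive to structural changes.
    positions = np.linspace(half, len(layers), num_new,
                            endpoint=False).astype(int)
    extended = list(layers)
    for offset, pos in enumerate(sorted(positions)):
        # New layers start from small Gaussian noise (one of the
        # initialization strategies the paper compares).
        new_layer = rng.normal(0.0, noise_std, size=layers[0].shape)
        extended.insert(pos + offset, new_layer)
    return extended
```

In a real model the inserted blocks would then be trained during continued pretraining, while the untouched earlier layers anchor the model's existing knowledge.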
Data Optimization: Recognizing the uneven availability of high-quality data across languages, the authors developed an LLM-based quality classifier to enhance the data-cleaning pipeline, ensuring high-quality input data for under-resourced languages. The model's training data was sourced from a diverse set of corpora, including Wikipedia, news articles, and other curated datasets, further refined through deduplication and normalization processes.
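The deduplication and normalization steps of such a pipeline can be sketched with stdlib tools. This is a simplified illustration, not the paper's actual pipeline: the LLM-based quality classifier is out of scope here, and hash-based exact deduplication stands in for whatever matching the authors used:

```python
import hashlib
import unicodedata

def normalize(text):
    """Unicode-normalize (NFKC) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def deduplicate(docs):
    """Drop exact duplicates after normalization, keeping the first
    occurrence of each document."""
    seen, kept = set(), []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(norm)
    return kept
```

A quality filter, such as the paper's LLM-based classifier, would then score each surviving document before it enters the training corpus.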
Performance Evaluation
The paper provides detailed performance evaluations, comparing Babel models against other leading open and commercial LLMs across multilingual benchmarks such as MMMLU, M3Exam, and Flores-200. Babel-9B emerged as the best-performing open-source model in its parameter range on multilingual reasoning and understanding tasks, surpassing models such as GLM4-9B and Qwen2.5-7B. Similarly, Babel-83B matched or outperformed commercial models such as GPT-4o on several tasks, including XCOPA and XNLI, positioning itself as a leader among open multilingual LLMs.
Practical and Theoretical Implications
Babel's development advances the accessibility and functionality of LLMs for a global audience. The inclusion of major, under-resourced languages in a single model not only addresses current linguistic disparities but also sets a new standard for inclusivity in AI. This capability holds practical significance for real-world applications, such as multilingual communication tools, translation services, and cross-cultural content development.
Theoretically, the research uncovers new insights into model expansion techniques, particularly the efficacy of layer extension in scaling model architectures without compromising performance. This could inform future explorations into increasing model capacity efficiently, balancing parameters with computational resources.
Future Directions
The paper suggests potential future investigations into broadening the scope of languages and scripts supported by Babel. Additionally, enhancing the supervised fine-tuning process with a more diverse dataset pool and exploring advanced alignment strategies could further boost Babel's performance. These advancements are critical as AI continues to integrate more deeply into multilingual and multicultural settings.
Overall, Babel represents a significant contribution to the ongoing development of multilingual LLMs, striving to democratize NLP applications and improve accessibility for previously underserved linguistic communities.