Kuwain 1.5B: An Arabic Small LLM Approach via Language Injection
The paper "Kuwain 1.5B: An Arabic SLM via Language Injection" introduces a method for language integration into LLMs with minimal data and computational resources. The Kuwain model, with 1.5 billion parameters, exemplifies the application of this approach, presenting itself as a competitive Arabic-centric LLM that embeds Arabic language capabilities into an existing LLM primarily trained on English. This methodology significantly enhances the model's abilities in Arabic, achieving commendable results while maintaining English performance.
Overview of Kuwain 1.5B Integration Methodology
The primary advancement of this research is an effective, efficient way to extend a monolingual LLM to a new language, here Arabic, while preserving its original strengths. Rather than retraining the full model, Kuwain injects new transformer layers into the existing architecture and trains only those layers. The frozen original layers retain the model's English knowledge while the injected layers acquire Arabic, keeping training costs low. The sketch below illustrates the idea.
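A minimal PyTorch sketch of the layer-injection pattern follows. It is not the authors' code: the block definition, layer counts, and insertion interval are illustrative assumptions, chosen only to show how frozen pretrained blocks can be interleaved with fresh trainable ones.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block (attention and MLP elided for brevity)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.ff(self.norm(x))  # residual connection

def inject_layers(pretrained: nn.ModuleList, every: int, d_model: int) -> nn.ModuleList:
    """Freeze the pretrained blocks and interleave a fresh trainable block
    after every `every` of them."""
    for p in pretrained.parameters():
        p.requires_grad = False  # preserve the original (English) knowledge
    stack = []
    for i, block in enumerate(pretrained, start=1):
        stack.append(block)
        if i % every == 0:
            stack.append(Block(d_model))  # new layer, trainable by default
    return nn.ModuleList(stack)

d_model = 64
original = nn.ModuleList(Block(d_model) for _ in range(8))
model = inject_layers(original, every=4, d_model=d_model)

# A forward pass runs through frozen and injected blocks alike.
x = torch.randn(2, 10, d_model)
for block in model:
    x = block(x)

# Only the injected layers' parameters reach the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"{len(model)} blocks, {sum(p.numel() for p in trainable)} trainable params")
```

Because the optimizer only ever sees the injected layers' parameters, gradient computation and optimizer state cover a small fraction of the model, which is where the cost savings come from.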
Key results include an average 8% improvement across several Arabic benchmarks, underlining the method's efficacy. Just as importantly, Kuwain retains its original English capabilities, even gaining about 1% on English tasks. This makes the approach especially relevant in resource-constrained settings where compute and budget are the limiting factors.
Implications and Future Directions
The research introduces a new dimension to multilingual model development that can be extended to additional languages. By training only the newly added layers and expanding the tokenizer's vocabulary (see the sketch below), Kuwain reportedly cuts training costs by 70% relative to conventional full retraining, without sacrificing multilingual capability.
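Vocabulary expansion pairs naturally with layer injection: new subword tokens give Arabic text a compact encoding, and the corresponding new embedding rows are trained along with the injected layers. Below is a sketch using the Hugging Face transformers API; the checkpoint name and the token list are illustrative assumptions, not the paper's actual vocabulary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base checkpoint: the paper builds on TinyLlama, but the exact
# revision used is not reproduced here.
checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Illustrative sample of new Arabic subword tokens (not the paper's vocabulary).
arabic_tokens = ["السلام", "عليكم", "اللغة", "العربية"]
num_added = tokenizer.add_tokens(arabic_tokens)

# Grow the embedding matrix (and the tied output head) to cover the new ids;
# the freshly initialized rows are trained alongside the injected layers.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```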
Theoretically, the method suggests a paradigm for adaptive model training in which models grow linguistically broader while respecting computational constraints. Practically, it may significantly lower the barrier to building capable multilingual models, particularly for underrepresented languages.
Given its results, this study lays the groundwork for cost-effectively extending a variety of LLMs beyond the TinyLlama base model analyzed here. That positions the Kuwain recipe as an attractive option for teams that want to broaden a model's linguistic competence sustainably and efficiently.
Conclusion
The findings suggest that combining vocabulary expansion, targeted training of newly added layers, and careful preservation of existing knowledge can yield a robust multilingual model. Kuwain 1.5B performs competitively on the evaluated benchmarks and serves as a template for building more accessible AI systems that support diverse languages without exponential resource demands. The implications for future LLM development are promising, and they encourage further evaluation across a broader range of languages and base models.