Large Language Models and Arabic Content: A Review

Published 12 May 2025 in cs.CL and cs.AI | (2505.08004v1)

Abstract: Over the past three years, the rapid advancement of LLMs has had a profound impact on multiple areas of AI, particularly NLP across diverse languages, including Arabic. Although Arabic is among the most widely spoken languages, used across 27 countries in the Arab world and as a second language in several non-Arab countries, Arabic resources, datasets, and tools remain scarce. Arabic NLP tasks face various challenges due to the complexities of the language, including its rich morphology, intricate structure, and diverse writing standards, among other factors. Researchers have been actively addressing these challenges, demonstrating that LLMs pre-trained on multilingual corpora achieve significant success in various Arabic NLP tasks. This study provides an overview of the use of LLMs for the Arabic language, highlighting early pre-trained Arabic LLMs across various NLP applications and their ability to handle diverse Arabic content tasks and dialects. It also describes how techniques such as fine-tuning and prompt engineering can enhance the performance of these models. Additionally, the study summarizes common Arabic benchmarks and datasets and presents our observations on the persistent upward trend in the adoption of LLMs.