
AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text

Published 24 Mar 2025 in cs.CL (arXiv:2503.18247v3)

Abstract: Language models built from various sources are the foundation of today's NLP progress. However, for many low-resource languages, the available data is limited in domain diversity and often biased toward the religious domain, which hurts performance on distant and rapidly evolving domains such as social media. Domain adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) are popular techniques to reduce this bias through continual pre-training of BERT-based models, but they have not been explored for African multilingual encoders. In this paper, we explore DAPT and TAPT continual pre-training approaches for the African-language social media domain. We introduce AfriSocial, a large-scale social media and news domain corpus for continual pre-training on several African languages. Leveraging AfriSocial, we show that DAPT consistently improves performance (by 1% to 30% F1 score) on three subjective tasks: sentiment analysis, multi-label emotion classification, and hate speech classification, covering 19 languages. Similarly, leveraging TAPT on the data from one task enhances performance on other related tasks. For example, training with unlabeled sentiment data (source) for a fine-grained emotion classification task (target) improves the baseline results by an F1 score ranging from 0.55% to 15.11%. Combining the two methods (i.e., DAPT + TAPT) further improves overall performance. The data and model resources are available at HuggingFace.

Summary

  • The paper introduces AfroXLMR-Social by adapting AfroXLMR with domain adaptive pre-training (DAPT) and task adaptive pre-training (TAPT) for social media text.
  • It demonstrates consistent Macro F1 gains: DAPT improves sentiment, emotion, and hate speech tasks by 1%-30%, and cross-task TAPT with unlabeled sentiment data improves emotion classification by 0.55%-15.11%, using the curated AfriSocial corpus.
  • The study underscores the importance of localized, culturally specific data in enhancing NLP applications for underrepresented African languages.


Introduction

The paper "AfroXLMR-Social: Adapting Pre-trained Language Models for African Languages Social Media Text" (2503.18247) investigates the adaptation of pre-trained language models (PLMs) to African languages in the social media domain. While PLMs like BERT and XLM-RoBERTa have achieved substantial success across a variety of NLP tasks, their training corpora often lack domain-diverse content for African languages. This paper addresses that gap by introducing AfroXLMR-Social through domain adaptive pre-training (DAPT) and task adaptive pre-training (TAPT), leveraging a curated social media and news corpus named AfriSocial.

Methodology

This study employs continual pre-training strategies on the AfroXLMR model, which is originally based on XLM-RoBERTa and tailored for African languages. The adaptation process involves:

  1. Domain Adaptive Pre-training (DAPT): This phase uses the AfriSocial corpus—comprising social media text and news articles—to enhance the domain relevance of the model. A general-purpose language model is continually pre-trained on domain-specific unlabeled text to improve performance in social media contexts.
  2. Task Adaptive Pre-training (TAPT): In this phase, the model undergoes further pre-training on unlabeled task data. TAPT aims to improve task performance by exposing the model to the same type of data it will later be fine-tuned on.

    Figure 1: Continual pre-training illustration. A general-purpose pretrained model, such as AfroXLMR, is first adapted to a social domain, resulting in AfroXLMR-Social.

The evaluation focuses on three subjective NLP tasks: sentiment analysis, emotion analysis, and hate speech classification, covering 19 African languages. Each task benefits differently from the adaptation techniques, with the combined DAPT + TAPT approach yielding the best overall Macro F1.
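Both DAPT and TAPT optimize the same masked-language-modelling objective; only the continual pre-training corpus differs (AfriSocial for DAPT, unlabeled task data for TAPT). As an illustration, here is a minimal sketch of the BERT-style 80/10/10 token masking that underlies that objective. The token ids below are placeholders, not the paper's actual setup; a real run would take them from the XLM-R tokenizer and use a library collator such as Hugging Face's `DataCollatorForLanguageModeling`.

```python
import random

# Placeholder ids for illustration only; real values come from the
# tokenizer's vocabulary (e.g. XLM-R's SentencePiece model).
MASK_ID = 250001
VOCAB_SIZE = 250002

def mask_for_mlm(token_ids, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~mask_prob of positions; replace 80%
    of them with [MASK], 10% with a random token, and keep 10% unchanged.
    Returns (masked inputs, labels), with labels set to -100 at
    unselected positions so the cross-entropy loss ignores them."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)          # predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID     # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10%: keep the original token (still predicted)
        else:
            labels.append(-100)         # ignored by the loss
    return inputs, labels
```

During continual pre-training, batches of AfriSocial (or unlabeled task) text are masked this way and the encoder is trained to recover the original tokens, which is what shifts the model toward social media language.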

Results and Analysis

AfroXLMR-Social's performance on the evaluation tasks demonstrates the effectiveness of tailoring pre-trained models via domain-specific continual pre-training. The study reports significant improvements in Macro F1 scores:

  1. Sentiment Analysis (AfriSenti): DAPT yields F1 score improvements between 1% and 30%, and TAPT further refines these gains through cross-task adaptation.
  2. Emotion Analysis (AfriEmo): TAPT with unlabeled sentiment data improves emotion classification by 0.55% to 15.11% Macro F1 over the baseline, showing that adaptation transfers across closely related tasks.
  3. Hate Speech Detection (AfriHate): Fine-tuning after DAPT and TAPT yields better classification performance, largely owing to improved contextual understanding from domain and task adaptation.

    Figure 2: AfriSenti Macro F1 results from AfroXLMR-Social and LLMs.

    Figure 3

    Figure 3: AfriEmo Macro F1 results from AfroXLMR-Social and LLMs.

    Figure 4

    Figure 4: AfriHate Macro F1 results from AfroXLMR-Social and LLMs.
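All three benchmarks are scored with macro-averaged F1, which weights every class equally and therefore rewards improvements on minority classes such as rare emotions or hate speech labels. For reference, a minimal dependency-free implementation of the metric (equivalent to scikit-learn's `f1_score` with `average="macro"`):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class precision/recall/F1
    independently, then average with equal weight per class."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # missed the true class t
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally to the average, a model that improves on a low-frequency label moves the Macro F1 just as much as one that improves on the dominant label, which is why the metric suits these imbalanced subjective tasks.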

Implications and Future Directions

The AfroXLMR-Social framework highlights the necessity of domain-specific adaptation in PLMs, especially for languages with limited resources. Practically, this approach facilitates the development of more accurate and culturally sensitive NLP applications for African languages, which are often underrepresented in mainstream NLP resources.

Looking forward, these findings suggest several avenues for further research:

  • Broader Domain Exploration: Beyond social media, other domains could further benefit from similar adaptive pre-training methodologies, introducing more linguistic diversity into PLMs.
  • Extensive Cross-task Evaluation: Future studies could probe deeper into cross-task adaptability impacts across various domains and tasks, refining the balance between computational expense and performance gain.
  • Large-scale LLM Integration: The integration of such adaptation techniques with large-scale LLMs could potentially close performance gaps between LLMs and domain-specific fine-tuned models.

Conclusion

AfroXLMR-Social demonstrates the value of domain and task adaptive pre-training for enhancing PLM efficacy in African languages. This research underscores the critical role of localized, domain-specific data in achieving robust performance across linguistically diverse and culturally specific contexts. The open release of the AfriSocial corpus and the AfroXLMR-Social models is a significant contribution to the NLP community, extending well-adapted language models to a broader spectrum of languages and dialects.
