AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

Published 14 Jan 2025 in cs.CL | (2501.08284v2)

Abstract: Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate


Summary

The paper introduces AfriHate, a multilingual dataset that targets hate speech and abusive language detection across 15 African languages.
It employs a rigorous data collection and annotation process using native speakers and social media inputs to capture cultural nuances.
Baseline evaluations show strong results from AfroXLMR-76L and, in few-shot settings, from GPT-4o, which points to the dataset's value for building better moderation tools.

The paper "AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages" addresses the shortage of multilingual datasets for hate speech and abusive language detection in African languages. It presents AfriHate, a collection curated to support the identification and moderation of hate speech in 15 African languages: Algerian Arabic, Amharic, Igbo, Kinyarwanda, Hausa, Moroccan Arabic, Nigerian Pidgin, Oromo, Somali, Swahili, Tigrinya, Twi, isiXhosa, Yoruba, and isiZulu.

Data Collection & Annotation Process

The authors describe how the datasets were collected and annotated. A notable feature of the study is that every dataset is annotated by native speakers. Their cultural and contextual knowledge supports the accuracy of the three labels: hate, abusive/offensive, or neutral. Data were gathered from social media, primarily through the Twitter API, using keywords, hashtags, user handles, and locations. The authors also discuss the challenges of building datasets for under-resourced African languages.
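Because the individual annotations are released alongside the datasets, a reader may want to derive a single label per post. The sketch below shows one common way to do that, a strict majority vote. The vote rule, the function name `majority_label`, the post identifiers, and the label strings are all assumptions made for illustration; the summary does not state how the authors resolved disagreements.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators.

    Returns None when there are no votes or no label has more than half
    of them. This rule is an illustrative assumption, not the paper's
    documented procedure.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical annotations: three annotators per post.
annotations = {
    "post_1": ["hate", "hate", "abusive"],
    "post_2": ["neutral", "abusive", "hate"],
}

resolved = {post_id: majority_label(votes) for post_id, votes in annotations.items()}
print(resolved)
```

Posts that resolve to `None` have no majority label and would need another pass, for example adjudication by an additional native speaker.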

Baseline Models & Results

The study evaluates a suite of Africa-centric pre-trained language models (PLMs) on the curated datasets and compares them with LLMs in both zero-shot and few-shot settings. AfroXLMR-76L proves a robust model, with an average macro F1 score of approximately 78.16 in the multilingual setting. Among the LLMs, GPT-4o improves markedly in few-shot settings, reaching an F1 score of 71.71 with 20 shots and adapting better than the other LLMs evaluated to multilingual hate speech detection in these languages.
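The macro F1 metric quoted above is the unweighted mean of the per-class F1 scores, so the rarer hate class counts as much as the neutral class. The following is a minimal sketch of that computation in plain Python. The label strings and the toy gold and predicted lists are assumptions for illustration and are not taken from the datasets.

```python
from collections import Counter

# Assumed label names for the three classes described in the summary.
LABELS = ("hate", "abusive", "neutral")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores, in the range 0 to 1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy example with made-up labels.
gold = ["hate", "abusive", "neutral", "neutral", "hate", "abusive"]
pred = ["hate", "neutral", "neutral", "neutral", "abusive", "abusive"]
score = macro_f1(gold, pred)
print(f"macro F1: {100 * score:.2f}")
```

The summary reports scores on a 0 to 100 scale, which is why the sketch multiplies by 100 when printing.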

Implications and Future Work

AfriHate is an important step toward closing the data gap for hate speech and abusive language moderation in African languages. The dataset is a resource for developing and testing LLMs that account for the socio-cultural nuances of these languages. Beyond practical applications, it also provides a basis for studying algorithmic fairness and bias in multilingual NLP.

Future work, as suggested by the authors, is likely to focus on improving the balance of language representation within the dataset and on adding further low-resource African languages. Cross-lingual transfer learning and model generalization are other promising directions. The paper encourages continued collaboration with African communities so that dataset development stays aligned with cultural sensitivities and ethical standards in AI research.

Conclusion

The AfriHate collection is a substantial contribution to multilingual NLP resources, enabling more nuanced analysis and moderation of hate speech across diverse cultural and linguistic settings within Africa. Through detailed annotation and model evaluation, the paper sets a benchmark for hate speech detection and shows the need for localized, culturally sensitive datasets to address region-specific challenges in online content moderation.
