Papers
Topics
Authors
Recent
Search
2000 character limit reached

Building low-resource African language corpora: A case study of Kidawida, Kalenjin and Dholuo

Published 19 Jan 2025 in cs.CL | (2501.11003v1)

Abstract: Natural Language Processing is a crucial frontier in artificial intelligence, with broad applications in many areas, including public health, agriculture, education, and commerce. However, due to the lack of substantial linguistic resources, many African languages remain underrepresented in this digital transformation. This paper presents a case study on the development of linguistic corpora for three under-resourced Kenyan languages, Kidaw'ida, Kalenjin, and Dholuo, with the aim of advancing natural language processing and linguistic research in African communities. Our project, which lasted one year, employed a selective crowd-sourcing methodology to collect text and speech data from native speakers of these languages. Data collection involved (1) recording conversations and translation of the resulting text into Kiswahili, thereby creating parallel corpora, and (2) reading and recording written texts to generate speech corpora. We made these resources freely accessible via open-research platforms, namely Zenodo for the parallel text corpora and Mozilla Common Voice for the speech datasets, thus facilitating ongoing contributions and access for developers to train models and develop Natural Language Processing applications. The project demonstrates how grassroots efforts in corpus building can support the inclusion of African languages in artificial intelligence innovations. In addition to filling resource gaps, these corpora are vital in promoting linguistic diversity and empowering local communities by enabling Natural Language Processing applications tailored to their needs. As African countries like Kenya increasingly embrace digital transformation, developing indigenous language resources becomes essential for inclusive growth. We encourage continued collaboration from native speakers and developers to expand and utilize these corpora.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.