“More than 2,000 languages spoken in Africa are being neglected in the artificial intelligence (AI) era. For example, ChatGPT recognizes only 10–20% of sentences written in Hausa, a language spoken by 94 million people in Nigeria. These languages are under-represented in large language models (LLMs) because of a lack of training data. But researchers across Africa are changing that.
Language specialists have recorded 9,000 hours of people speaking different African languages and transformed the recordings into digitized language data sets. The researchers, who are part of a project called African Next Voices, released the first tranche of data this month from what is the largest AI-ready language-data-creation initiative for multiple African languages.
The data will be open access and available for developers to incorporate into LLMs, such as those that convert speech into text or provide automatic language translation.”
From Nature.