Publications by Noel Masasi

In the heart of Swahili: An exploration of data collection methods and corpus curation for natural language processing.

Data Brief

August 2024

The Swahili corpus is a dataset generated by collecting written Kiswahili sentences from different sectors that deal with Kiswahili documents. A corpus of the target language is needed in Natural Language Processing (NLP) tasks so that an algorithm can be fitted to that language before the model is trained. The resulting Swahili corpus contains 1,693,228 sentences with 39,639,824 words and 871,452 unique words.
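
The three figures quoted above (sentences, words, unique words) are standard corpus statistics. A minimal sketch of how such counts could be computed is given below; it assumes one sentence per string, lowercases the text, and treats any run of word characters as a word. The function name `corpus_stats` and the sample sentences are illustrative only, and the article's own tokenization rules may differ.

```python
import re


def corpus_stats(sentences):
    """Return (sentence count, word count, unique word count).

    Assumption: each item in `sentences` is one sentence, and a word is
    any run of word characters after lowercasing.
    """
    n_sentences = 0
    n_words = 0
    vocabulary = set()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        if not tokens:
            continue  # skip empty or punctuation-only lines
        n_sentences += 1
        n_words += len(tokens)
        vocabulary.update(tokens)
    return n_sentences, n_words, len(vocabulary)


# Illustrative sample, not taken from the published corpus.
sample = ["Habari ya asubuhi.", "Habari ya jioni."]
print(corpus_stats(sample))
```

Collecting the vocabulary in a set keeps the unique-word count a single pass over the data, which matters for a corpus of this size.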

Enhancing text pre-processing for Swahili language: Datasets for common Swahili stop-words, slangs and typos with equivalent proper words.

Bernard Masua, Noel Masasi

Data Brief

December 2020

Natural Language Processing requires data to be pre-processed to guarantee quality models in different machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially the basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang, and common Swahili typos.
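
As a rough illustration of how datasets of this kind are typically used during pre-processing, the sketch below replaces slang and typos with their proper equivalents and then drops stop-words. The entries in `STOP_WORDS` and `REPLACEMENTS`, and the function name `preprocess`, are hypothetical stand-ins chosen for the example; they are not drawn from the published datasets, whose file format is not described here.

```python
import re

# Hypothetical stand-ins for the stop-word dataset.
STOP_WORDS = {"na", "ya", "kwa", "ni"}

# Hypothetical stand-ins for the slang and typo datasets,
# mapping each informal or misspelled form to a proper word.
REPLACEMENTS = {
    "asnte": "asante",    # typo -> proper word
    "mshkaji": "rafiki",  # slang -> proper word
}


def preprocess(text):
    """Lowercase, tokenize, normalize slang and typos, then remove stop-words."""
    tokens = re.findall(r"\w+", text.lower())
    normalized = [REPLACEMENTS.get(token, token) for token in tokens]
    return [token for token in normalized if token not in STOP_WORDS]


print(preprocess("Asnte kwa msaada, mshkaji"))
```

Normalizing before filtering is a deliberate choice in this sketch: a typo or slang form that maps to a stop-word would otherwise survive the filter.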