Publications by authors named "Noel Masasi"

The Swahili corpus is a dataset built by collecting written Kiswahili sentences from different sectors that deal with Kiswahili documents. A corpus of the target language is needed in Natural Language Processing (NLP) tasks so that an algorithm can be fitted to that language before a model is trained. The resulting Swahili corpus contains 1,693,228 sentences with 39,639,824 words and 871,452 unique words.
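
The three figures reported above (sentences, words, unique words) are the kind of statistics that can be computed in a single pass over a corpus. The following is a minimal sketch, not the authors' method: the function name `corpus_stats`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration, and a different tokenizer would give different counts.

```python
from collections import Counter


def corpus_stats(sentences):
    """Return (sentence count, word count, unique word count).

    Assumption: tokens are lowercased and split on whitespace.
    Empty sentences are skipped.
    """
    vocab = Counter()
    n_sentences = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            continue  # ignore blank lines
        n_sentences += 1
        vocab.update(tokens)
    # Total words is the sum of all token frequencies;
    # unique words is the number of distinct tokens.
    return n_sentences, sum(vocab.values()), len(vocab)


# Tiny illustrative sample, not taken from the dataset.
sample = ["Habari ya asubuhi", "Asubuhi njema", ""]
print(corpus_stats(sample))  # (2, 5, 4)
```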


Natural Language Processing requires data to be pre-processed to ensure quality models across machine learning tasks. However, the Swahili language has been disadvantaged and is classified as a low-resource language because of inadequate data for NLP, especially the basic textual datasets that are useful during the pre-processing stage. In this article we develop and contribute datasets of common Swahili stop-words, common Swahili slang, and common Swahili typos.
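
As a rough illustration of how stop-word, slang, and typo lists might be used together during pre-processing, here is a minimal sketch. Everything in it is an assumption: the function name `preprocess`, the order of the steps, and the tiny lexicons, whose entries are placeholders chosen for the example and are not drawn from the published datasets.

```python
# Hypothetical miniature lexicons; the real datasets would be loaded from files.
STOP_WORDS = {"na", "ya", "wa"}   # words to drop
SLANG = {"poa": "nzuri"}          # slang -> standard form (illustrative mapping)
TYPOS = {"habri": "habari"}       # misspelling -> correction (illustrative)


def preprocess(sentence):
    """Correct typos, normalize slang, then drop stop-words.

    Assumption: tokens are lowercased and split on whitespace.
    """
    cleaned = []
    for token in sentence.lower().split():
        token = TYPOS.get(token, token)   # fix known misspellings first
        token = SLANG.get(token, token)   # then map slang to a standard form
        if token not in STOP_WORDS:       # finally filter stop-words
            cleaned.append(token)
    return cleaned


print(preprocess("Habri ya leo ni poa"))  # ['habari', 'leo', 'ni', 'nzuri']
```

Correcting typos before the stop-word filter means a misspelled stop-word can still be removed, provided its correction is in the typo list.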
