This data article describes a machine translation training data set for translation between English and Tshivenḓa. The data set contains parallel, aligned English-Tshivenḓa data as well as monolingual Tshivenḓa data. The data was collected both by crawling multilingual South African government websites and from matched documents supplied by translators or publishing sources.
This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and the official language of Eswatini (together with English). The dataset contains parallel textual data between English and Siswati as well as monolingual data for Siswati and was developed for use as training data for machine translation systems, specifically the Autshumato machine translation project. Both corpora can also be used for development and evaluation of Natural Language Processing (NLP) core technologies for Siswati.
This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel for all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. In addition, it can be used for corpus linguistic studies.