This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and an official language of Eswatini (together with English). The dataset contains parallel textual data between English and Siswati as well as monolingual data for Siswati, and was developed as training data for machine translation systems, specifically the Autshumato machine translation project. Both corpora can also be used for the development and evaluation of Natural Language Processing (NLP) core technologies for Siswati. In addition, the data lends itself to corpus-linguistic studies. The article describes how the data was collected, what types of texts it contains, and what clean-up was performed. It also provides an overview of the number of words contained in the datasets.
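As an aside for readers who want to work with such data: parallel corpora for machine translation are commonly distributed as a pair of sentence-aligned plain-text files, one sentence per line, with line N of the source file matching line N of the target file. The sketch below assumes that layout; the file names and format are hypothetical illustrations, not taken from the Autshumato release itself.

```python
# Minimal sketch: read a sentence-aligned English-Siswati parallel corpus.
# Assumes two UTF-8 plain-text files with one sentence per line, kept in step.
# File names ("train.en", "train.ss") are hypothetical.

def load_parallel(en_path, ss_path):
    """Return a list of (english, siswati) sentence pairs, skipping blanks."""
    with open(en_path, encoding="utf-8") as f_en, \
         open(ss_path, encoding="utf-8") as f_ss:
        pairs = [(en.strip(), ss.strip())
                 for en, ss in zip(f_en, f_ss)
                 if en.strip() and ss.strip()]
    return pairs

if __name__ == "__main__":
    pairs = load_parallel("train.en", "train.ss")
    print(f"Loaded {len(pairs)} sentence pairs")
```

Note that `zip` silently stops at the shorter file, so a real pipeline would also want to verify that both files have the same number of lines before training on them.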


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11010775
DOI: http://dx.doi.org/10.1016/j.dib.2024.110325


Similar Publications

How to raise donations effectively, especially in the E-era, has puzzled fundraisers and scientists across various disciplines. Our research focuses on donation-based crowdfunding projects and investigates how the emotional valence expressed verbally (in textual descriptions) and visually (in facial images) in project descriptions affects project performance. Study 1 uses field data (N = 3817), collects project information and descriptions from a top donation-based crowdfunding platform, computes visual and verbal emotional valence using a deep-learning-based affective computing method, and analyses how multimodal emotional valence influences donation outcomes.


Chatbot-based multimodal AI holds promise for collecting medical histories and diagnosing ophthalmic diseases using textual and imaging data. This study developed and evaluated the ChatGPT-powered Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) to enable patient self-diagnosis and self-triage. IOMIDS included a text model and three multimodal models (text + slit-lamp, text + smartphone, text + slit-lamp + smartphone).


Background: The application of natural language processing in medicine has increased significantly, including tasks such as information extraction and classification. Natural language processing plays a crucial role in structuring free-form radiology reports, facilitating the interpretation of textual content, and enhancing data utility through clustering techniques. Clustering allows for the identification of similar lesions and disease patterns across a broad dataset, making it useful for aggregating information and discovering new insights in medical imaging.


Time-Series Image-Based Automated Monitoring Framework for Visible Facilities: Focusing on Installation and Retention Period.

Sensors (Basel)

January 2025

Department of Architectural Engineering, Dankook University, 152 Jukjeon-ro, Yongin-si 16890, Republic of Korea.

In the construction industry, ensuring the proper installation, retention, and dismantling of temporary structures, such as jack supports, is critical to maintaining safety and project timelines. However, inconsistencies between on-site data and construction documentation remain a significant challenge. To address this, the study proposes an integrated monitoring framework that combines computer vision-based object detection and document recognition techniques.


An Investigation of the Domain Gap in CLIP-Based Person Re-Identification.

Sensors (Basel)

January 2025

Department of Informatics-Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy.

Person re-identification (re-id) is a critical computer vision task aimed at identifying individuals across multiple non-overlapping cameras, with wide-ranging applications in intelligent surveillance systems. Despite recent advances, the domain gap (performance degradation when models encounter unseen datasets) remains a critical challenge. CLIP-based models, leveraging multimodal pre-training, offer potential for mitigating this issue by aligning visual and textual representations.

