This data article presents a dataset for Siswati, a Bantu language of the Nguni group that is one of the eleven official South African languages and an official language of Eswatini (together with English). The dataset contains parallel textual data between English and Siswati as well as monolingual data for Siswati, and was developed as training data for machine translation systems, specifically the Autshumato machine translation project. Both corpora can also be used for the development and evaluation of Natural Language Processing (NLP) core technologies for Siswati. In addition, the data lends itself to corpus-linguistic studies. The article describes how the data was collected, what types of texts it contains, and what clean-up was performed. It also provides an overview of the number of words contained in the datasets.
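As an aside for readers who want to work with such data: parallel corpora for machine translation are commonly distributed as a pair of sentence-aligned plain-text files, one sentence per line, with line N of the source file matching line N of the target file. The sketch below assumes that layout; the file names and format are hypothetical illustrations, not taken from the Autshumato release itself.

```python
# Minimal sketch: read a sentence-aligned English-Siswati parallel corpus.
# Assumes two UTF-8 plain-text files with one sentence per line, kept in step.
# File names ("train.en", "train.ss") are hypothetical.

def load_parallel(en_path, ss_path):
    """Return a list of (english, siswati) sentence pairs, skipping blanks."""
    with open(en_path, encoding="utf-8") as f_en, \
         open(ss_path, encoding="utf-8") as f_ss:
        pairs = [(en.strip(), ss.strip())
                 for en, ss in zip(f_en, f_ss)
                 if en.strip() and ss.strip()]
    return pairs

if __name__ == "__main__":
    pairs = load_parallel("train.en", "train.ss")
    print(f"Loaded {len(pairs)} sentence pairs")
```

Note that `zip` silently stops at the shorter file, so a real pipeline would also want to verify that both files have the same number of lines before training on them.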


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11010775
DOI: http://dx.doi.org/10.1016/j.dib.2024.110325


Similar Publications

How to raise donations effectively, especially in the E-era, has puzzled fundraisers and scientists across various disciplines. Our research focuses on donation-based crowdfunding projects and investigates how the emotional valence expressed verbally (in textual descriptions) and visually (in facial images) in project descriptions affects project performance. Study 1 uses field data (N = 3817), collects project information and descriptions from a top donation-based crowdfunding platform, computes visual and verbal emotional valence using a deep-learning-based affective computing method, and analyses how multimodal emotional valence influences donation outcomes.


Chatbot-based multimodal AI holds promise for collecting medical histories and diagnosing ophthalmic diseases using textual and imaging data. This study developed and evaluated the ChatGPT-powered Intelligent Ophthalmic Multimodal Interactive Diagnostic System (IOMIDS) to enable patient self-diagnosis and self-triage. IOMIDS included a text model and three multimodal models (text + slit-lamp, text + smartphone, text + slit-lamp + smartphone).


Background: The application of natural language processing in medicine has increased significantly, including tasks such as information extraction and classification. Natural language processing plays a crucial role in structuring free-form radiology reports, facilitating the interpretation of textual content, and enhancing data utility through clustering techniques. Clustering allows for the identification of similar lesions and disease patterns across a broad dataset, making it useful for aggregating information and discovering new insights in medical imaging.


Time-Series Image-Based Automated Monitoring Framework for Visible Facilities: Focusing on Installation and Retention Period.

Sensors (Basel)

January 2025

Department of Architectural Engineering, Dankook University, 152 Jukjeon-ro, Yongin-si 16890, Republic of Korea.

In the construction industry, ensuring the proper installation, retention, and dismantling of temporary structures, such as jack supports, is critical to maintaining safety and project timelines. However, inconsistencies between on-site data and construction documentation remain a significant challenge. To address this, the study proposes an integrated monitoring framework that combines computer vision-based object detection and document recognition techniques.


An Investigation of the Domain Gap in CLIP-Based Person Re-Identification.

Sensors (Basel)

January 2025

Department of Informatics-Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy.

Person re-identification (re-id) is a critical computer vision task aimed at identifying individuals across multiple non-overlapping cameras, with wide-ranging applications in intelligent surveillance systems. Despite recent advances, the domain gap (performance degradation when models encounter unseen datasets) remains a critical challenge. CLIP-based models, leveraging multimodal pre-training, offer potential for mitigating this issue by aligning visual and textual representations.

