Analyzing the effect of data preprocessing techniques using machine learning algorithms on the diagnosis of COVID-19.

Gizemnur Erol, Betül Uzbaş, Cüneyt Yücelbaş, Şule Yücelbaş

Concurr Comput

Tarsus University Computer Engineering Department Mersin Turkey.

Published: December 2022

Real-time polymerase chain reaction (RT-PCR), commonly known as the swab test, is a laboratory test that diagnoses COVID-19 from respiratory samples. Because the coronavirus spread rapidly around the world, the RT-PCR test alone could not deliver results fast enough. This created a need for complementary diagnostic methods, and machine learning studies began to address it. Medical data, however, is difficult to work with because it is often inconsistent, incomplete, hard to scale, and very large. In addition, poor clinical decisions, irrelevant parameters, and limited medical data reduce the accuracy of such studies. Since datasets containing COVID-19 blood parameters are scarcer than other medical datasets, this study aims to improve the existing ones. To obtain more consistent results in COVID-19 machine learning studies, it investigates the effect of data preprocessing techniques on the classification of COVID-19 data. First, categorical feature encoding and feature scaling were applied to a dataset with 15 features containing blood data of 279 patients, including gender and age. Missing values were then imputed using both the K-nearest neighbor (KNN) algorithm and multiple imputation by chained equations (MICE). The classes were balanced with the synthetic minority oversampling technique (SMOTE). The effect of these preprocessing techniques was analyzed on the ensemble learning algorithms bagging, AdaBoost, and random forest, and on the popular classifiers KNN, support vector machine, logistic regression, artificial neural network, and decision tree. With SMOTE, the highest accuracies obtained with the bagging classifier were 83.42% and 83.74% for KNN and MICE imputation, respectively. Without SMOTE, the highest accuracy reached with the same classifier was 83.91%, for KNN imputation. In conclusion, the study compares selected data preprocessing techniques, presents their effect on classification success, and demonstrates experimentally the importance of choosing the right combination of preprocessing steps.
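The imputation and balancing steps named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation, and the abstract does not say which software was used. The function names `knn_impute` and `smote`, the toy values, and the choice of `k` are assumptions made only for illustration. MICE and the classifiers are not shown.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is computed only over features observed in both rows, which is
    a simplification of what library KNN imputers do.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        # Rows with no shared observed feature count as infinitely far apart.
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:  # leave the gap if no donor exists
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.

    Assumes at least two minority samples and no missing values.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy data (hypothetical values, already scaled to [0, 1]).
# None marks a missing entry; label 1 is the minority class.
X = [
    [0.10, 0.20, 0.30],
    [0.15, None, 0.35],
    [0.80, 0.90, 0.70],
    [0.85, 0.95, None],
    [0.12, 0.22, 0.32],
]
y = [0, 0, 1, 1, 0]

# Impute first, then oversample the minority class up to the majority count.
X_filled = knn_impute(X, k=2)
minority = [row for row, label in zip(X_filled, y) if label == 1]
majority_count = sum(1 for label in y if label == 0)
X_new = smote(minority, n_new=majority_count - len(minority), k=1)
X_balanced = X_filled + X_new
y_balanced = y + [1] * len(X_new)
```

In a real experiment, oversampling would normally be applied to the training split only, so that synthetic samples do not leak into the evaluation data. The abstract does not state how the study handled this.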

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9874401	PMC
http://dx.doi.org/10.1002/cpe.7393	DOI Listing

Similar Publications

Large annotated ultrasound dataset of non-alcoholic fatty liver from Saudi hospitals for analysis and applications.

Data Brief

February 2025

College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.

Fahad Alshagathrh, Mahmood Alzubaidi, Khalid Alswat, Ali Aldhebaib, Bushra Alahmadi

This study presents a comprehensive ultrasound image dataset for Non-Alcoholic Fatty Liver Disease (NAFLD), addressing the critical need for standardized resources in AI-assisted diagnosis. The dataset comprises 10,352 high-resolution ultrasound images from 384 patients collected at King Saud University Medical City and National Guard Health Affairs in Saudi Arabia. Each image is meticulously annotated with NAFLD Activity Score (NAS), fibrosis staging, and steatosis grading based on corresponding liver biopsy results.

Improved outcome prediction in acute pancreatitis with generated data and advanced machine learning algorithms.

Turk J Emerg Med

January 2025

Department of Emergency Medicine, Faculty of Medicine, Hacettepe University, Ankara, Türkiye.

Murat Özdede, Ali Batur, Alp Eren Aksoy

Objectives: Traditional scoring systems have been widely used to predict acute pancreatitis (AP) severity but have limitations in predictive accuracy. This study investigates the use of machine learning (ML) algorithms to improve predictive accuracy in AP.

Methods: A retrospective study was conducted using data from 101 AP patients in a tertiary hospital in Türkiye.

Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm.

Front Artif Intell

January 2025

Department of Rehabilitation Medicine, The First Affiliated Hospital of Shenzhen University, The Second People's Hospital of Shenzhen, Shenzhen, Guangdong, China.

Xuehui Fan, Ruixue Ye, Yan Gao, Kaiwen Xue, Zeyu Zhang

Background: The Department of Rehabilitation Medicine is key to improving patients' quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models.

SAMURAI: shallow analysis of copy number alterations using a reproducible and integrated bioinformatics pipeline.

Brief Bioinform

November 2024

Department of Biology, University of Padova, Via U. Bassi 58/B, 35131, Italy.

Sara Potente, Diego Boscarino, Dino Paladin, Sergio Marchini, Luca Beltrame

Shallow whole-genome sequencing (sWGS) offers a cost-effective approach to detect copy number alterations (CNAs). However, there remains a gap for a standardized workflow specifically designed for sWGS analysis. To address this need, in this work we present SAMURAI, a bioinformatics pipeline specifically designed for analyzing CNAs from sWGS data in a standardized and reproducible manner.

Protocol to boost the robustness and accuracy of spatial transcriptomics algorithms using ensemble techniques.

STAR Protoc

January 2025

Department of Statistics, University of Georgia, 310 Herty Drive, Athens, GA 30602, USA.

Jiazhang Cai, Shushan Wu, Huimin Cheng, Wenxuan Zhong, Guo-Cheng Yuan

Spatial transcriptomics enhances our understanding of cellular organization by mapping gene expression data to precise tissue locations. Here, we present a protocol for using the weighted ensemble method for spatial transcriptomics (WEST), which uses ensemble techniques to boost the robustness and accuracy of existing algorithms. We describe steps for preprocessing data, obtaining embeddings from individual algorithms, and integrating all embeddings as an ensemble similarity matrix.
