Background: Classification of binary data arises naturally in many clinical applications, such as patient risk stratification through ICD codes. One of the key practical challenges in data classification using machine learning is to avoid overfitting. Overfitting in supervised learning primarily occurs when a model learns random variations from noisy labels in training data rather than the underlying patterns. While traditional methods such as regularization and early stopping have demonstrated effectiveness in interpolation tasks, addressing overfitting in the classification of binary data, in which predictions always amount to extrapolation, demands extrapolation-enhanced strategies. One such approach is hybrid mechanistic/data-driven modeling, which integrates prior knowledge on input features into the learning process, enhancing the model's ability to extrapolate.
Results: We present NoiseCut, a Python package for noise-tolerant classification of binary data that employs a hybrid modeling approach leveraging solutions of defined max-cut problems. In a comparative analysis conducted on synthetically generated binary datasets, NoiseCut prevents overfitting more effectively than the early stopping technique as employed by different supervised machine learning algorithms. The noise tolerance of NoiseCut stems from a dropout strategy that leverages prior knowledge of input features and is further enhanced by the integration of max-cut problems into the learning process.
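For readers unfamiliar with the term, a max-cut problem asks for a split of a graph's nodes into two groups such that the total weight of edges running between the groups is as large as possible. The sketch below is a generic brute-force illustration on a four-node toy graph. It is not NoiseCut's formulation, solver, or API: the function name `max_cut_brute_force` and the example edge weights are hypothetical and chosen only for illustration.

```python
from itertools import product


def max_cut_brute_force(n_nodes, weighted_edges):
    """Exhaustively search all two-way node partitions of a small graph.

    weighted_edges is a list of (u, v, weight) tuples with 0-based node ids.
    Returns (best_weight, partition), where partition[i] is the side (0 or 1)
    assigned to node i. Runtime grows as 2**n_nodes, so this is for toy
    graphs only.
    """
    best_weight = float("-inf")
    best_partition = None
    # Node 0 is fixed on side 0 so that mirror-image partitions are skipped.
    for rest in product((0, 1), repeat=n_nodes - 1):
        side = (0,) + rest
        # An edge contributes to the cut only if its endpoints are separated.
        weight = sum(w for u, v, w in weighted_edges if side[u] != side[v])
        if weight > best_weight:
            best_weight, best_partition = weight, side
    return best_weight, best_partition


# Hypothetical toy graph: a 4-cycle with one diagonal.
edges = [(0, 1, 3.0), (1, 2, 1.0), (2, 3, 4.0), (3, 0, 2.0), (0, 2, 5.0)]
weight, partition = max_cut_brute_force(4, edges)
print(weight, partition)  # 12.0 (0, 1, 1, 0)
```

Exhaustive search is used here only because it makes the definition explicit; practical max-cut instances generally call for dedicated exact or approximate solvers.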
Conclusions: NoiseCut is a Python package that implements hybrid modeling for the classification of binary data. It facilitates the structured integration of mechanistic knowledge about the input features into learning from data, and proves to be a valuable classification tool when the available training data is noisy and/or limited in size. This advantage is especially prominent in medical and biomedical applications, where data scarcity and noise are common challenges. The codebase, illustrations, and documentation for NoiseCut are available for download at https://pypi.org/project/noisecut/. The implementation detailed in this paper corresponds to version 0.2.1 of the software.
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11031902 | PMC
http://dx.doi.org/10.1186/s12859-024-05769-8 | DOI Listing
Eur J Obstet Gynecol Reprod Biol
January 2025
Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, University of Southern California, Los Angeles, CA, USA; Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Los Angeles General Medical Center, Los Angeles, CA, USA; Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA.
Objective: To assess clinical and obstetric characteristics associated with pregnant patients with a diagnosis of attention-deficit hyperactivity disorder (ADHD).
Methods: This serial cross-sectional study queried the Agency of Healthcare Research and Quality's Healthcare Cost and Utilization Project National Inpatient Sample. The study population was 16,759,786 hospital deliveries from 2016 to 2020.
Sci Rep
January 2025
College of Information Science and Technology, Hainan Normal University, Haikou, 571158, China.
Breast cancer is one of the most aggressive types of cancer, and its early diagnosis is crucial for reducing mortality rates and ensuring timely treatment. Computer-aided diagnosis systems provide automated mammography image processing, interpretation, and grading. However, since the currently existing methods suffer from such issues as overfitting, lack of adaptability, and dependence on massive annotated datasets, the present work introduces a hybrid approach to enhance breast cancer classification accuracy.
Sci Rep
January 2025
Faculty of Engineering, Université de Moncton, Moncton, NB, E1A3E9, Canada.
Diabetes is a growing health concern in developing countries, causing considerable mortality rates. While machine learning (ML) approaches have been widely used to improve early detection and treatment, several studies have shown low classification accuracies due to overfitting, underfitting, and data noise. This research employs parallel and sequential ensemble ML approaches paired with feature selection techniques to boost classification accuracy.
Sensors (Basel)
January 2025
Space Robotics Research Group (SpaceR), Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, L-1855 Luxembourg, Luxembourg.
Malaria remains a global health concern, with 249 million cases and 608,000 deaths being reported by the WHO in 2022. Traditional diagnostic methods often struggle with inconsistent stain quality, lighting variations, and limited resources in endemic regions, making manual detection time-intensive and error-prone. This study introduces an automated system for analyzing Romanowsky-stained thick blood smears, focusing on image quality evaluation, leukocyte detection, and malaria parasite classification.
Sensors (Basel)
January 2025
School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK.
Elephant sound identification is crucial in wildlife conservation and ecological research. The identification of elephant vocalizations provides insights into the behavior, social dynamics, and emotional expressions, leading to elephant conservation. This study addresses elephant sound classification utilizing raw audio processing.