Missing values (NA) are common in cancer research, for reasons such as data protection, data loss, or missing follow-up data. Such incomplete patient information can affect prediction models and other data analyses. Imputation methods are one tool for handling NA. Cancer data are often recorded in ordered categorical form, such as tumour grading and staging, which requires dedicated methods. This work compares mode imputation, k-nearest-neighbour (knn) imputation, and, within the framework of Multiple Imputation by Chained Equations (MICE), a proportional odds logistic regression model (mice_polr) and random forest (mice_rf) on a real-world prostate cancer dataset provided by the Cancer Registry of Rhineland-Palatinate in Germany. Our dataset contains the information relevant for the risk classification of patients and the time between date of diagnosis and date of death. For the imputation comparison, we use Rubin's (1976) Missing Completely At Random (MCAR) mechanism to remove 10%, 20%, 30%, and 50% of the observations. The results are evaluated and ranked by accuracy per patient. For every percentage of NA, mice_rf achieves significantly higher accuracy than all other methods, followed by knn, while mice_polr performs significantly worse than all other methods. Furthermore, our findings indicate that the accuracy of imputation methods increases with a lower number of categories, a relatively even proportion of patients in the categories, or a majority of patients in a particular category.
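
The evaluation scheme described in the abstract (remove values under MCAR, impute, score against the known truth) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the category labels `G1` to `G3` and the helper names `mask_mcar`, `impute_mode`, and `imputation_accuracy` are hypothetical, only the mode-imputation baseline is shown, and accuracy is computed over a single variable rather than per patient as in the study.

```python
import random
from collections import Counter


def mask_mcar(values, fraction, rng):
    """Return a copy of `values` with `fraction` of the entries set to None.

    The positions are drawn uniformly at random, independent of the values,
    which is what Missing Completely At Random (MCAR) means.
    """
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_mode(values):
    """Replace every None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def imputation_accuracy(truth, masked, imputed):
    """Share of masked positions where the imputed value equals the truth."""
    idx = [i for i, v in enumerate(masked) if v is None]
    if not idx:
        return float("nan")
    return sum(truth[i] == imputed[i] for i in idx) / len(idx)


rng = random.Random(0)
# Synthetic stand-in for an ordered categorical variable (hypothetical labels),
# with a majority of cases in one category.
truth = rng.choices(["G1", "G2", "G3"], weights=[0.2, 0.6, 0.2], k=1000)

results = {}
for fraction in (0.10, 0.20, 0.30, 0.50):
    masked = mask_mcar(truth, fraction, rng)
    imputed = impute_mode(masked)
    results[fraction] = imputation_accuracy(truth, masked, imputed)
    print(f"{fraction:.0%} missing: accuracy = {results[fraction]:.3f}")
```

Because mode imputation always predicts the majority category, its accuracy in this sketch tracks the share of that category among the removed values, which is consistent with the abstract's remark that a dominant category raises imputation accuracy.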

Source
http://dx.doi.org/10.3233/SHTI240780

