Large-scale biobanks that link genetic data, rich phenotypes, and biological measures offer a powerful opportunity for scientific discovery. However, real-world collections frequently have extensive missingness. Although missing data can be predicted, performance is significantly impaired by the block-wise missingness inherent to many biobanks. To address this, we developed Missingness Adapted Group-wise Informed Clustered (MAGIC)-LASSO, which performs hierarchical clustering of variables based on missingness, followed by sequential Group LASSO within clusters. Variables are pre-filtered for missingness and for balance between training and target sets, and final models are built using stepwise inclusion of features ranked by completeness. This research was conducted using the UK Biobank (>500k) to predict unmeasured Alcohol Use Disorders Identification Test (AUDIT) scores. The phenotypic correlation between measured and predicted total score was 0.67, while genetic correlations between independent subjects were high (>0.86). Phenotypic and genetic correlations in the real-data application, as well as in simulations, demonstrate that the method has significant accuracy and utility for increasing power for genetic loci discovery.
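
The first stage described in the abstract, grouping variables by their missingness pattern, can be illustrated with a small sketch. This is not the authors' implementation: the Hamming distance between missingness indicators, the single-linkage merge rule, the `threshold` value, and all function and column names are assumptions chosen for illustration, and the subsequent Group LASSO stage is not shown.

```python
from itertools import combinations


def missingness_mask(rows, columns):
    """Map each column to a tuple of 0/1 flags (1 = value is missing, i.e. None)."""
    return {c: tuple(1 if r.get(c) is None else 0 for r in rows) for c in columns}


def hamming(a, b):
    """Fraction of rows on which two columns disagree about missingness."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_missingness(rows, columns, threshold=0.2):
    """Single-linkage agglomerative clustering of columns.

    Merging stops once the closest pair of clusters is farther apart
    than `threshold` (an illustrative cut-off, not a published value).
    """
    masks = missingness_mask(rows, columns)
    clusters = [[c] for c in columns]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest members.
            d = min(hamming(masks[a], masks[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data with block-wise missingness: columns a and b are missing for the
# last three subjects, c and d for the first two, and e is fully observed.
columns = ["a", "b", "c", "d", "e"]
rows = [
    {"a": 1.0, "b": 2.0, "c": None, "d": None, "e": 0.1},
    {"a": 1.1, "b": 2.1, "c": None, "d": None, "e": 0.2},
    {"a": 1.2, "b": 2.2, "c": 3.0, "d": 4.0, "e": 0.3},
    {"a": None, "b": None, "c": 3.1, "d": 4.1, "e": 0.4},
    {"a": None, "b": None, "c": 3.2, "d": 4.2, "e": 0.5},
    {"a": None, "b": None, "c": 3.3, "d": 4.3, "e": 0.6},
]
clusters = cluster_by_missingness(rows, columns)
print(clusters)
```

On this toy input, columns that share a missingness block end up in the same cluster, which is the property a per-cluster model-fitting step would rely on.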

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10399453
DOI: http://dx.doi.org/10.3389/fgene.2023.1162690

Publication Analysis

Top Keywords

missingness adapted: 8
informed clustered: 8
clustered magic-lasso: 8
power genetic: 8
genetic loci: 8
loci discovery: 8
genetic correlations: 8
missingness: 6
genetic: 5
adapted group: 4

Similar Publications

Imputation methods for mixed datasets in bioarchaeology.

Archaeol Anthropol Sci

October 2024

Department of Anthropology, McMaster University, Hamilton, Canada.

Missing data is a prevalent problem in bioarchaeological research and imputation could provide a promising solution. This work simulated missingness on a control dataset (481 samples × 41 variables) in order to explore imputation methods for mixed data (qualitative and quantitative data). The tested methods included Random Forest (RF), PCA/MCA, factorial analysis for mixed data (FAMD), hotdeck, predictive mean matching (PMM), random samples from observed values (RSOV), and a multi-method (MM) approach for the three missingness mechanisms (MCAR, MAR, and MNAR) at levels of 5%, 10%, 20%, 30%, and 40% missingness.

Adaptation and Validation of the Psychological Consequences of Screening Questionnaire (PCQ) for Cognitive Screening in Primary Care.

Med Decis Making

November 2024

Center for Applied Health Research on Aging (CAHRA), Institute for Public Health and Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA.

Background: Context-specific measures with adequate external validity are needed to appropriately determine psychosocial effects related to screening for cognitive impairment.

Methods: Two hundred adults aged ≥65 years who had recently completed routine, standardized cognitive screening as part of their Medicare annual wellness visit were administered an adapted version of the Psychological Consequences of Screening Questionnaire (PCQ), composed of negative (PCQ-Neg) and positive (PCQ-Pos) scales. Measure distribution, acceptability, internal consistency, factor structure, and external validity (construct, discriminative, criterion) were analyzed.

Targeted proteomics, which includes parallel reaction monitoring (PRM), is typically utilized for more precise detection and quantitation of key proteins and/or pathways derived from complex discovery proteomics datasets. Initial discovery-based analysis using data independent acquisition (DIA) can obtain deep proteome coverage with low data missingness while targeted PRM assays can provide additional benefits in further eliminating missing data and optimizing measurement precision. However, PRM method development from bioinformatic predictions can be tedious and time-consuming because of the DIA output complexity.

Multi-modality risk prediction of cardiovascular diseases for breast cancer cohort in the All of Us Research Program.

J Am Med Inform Assoc

December 2024

Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States.

Article Synopsis
  • This study aims to create a predictive model for cardiovascular disease (CVD) in breast cancer survivors using diverse data from the All of Us Research Program, focusing on fairness across different demographics.
  • The researchers developed a universal data pipeline to integrate various data types, such as electronic health records, patient surveys, and genomic information, and applied models like Adaptive Lasso and Random Forest to predict CVD outcomes over a 10-year span.
  • Results show that the Adaptive Lasso model performed well overall, while the Random Forest model was particularly strong for predicting certain events; factors like age and prior heart issues were key predictors, highlighting the importance of social determinants of health in understanding patient outcomes.