With the growth of clinical risk prediction models that use genetic data, there is an increasing need for appropriate methods to select an optimal subset of features from a large number of genetic variants, many of which are highly redundant because of linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria are well suited to this challenge: they identify a subset of the original variables that should yield more accurate prediction. However, data collected from cohort studies are often high-dimensional and include potential confounders, which presents challenges for both feature selection and machine learning risk prediction models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30%, and identifying high-risk patients is an important clinical research goal that would allow early intervention and a reduction in disability. This setting is therefore well suited to developing clinical risk prediction models and to exploring the application of information theoretic criteria. In this study, we developed feature selection and PsA risk prediction models and applied them to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases, using 2-digit HLA alleles imputed with the SNP2HLA algorithm. We also developed a stratification method to mitigate the impact of potential confounding features, and we show that such features affect feature selection. The mitigated dataset was split at random: 80% was used to train seven supervised machine learning methods with stratified nested cross validation, and 20% was kept as a holdout set for internal validation.
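As a rough illustration of filter feature selection with an information theoretic criterion, the sketch below ranks binary features by their empirical mutual information with the case/control label. This is the simplest such criterion (ranking by relevance alone) and is not the authors' pipeline; criteria such as ICAP additionally account for redundancy between features. The feature names and toy data are invented for the example.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum over observed (x, y) pairs only; unobserved pairs contribute zero.
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def rank_features(features, labels):
    """Rank features by mutual information with the label, highest first."""
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): 1 = allele present / case, 0 = absent / control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "allele_a": [1, 1, 1, 0, 0, 0, 0, 0],  # partly informative
    "allele_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
    "allele_c": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly informative
}

ranking = rank_features(features, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

A relevance-only ranking like this would tend to select several variants in strong LD with one another, which is the motivation for criteria that penalise redundancy.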
The risk prediction models were then further validated in a UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset. Performance was evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis (net benefit). The best model was selected on three criteria: the smallest feature subset, the maximal average AUC over the nested cross validation, and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. After mitigation, the seven feature selection methods ranked HLA_B_*27 as the single most important genetic feature, consistent with previous analyses of these data using regression-based methods. However, the predictive accuracy of these single features after mitigation was modest (AUC = 0.54 in internal cross validation, 0.53 on the internal holdout set and 0.55 on the external dataset). Sequentially adding further HLA features by rank improved the performance of the Random Forest classification model: 20 2-digit features selected by Interaction Capping (ICAP) gave AUC = 0.61 (internal cross validation), 0.57 (internal holdout set) and 0.58 (external dataset). The stratification method for mitigating confounding features, combined with filter information theoretic feature selection, can be applied to high-dimensional datasets with potential confounders.
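The net benefit used in decision curve analysis can be computed directly from predicted probabilities. The sketch below uses the standard definition, net benefit = TP/n − (FP/n) × pt/(1 − pt) at threshold probability pt, and compares a model against the "treat all" strategy. The function names and toy values are assumptions made for illustration and are not taken from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data (hypothetical): 1 = PsA case, 0 = psoriasis without arthritis.
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.3, 0.4, 0.35, 0.05]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
print(f"model: {nb_model:.2f}, treat all: {nb_all:.2f}")
```

A decision curve repeats this calculation over a range of thresholds; a model is useful at a given threshold only if its net benefit exceeds both the treat-all value and zero (treat none).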

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8640070
DOI: http://dx.doi.org/10.1038/s41598-021-00854-x
