Feature selection and machine learning algorithms can be used to analyze single nucleotide polymorphism (SNP) data and identify potential disease biomarkers. For identified biomarkers to be useful in clinical research, they must be reproducible; however, both the genotyping platform and the criteria used to select individuals for genotyping affect reproducibility. To assess biomarker reproducibility, we collected five SNP datasets from the database of Genotypes and Phenotypes (dbGaP) and explored several data integration strategies. Although combining datasets can reduce classification accuracy, it has the potential to improve the reproducibility of candidate biomarkers. We evaluated the agreement among the strategies in terms of the SNPs identified as potential Parkinson's disease (PD) biomarkers. Our findings indicate that, on average, 93% of the SNPs identified in a single dataset fail to be identified in other datasets; through dataset integration, this lack of replication is reduced to 62%. We found fifty SNPs that were identified at least twice and could potentially serve as novel PD biomarkers. These SNPs are indirectly linked to PD in the literature but have not previously been directly associated with it. These findings open up new potential avenues of investigation.
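
The replication figures above can be illustrated with a minimal sketch. It assumes one plausible reading of the metric, which the abstract does not define: a SNP selected in one dataset counts as "not replicated" if it is selected in none of the other datasets, and the rate is averaged over datasets. The function names, dataset labels, and rs identifiers are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, Set


def non_replication_rate(selected: Dict[str, Set[str]]) -> float:
    """Average, over datasets, of the fraction of a dataset's selected SNPs
    that are selected in no other dataset (assumed definition)."""
    rates = []
    for name, snps in selected.items():
        if not snps:
            continue  # a dataset with no selected SNPs contributes nothing
        # Union of the SNPs selected in every other dataset.
        others = set().union(*(s for n, s in selected.items() if n != name))
        rates.append(len(snps - others) / len(snps))
    return sum(rates) / len(rates) if rates else 0.0


def replicated_snps(selected: Dict[str, Set[str]], min_count: int = 2) -> Set[str]:
    """SNPs selected in at least `min_count` datasets."""
    counts = Counter(snp for snps in selected.values() for snp in snps)
    return {snp for snp, c in counts.items() if c >= min_count}


# Hypothetical selections from three datasets (placeholder identifiers).
example = {
    "dataset_A": {"rs1", "rs2", "rs3", "rs4"},
    "dataset_B": {"rs4", "rs5"},
    "dataset_C": {"rs6"},
}
print(non_replication_rate(example))
print(replicated_snps(example))
```

Under this reading, the set returned by `replicated_snps` with `min_count=2` corresponds to the SNPs "identified at least twice"; a different definition of replication would change both numbers.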
DOI: http://dx.doi.org/10.1016/j.compbiomed.2024.108407