Machine learning random forest for predicting oncosomatic variant NGS analysis.

Eric Pellegrino, Coralie Jacques, Nathalie Beaufils, Isabelle Nanni, Antoine Carlioz, Philippe Metellus, L'Houcine Ouafik

Sci Rep

APHM, CHU Nord, Service d'Onco-Biologie, Marseille, France.

Published: November 2021

Since 2017, we have used the IonTorrent next-generation sequencing (NGS) platform in our hospital to diagnose and treat cancer. Analyzing the variants from each run takes considerable time, and we still struggle with some variants that look correct on their metrics at first but prove to be negative on further investigation. Can a machine learning (ML) algorithm help us classify NGS variants? This question led us to investigate which ML algorithm best fits our NGS data and to develop a tool that can be used routinely to help biologists.
Currently, one of the greatest challenges in medicine is processing large quantities of data. This is particularly true in molecular biology, where NGS is used to profile tumors at the molecular level and to guide their treatment. In addition to bioinformatics pipelines, artificial intelligence (AI) can be valuable in helping to analyze mutation variants. Generating sequencing data from patient DNA samples has become easy in clinical trials. However, analyzing massive quantities of genomic or transcriptomic data and extracting the key biomarkers associated with a clinical response to a specific therapy requires a formidable combination of scientific expertise, biomolecular skills and a panel of bioinformatic and biostatistic tools, an area in which AI is now helping to develop future routine diagnostics. In addition, cancer genome complexity and technical artifacts make identifying real variants challenging.
We present a machine learning method for classifying pathogenic single nucleotide variants (SNVs), single nucleotide polymorphisms (SNPs), multiple nucleotide variants (MNVs), insertions, and deletions detected by NGS in different types of tumor specimens, including colorectal cancer, melanoma, lung cancer and glioma.
We evaluated several machine learning algorithms on our NGS data using k-fold cross-validation, and also evaluated neural networks (deep learning), to measure their performance and determine which model is valid for confirming NGS variant calls in cancer diagnosis. The data were extracted from our local database and described by 7 parameters: chromosome, position, exon, variant allele frequency, minor allele frequency, coverage and protein description. We trained the models on 70% of the 102,011 variants and validated them on the remaining 30%. The model offering the best accuracy was chosen and implemented in the NGS analysis routine. The models were developed in R version 3.6.0.
The lowest error rate (0.22%) was obtained with random forest (RF; ntree = 500 and mtry = 4), with an AUC of 0.99. Neural networks also scored well: the final trained neural network achieved an accuracy of 98% and an ROC-AUC of 0.99 on the validation data. We then used our RF model to interpret more than 2000 variants from our NGS database: 20 variants were misclassified (error rate < 1%). The errors were nomenclature problems and false positives. After we added the false positives to our training database and put the RF model into routine use, the error rate was always below 0.5%.
The RF model shows excellent results for oncosomatic NGS interpretation and can easily be implemented in other molecular biology laboratories. AI is becoming increasingly important in molecular biomedical analysis and can be very helpful in processing medical data. Neural networks show a good capacity for variant classification, and in the future they may be useful in predicting more complex variants.
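
The evaluation protocol described in the abstract (a 70/30 holdout split, k-fold cross-validation, and an error rate) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code: the function names, the random seed, and the choice of k = 5 are assumptions, since the abstract does not report them, and no classifier is trained here.

```python
import random


def holdout_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into train and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=5):
    """Yield (train, test) index lists; each sample is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def error_rate(true_labels, predicted_labels):
    """Fraction of predictions that disagree with the reference labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# 102,011 variants, as reported in the abstract; 70% go to training.
train_idx, valid_idx = holdout_split(102_011)
print(len(train_idx), len(valid_idx))

# Cross-validate inside the training portion only (k = 5 is an assumption).
for fold_train, fold_test in k_fold_indices(train_idx, k=5):
    assert not set(fold_train) & set(fold_test)  # folds never overlap
```

With exactly 2000 variants and 20 misclassifications, `error_rate` returns 0.01; because the abstract reports more than 2000 variants, the observed rate falls below 1%. For readers reproducing the model in scikit-learn rather than R, the `ntree` and `mtry` parameters of the R `randomForest` package correspond to `n_estimators` and `max_features` in `RandomForestClassifier`.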

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8575902
DOI: http://dx.doi.org/10.1038/s41598-021-01253-y
