A random forest classifier for detecting rare variants in NGS data from viral populations.

Comput Struct Biotechnol J

School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA.

Published: July 2017

We propose a random forest classifier for distinguishing rare variants from sequencing errors in Next Generation Sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or as rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes produces 4 to 500 times fewer false positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and yields fewer false positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
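
The idea of describing each k-mer by counts at several k-mer lengths can be sketched in a few lines of Python. This is only an illustrative sketch, not the MultiRes implementation: the function names `count_kmers` and `multires_features`, the choice of sizes, and the feature layout (the count of a k-mer followed by the counts of its prefixes at smaller sizes) are all assumptions made for the example. The toy reads are invented.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multires_features(reads, sizes):
    """Build one feature vector per k-mer of the largest size.

    Hypothetical layout (an assumption, not taken from the paper):
    [count of the k-mer, count of its prefix at each smaller size].
    """
    sizes = sorted(sizes, reverse=True)
    tables = {k: count_kmers(reads, k) for k in sizes}
    largest = sizes[0]
    return {
        kmer: [tables[k][kmer[:k]] for k in sizes]
        for kmer in tables[largest]
    }


# Toy data: three identical reads and one read with a single substitution.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
features = multires_features(reads, sizes=[4, 3, 2])

for kmer, vector in sorted(features.items()):
    print(kmer, vector)
```

In this toy input, the 4-mers unique to the fourth read each have a count of 1, while their shorter prefixes may still be well supported. Vectors of this kind, paired with labels for known erroneous and true k-mers, are the sort of input a random forest classifier could be trained on.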

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5548337
DOI: http://dx.doi.org/10.1016/j.csbj.2017.07.001
