Lean and deep models for more accurate filtering of SNP and INDEL variant calls.

Sam Friedman, Laura Gauthier, Yossi Farjoun, Eric Banks

Bioinformatics

Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.

Published: April 2020

The study explores how convolutional neural networks (CNNs) can be used to filter small genomic variants in short-read DNA sequence data, where errors introduced during sequencing and library preparation make variant calling difficult.
The CNNs are trained to classify variants as real variation or artifacts, and visualizing the learned weights confirms that they detect biologically plausible DNA motifs.
The approach improves both sensitivity and precision over existing filtration methods, using a tensor encoding tailored to the structure of genomic data.

Summary: We investigate convolutional neural networks (CNNs) for filtering small genomic variants in short-read DNA sequence data. Errors created during sequencing and library preparation make variant calling a difficult task. Encoding the reference genome and aligned reads covering sites of genetic variation as numeric tensors allows us to leverage CNNs for variant filtration. Convolutions over these tensors learn to detect motifs useful for classifying variants. Variant filtering models are trained to classify variants as artifacts or real variation. Visualizing the learned weights of the CNN confirmed it detects familiar DNA motifs known to correlate with real variation, like homopolymers and short tandem repeats (STR). After confirmation of the biological plausibility of the learned features, we compared our model to current state-of-the-art filtration methods like Gaussian Mixture Models, Random Forests and CNNs designed for image classification, like DeepVariant. We demonstrate improvements in both sensitivity and precision. The tensor encoding was carefully tailored for processing genomic data, respecting the qualitative differences in structure between DNA and natural images. Ablation tests quantitatively measured the benefits of our tensor encoding strategy. Bayesian hyper-parameter optimization confirmed our notion that architectures designed with DNA data in mind outperform off-the-shelf image classification models. Our cross-generalization analysis identified idiosyncrasies in truth resources, pointing to the need for new methods to construct genomic truth data. Our results show that models trained on heterogeneous data types and diverse truth resources generalize well to new datasets, negating the need to train separate models for each data type.
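
The idea of encoding sequence as a numeric tensor and convolving over it to detect motifs can be illustrated with a minimal sketch. This is not the encoding or architecture used by CNNScoreVariants: it covers only a reference window (no aligned reads), and the helper names (`one_hot`, `conv1d`, `homopolymer_kernel`) and the example sequence are hypothetical. In a trained CNN the filter weights are learned from data; here a homopolymer filter is built by hand purely to show how a convolution responds to a motif.

```python
# Illustrative sketch only; not the GATK CNNScoreVariants tensor encoding.
BASES = "ACGT"  # assumed channel order for this example


def one_hot(seq):
    """Encode a DNA string as rows of 4 channels (A, C, G, T).

    Bases outside ACGT (for example N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d(tensor, kernel):
    """Valid (unpadded) 1D cross-correlation along the sequence axis."""
    width = len(kernel)
    out = []
    for start in range(len(tensor) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(BASES)):
                total += tensor[start + offset][channel] * kernel[offset][channel]
        out.append(total)
    return out


def homopolymer_kernel(base, width):
    """Hand-built filter that responds most strongly to `width` copies of `base`."""
    return one_hot(base * width)


# Hypothetical reference window containing a run of five A's.
reference_window = "GATTACAAAAAGCT"
activations = conv1d(one_hot(reference_window), homopolymer_kernel("A", 4))

# Position of the first maximal response, i.e. where the A-run starts matching fully.
best = max(range(len(activations)), key=activations.__getitem__)
print(best, activations[best])
```

The activation at each position counts how many bases in the window match the filter, so it peaks where the homopolymer run fully overlaps the kernel. A learned filter would play the same role with real-valued weights, and the paper's tensors additionally carry read-level information that this sketch omits.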

Availability And Implementation: This work is available in the Genome Analysis Toolkit (GATK) with the tool name CNNScoreVariants (https://github.com/broadinstitute/gatk).

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btz901

