Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.

Patterns (N Y)

Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, IL 61801, USA.

Published: September 2020

AI Article Synopsis

Article Abstract

Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise -mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the -mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven -mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660437PMC
http://dx.doi.org/10.1016/j.patter.2020.100081DOI Listing

Publication Analysis

Top Keywords

jaccard similarity
20
spectral jaccard
12
approach estimating
8
pairwise sequence
8
proxy alignment
8
alignment size
8
alignment sizes
8
similarity
6
alignment
5
spectral
4

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!