Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers.

Bioinformatics

Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, NSW 2007, Australia.

Published: November 2019

Motivation: Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparisons. To efficiently compare larger and larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach which saves the computational resources by avoiding a significant number of unnecessary matches.

Results: Under this framework, we proposed a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence, and adds these selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested by the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with the existing methods, much less number of query k-mers are generated and much less k-mers are inserted into the index to avoid unnecessary matches, leading to an efficient matching process and memory usage savings. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. This performance is mainly attributed to the key novelty of our method that the fixed k-mer sampling must be conducted on the query sequence and the index k-mers are filtered from the reference sequence via a Bloom filter.

Availability And Implementation: https://github.com/yuansliu/bfMEM.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Download full-text PDF

Source
http://dx.doi.org/10.1093/bioinformatics/btz273DOI Listing

Publication Analysis

Top Keywords

query k-mers
12
k-mers
10
detection maximal
8
maximal exact
8
exact matches
8
fixed sampling
8
k-mers bloom
8
number query
8
query sequence
8
bloom filter
8

Similar Publications

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation.

View Article and Find Full Text PDF

Genomic data sequencing is crucial for understanding biological systems. As genomic databases like the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets.

View Article and Find Full Text PDF

Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely-related organisms poses challenges as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life.

View Article and Find Full Text PDF

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

J Comput Biol

October 2024

Department of Computer Science, University of Maryland, College Park, Maryland, USA.

We describe lossless compressed data structures for the de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from -mers to their . The color set of a -mer is the set of all identifiers, or , of the references that contain the -mer.

View Article and Find Full Text PDF

Memory-bound -mer selection for large and evolutionarily diverse reference libraries.

Genome Res

October 2024

Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA;

Using -mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. Although the increased density provides hope for improvements in accuracy, scalability is a concern.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!