Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers.

Bioinformatics

Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, NSW 2007, Australia.

Published: November 2019

Motivation: Detection of maximal exact matches (MEMs) between two long sequences is a fundamental problem in pairwise reference-query genome comparisons. To efficiently compare larger and larger genomes, reducing the number of indexed k-mers as well as the number of query k-mers has been adopted as a mainstream approach which saves the computational resources by avoiding a significant number of unnecessary matches.

Results: Under this framework, we proposed a new method to detect all MEMs from a pair of genomes. The method first performs a fixed sampling of k-mers on the query sequence, and adds these selected k-mers to a Bloom filter. Then all the k-mers of the reference sequence are tested by the Bloom filter. If a k-mer passes the test, it is inserted into a hash table for indexing. Compared with the existing methods, much less number of query k-mers are generated and much less k-mers are inserted into the index to avoid unnecessary matches, leading to an efficient matching process and memory usage savings. Experiments on large genomes demonstrate that our method is at least 1.8 times faster than the best of the existing algorithms. This performance is mainly attributed to the key novelty of our method that the fixed k-mer sampling must be conducted on the query sequence and the index k-mers are filtered from the reference sequence via a Bloom filter.

Availability And Implementation: https://github.com/yuansliu/bfMEM.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Download full-text PDF	Source
http://dx.doi.org/10.1093/bioinformatics/btz273	DOI Listing

Publication Analysis

Top Keywords

query k-mers

k-mers

detection maximal

maximal exact

exact matches

fixed sampling

k-mers bloom

number query

query sequence

bloom filter

Similar Publications

Mem-based pangenome indexing for k-mer queries.

Algorithms Mol Biol

March 2025

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.

Stephen Hwang Nathaniel K Brown Omar Y Ahmed Katharine M Jenike Sam Kovaka

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation.

View Article and Find Full Text PDF

Similar Publications

The backpack quotient filter: A dynamic and space-efficient data structure for querying -mers with abundance.

iScience

December 2024

University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France.

Victor Levallois Francesco Andreace Bertrand Le Gal Yoann Dufresne Pierre Peterlongo

Genomic data sequencing is crucial for understanding biological systems. As genomic databases like the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets.

View Article and Find Full Text PDF

Similar Publications

Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection.

Brief Bioinform

September 2024

Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India.

Gulshan Kumar Sharma Rakesh Sharma Kavita Joshi Sameer Qureshi Shubhita Mathur

Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely-related organisms poses challenges as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life.

View Article and Find Full Text PDF

Similar Publications

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

J Comput Biol

October 2024

Department of Computer Science, University of Maryland, College Park, Maryland, USA.

Alessio Campanelli Giulio Ermanno Pibiri Jason Fan Rob Patro

We describe lossless compressed data structures for the de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from -mers to their . The color set of a -mer is the set of all identifiers, or , of the references that contain the -mer.

View Article and Find Full Text PDF

Similar Publications

Memory-bound -mer selection for large and evolutionarily diverse reference libraries.

Genome Res

October 2024

Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA;

Ali Osman Berk Şapcı Siavash Mirarab

Using -mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. Although the increased density provides hope for improvements in accuracy, scalability is a concern.

View Article and Find Full Text PDF

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!