Sapling: accelerating suffix array queries with learned data models.

Bioinformatics

Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.

Published: May 2021

Motivation: As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query, it requires widespread memory accesses, causing a large number of cache misses on large datasets.

Results: Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array's memory footprint.

Availability And Implementation: The source code and tutorial are available open-source at https://github.com/mkirsche/sapling.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Download full-text PDF

Source
http://dx.doi.org/10.1093/bioinformatics/btaa911DOI Listing

Publication Analysis

Top Keywords

suffix array
12
learned data
8
data models
8
sequence alignment
8
binary search
8
data
6
sapling accelerating
4
accelerating suffix
4
array queries
4
queries learned
4

Similar Publications

Unipept, a pioneering software tool in metaproteomics, has significantly advanced the analysis of complex ecosystems by facilitating both taxonomic and functional insights from environmental samples. From the onset, Unipept's capabilities focused on tryptic peptides, utilizing the predictability and consistency of trypsin digestion to efficiently construct a protein reference database. However, the evolving landscape of proteomics and emerging fields like immunopeptidomics necessitate a more versatile approach that extends beyond the analysis of tryptic peptides.

View Article and Find Full Text PDF

In this paper, a high-speed and real-time underwater wireless optical communication (UWOC) system based on orthogonal frequency division multiplexing (OFDM) is designed and demonstrated using the field programmable gate array (FPGA) with a miniaturized demo board designed and made by ourselves. Through the parallel signal processing mode (i.e.

View Article and Find Full Text PDF

Building a pangenome alignment index via recursive prefix-free parsing.

iScience

October 2024

Department of Computer and Information Science and Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, FL 32607, USA.

Article Synopsis
  • Pangenomics alignment helps reduce bias in biomedical research by allowing multiple genomes to be indexed, unlike traditional methods that focus on a single reference genome.
  • New aligners like VG, Giraffe, and Moni use advanced techniques to manage the complexities of indexing more genomes while tackling challenges related to data size.
  • The proposed method for scaling Moni improves efficiency by optimizing the construction of key data structures (RLBWT, suffix array, and LCP) through recursive prefix-free parsing, which is crucial for handling large pangenomes like those from the Human Pangenome Reference Consortium.
View Article and Find Full Text PDF

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use -mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can ■ build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; ■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; ■ find the minimum and maximum values stored in that interval; ■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

View Article and Find Full Text PDF

Motivation: The Full-text index in Minute space (FM-index) is a memory-efficient data structure widely used in bioinformatics for solving the fundamental pattern-matching task of searching for short patterns within a long reference. With the demand for short query patterns, the k-ordered concept has been proposed for FM-indexes. However, few construction algorithms in the state of the art fully exploit this idea to achieve significant speedups in the pan-genome era.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!