Sapling: accelerating suffix array queries with learned data models.

Melanie Kirsche Arun Das Michael C Schatz

Bioinformatics

Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA.

Published: May 2021

Motivation: As genomic data becomes more abundant, efficient algorithms and data structures for sequence alignment become increasingly important. The suffix array is a widely used data structure to accelerate alignment, but the binary search algorithm used to query, it requires widespread memory accesses, causing a large number of cache misses on large datasets.

Results: Here, we present Sapling, an algorithm for sequence alignment, which uses a learned data model to augment the suffix array and enable faster queries. We investigate different types of data models, providing an analysis of different neural network models as well as providing an open-source aligner with a compact, practical piecewise linear model. We show that Sapling outperforms both an optimized binary search approach and multiple widely used read aligners on a diverse collection of genomes, including human, bacteria and plants, speeding up the algorithm by more than a factor of two while adding <1% to the suffix array's memory footprint.

Availability And Implementation: The source code and tutorial are available open-source at https://github.com/mkirsche/sapling.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Download full-text PDF	Source
http://dx.doi.org/10.1093/bioinformatics/btaa911	DOI Listing

Publication Analysis

Top Keywords

suffix array

learned data

data models

sequence alignment

binary search

data

sapling accelerating

accelerating suffix

array queries

queries learned

Similar Publications

Unipept in 2024: Expanding Metaproteomics Analysis with Support for Missed Cleavages and Semitryptic and Nontryptic Peptides.

J Proteome Res

January 2025

Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium.

Tibo Vande Moortele Bram Devlaminck Simon Van de Vyver Tim Van Den Bossche Lennart Martens

Unipept, a pioneering software tool in metaproteomics, has significantly advanced the analysis of complex ecosystems by facilitating both taxonomic and functional insights from environmental samples. From the onset, Unipept's capabilities focused on tryptic peptides, utilizing the predictability and consistency of trypsin digestion to efficiently construct a protein reference database. However, the evolving landscape of proteomics and emerging fields like immunopeptidomics necessitate a more versatile approach that extends beyond the analysis of tryptic peptides.

View Article and Find Full Text PDF

Similar Publications

4-Gbps low-latency FPGA-based underwater wireless optical communication.

Opt Express

October 2024

Tianyi Zhang Chao Fei Yuan Wang Ji Du Yitong Xie

In this paper, a high-speed and real-time underwater wireless optical communication (UWOC) system based on orthogonal frequency division multiplexing (OFDM) is designed and demonstrated using the field programmable gate array (FPGA) with a miniaturized demo board designed and made by ourselves. Through the parallel signal processing mode (i.e.

View Article and Find Full Text PDF

Similar Publications

Building a pangenome alignment index via recursive prefix-free parsing.

iScience

October 2024

Department of Computer and Information Science and Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, FL 32607, USA.

Eddie Ferro Marco Oliva Travis Gagie Christina Boucher

Article Synopsis

Pangenomics alignment helps reduce bias in biomedical research by allowing multiple genomes to be indexed, unlike traditional methods that focus on a single reference genome.
New aligners like VG, Giraffe, and Moni use advanced techniques to manage the complexities of indexing more genomes while tackling challenges related to data size.
The proposed method for scaling Moni improves efficiency by optimizing the construction of key data structures (RLBWT, suffix array, and LCP) through recursive prefix-free parsing, which is crucial for handling large pangenomes like those from the Human Pangenome Reference Consortium.

View Article and Find Full Text PDF

Similar Publications

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests.

Lebniz Int Proc Inform

July 2024

CeBiB & University of Chile, Chile.

Dominika Draesslerová Omar Ahmed Travis Gagie Jan Holub Ben Langmead

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use -mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can ■ build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; ■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; ■ find the minimum and maximum values stored in that interval; ■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

View Article and Find Full Text PDF

Similar Publications

Efficient Construction and Utilization of k-Ordered FM-indexes with kISS for Ultra-Fast Read Mapping in Large Genomes.

Bioinformatics

June 2024

Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan.

Zheng-Dao Yang Hsuan-Yu Kuo Po-Wei Hsieh Jui-Hung Hung

Motivation: The Full-text index in Minute space (FM-index) is a memory-efficient data structure widely used in bioinformatics for solving the fundamental pattern-matching task of searching for short patterns within a long reference. With the demand for short query patterns, the k-ordered concept has been proposed for FM-indexes. However, few construction algorithms in the state of the art fully exploit this idea to achieve significant speedups in the pan-genome era.

View Article and Find Full Text PDF

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!