Background: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections.

Result: In this paper we introduce gsufsort, an open source software for constructing the suffix array and related data indexing structures for a string collection with symbols in () time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings.

Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7507297PMC
http://dx.doi.org/10.1186/s13015-020-00177-yDOI Listing

Publication Analysis

Top Keywords

suffix array
20
constructing suffix
12
suffix arrays
12
string collections
12
construct suffix
12
suffix
8
data structures
8
arrays string
8
structures string
8
array
7

Similar Publications

Unipept, a pioneering software tool in metaproteomics, has significantly advanced the analysis of complex ecosystems by facilitating both taxonomic and functional insights from environmental samples. From the onset, Unipept's capabilities focused on tryptic peptides, utilizing the predictability and consistency of trypsin digestion to efficiently construct a protein reference database. However, the evolving landscape of proteomics and emerging fields like immunopeptidomics necessitate a more versatile approach that extends beyond the analysis of tryptic peptides.

View Article and Find Full Text PDF

In this paper, a high-speed and real-time underwater wireless optical communication (UWOC) system based on orthogonal frequency division multiplexing (OFDM) is designed and demonstrated using the field programmable gate array (FPGA) with a miniaturized demo board designed and made by ourselves. Through the parallel signal processing mode (i.e.

View Article and Find Full Text PDF

Building a pangenome alignment index via recursive prefix-free parsing.

iScience

October 2024

Department of Computer and Information Science and Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, FL 32607, USA.

Article Synopsis
  • Pangenomics alignment helps reduce bias in biomedical research by allowing multiple genomes to be indexed, unlike traditional methods that focus on a single reference genome.
  • New aligners like VG, Giraffe, and Moni use advanced techniques to manage the complexities of indexing more genomes while tackling challenges related to data size.
  • The proposed method for scaling Moni improves efficiency by optimizing the construction of key data structures (RLBWT, suffix array, and LCP) through recursive prefix-free parsing, which is crucial for handling large pangenomes like those from the Human Pangenome Reference Consortium.
View Article and Find Full Text PDF

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use -mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can ■ build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; ■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; ■ find the minimum and maximum values stored in that interval; ■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

View Article and Find Full Text PDF

Motivation: The Full-text index in Minute space (FM-index) is a memory-efficient data structure widely used in bioinformatics for solving the fundamental pattern-matching task of searching for short patterns within a long reference. With the demand for short query patterns, the k-ordered concept has been proposed for FM-indexes. However, few construction algorithms in the state of the art fully exploit this idea to achieve significant speedups in the pan-genome era.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!