Storage and retrieval of highly repetitive sequence collections.

J Comput Biol

Department of Computer Science, University of Helsinki, Helsinki, Finland .

Published: March 2010

A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occupies much space, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to, essentially, that of the compressed sequences, while retaining its functionality with only a polylogarithmic slowdown. However, the underlying compression model considers only the predictability of the next sequence symbol given the k previous ones, where k is a small integer. This is unable to capture longer-term repetitiveness. For example, r identical copies of an incompressible sequence will be incompressible under this model. We develop new static and dynamic full-text indexes that are able of capturing the fact that a collection is highly repetitive, and require space basically proportional to the length of one typical sequence plus the total number of edit operations. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree, achieving full functionality for sequence analysis, while retaining the reduced space and the polylogarithmic slowdown. Our experimental results confirm the practicality of our proposal.

Download full-text PDF

Source
http://dx.doi.org/10.1089/cmb.2009.0169DOI Listing

Publication Analysis

Top Keywords

suffix tree
12
highly repetitive
8
repetitive sequence
8
edit operations
8
polylogarithmic slowdown
8
sequence
6
storage retrieval
4
retrieval highly
4
sequence collections
4
collections repetitive
4

Similar Publications

The Medina classification separates true bifurcation lesions into three unnecessary groups: 1.1.1, 1.

View Article and Find Full Text PDF

HAlign 4: a new strategy for rapidly aligning millions of sequences.

Bioinformatics

November 2024

Department of Statistics, Stanford University, Stanford, CA 94305-4065, United States.

Motivation: HAlign is a high-performance multiple sequence alignment software based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.

View Article and Find Full Text PDF

Fast and accurate short-read alignment with hybrid hash-tree data structure.

Genomics Inform

October 2024

K.K. Dnaform, Ask Sanshin Building 3F, 2-6-29, Tsurumi-chuo, Tsurumi-ku, Yokohama, Kanagawa, 230-0051, Japan.

Rapidly increasing the amount of short-read data generated by NGSs (new-generation sequencers) calls for the development of fast and accurate read alignment programs. The programs based on the hash table (BLAST) and Burrows-Wheeler transform (bwa-mem) are used, and the latter is known to give superior performance. We here present a new algorithm, a hybrid of hash table and suffix tree, which we designed to speed up the alignment of short reads against large reference sequences such as the human genome.

View Article and Find Full Text PDF

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use -mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can ■ build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; ■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; ■ find the minimum and maximum values stored in that interval; ■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.

View Article and Find Full Text PDF

Finding maximal exact matches in graphs.

Algorithms Mol Biol

March 2024

Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, P.O. Box 68, Helsinki, 00014, Finland.

Background: We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least ( -MEMs).

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!