The Burrows-Wheeler transform (BWT) of short-read data has unexplored potential uses, such as efficient and sensitive variation analysis against multiple reference genome sequences, because, unlike conventional mapping-based methods, it does not depend on any particular reference genome sequence. However, since the amount of read data is generally much larger than the size of the reference sequence, computing the BWT of reads is not easy, and this hampers the development of such applications. To alleviate this problem, a new method for computing the BWT of reads in parallel is proposed. The BWT, which corresponds to a sorted list of the suffixes of the reads, is constructed incrementally by successively including longer and longer suffixes. The working data are divided into more than 10,000 "blocks", each corresponding to a sublist of suffixes sharing the same prefix. Thousands of groups of blocks can be processed in parallel while making exclusive writes and concurrent reads into a shared memory. Reads and writes are essentially sequential, and the read concurrency is limited to two. This fine-grained parallelism, referred to as prefix parallelism, is therefore expected to work efficiently. The time complexity for processing n reads of length l is O(nl). On real biological DNA sequence data of about 100 Gbp with a read length of 100 bp (base pairs), a tentative implementation of the proposed method took less than an hour on a single-node computer; i.e., it was about three times faster than one of the fastest programs developed to date.
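To make concrete what "the BWT of a read set, corresponding to a sorted list of suffixes" means, here is a deliberately naive Python sketch. The function name and the per-read "$" terminator convention are assumptions for illustration; the paper's prefix-parallel method builds the same sorted suffix list incrementally, without ever materializing and sorting all suffixes at once.

```python
def bwt_of_reads(reads):
    """Naive BWT of a read collection: append a '$' terminator to each
    read, collect every suffix together with the character that
    cyclically precedes it, sort the suffixes lexicographically, and
    emit the preceding characters in sorted-suffix order."""
    suffixes = []
    for read in reads:
        t = read + "$"
        for i in range(len(t)):
            suffixes.append((t[i:], t[i - 1]))  # t[-1] is '$' when i == 0
    suffixes.sort(key=lambda s: s[0])  # stable sort; ties keep read order
    return "".join(prev for _, prev in suffixes)

bwt = bwt_of_reads(["ACGT", "ACGA"])
```

This naive version costs O(nl) suffixes and a full sort over all of them, which is exactly the memory and time pressure that motivates splitting the sorted list into prefix blocks and processing them in parallel.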

Source: http://dx.doi.org/10.1109/TCBB.2018.2837749


Similar Publications

Pangenome indexes are promising tools for many applications, including the classification of nanopore sequencing reads. The move structure is a compressed index data structure based on the Burrows-Wheeler Transform (BWT). It offers O(1)-time queries in O(r) space, where r is the number of BWT runs (maximal blocks of identical consecutive characters).
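As a concrete handle on the quantity r, a minimal Python sketch that run-length encodes a BWT string; the function name is an assumption, and the BWT string is taken as already computed:

```python
from itertools import groupby

def bwt_runs(bwt: str):
    """Run-length encode a BWT string. Each run is a maximal block of
    identical consecutive characters; len(result) is r, the quantity
    that governs the move structure's O(r) space bound."""
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(bwt)]

runs = bwt_runs("AAAACCGTT")
```

For highly redundant inputs such as pangenomes, r is typically far smaller than the text length, which is why O(r)-space indexes are attractive.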

BWT construction and search at the terabase scale.

Bioinformatics

November 2024

Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States.

Motivation: The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions.

HAlign 4: a new strategy for rapidly aligning millions of sequences.

Bioinformatics

November 2024

Department of Statistics, Stanford University, Stanford, CA 94305-4065, United States.

Motivation: HAlign is a high-performance multiple sequence alignment software based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.

Fast and accurate short-read alignment with hybrid hash-tree data structure.

Genomics Inform

October 2024

K.K. Dnaform, Ask Sanshin Building 3F, 2-6-29, Tsurumi-chuo, Tsurumi-ku, Yokohama, Kanagawa, 230-0051, Japan.

The rapidly increasing amount of short-read data generated by NGS (next-generation sequencing) platforms calls for the development of fast and accurate read-alignment programs. Programs based on hash tables (e.g., BLAST) and on the Burrows-Wheeler transform (e.g., bwa-mem) are in wide use, and the latter is known to give superior performance. We here present a new algorithm, a hybrid of a hash table and a suffix tree, designed to speed up the alignment of short reads against large reference sequences such as the human genome.

The proposed Sensor Data Encryption with Compression (SEC) system is a solution for sensor data processing that integrates strong cryptography with advanced compression to improve efficiency, security, and adaptability. Its objective is a reliable framework for managing sensitive sensor data that balances these concerns. The SEC system combines a scrambled Burrows-Wheeler Transform (BWT), Move-to-Front (MTF) encoding, Run-Length Encoding (RLE), and Huffman coding to attain high compression efficiency while maintaining data integrity.
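The SEC system's scrambling and encryption steps are not detailed here, but the MTF and RLE stages of such a pipeline can be sketched in a few lines of Python; the function names are assumptions for illustration. The point of the combination is that a BWT-clustered input yields many small MTF indices, which RLE and Huffman coding then compress well.

```python
def mtf_encode(data: str, alphabet: str):
    """Move-to-Front: emit each symbol's current position in a working
    alphabet, then move that symbol to the front. Runs of identical
    symbols (as produced by a BWT) become runs of zeros."""
    table = list(alphabet)
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def rle_encode(nums):
    """Run-length encode a list of integers as (value, count) pairs."""
    runs = []
    for n in nums:
        if runs and runs[-1][0] == n:
            runs[-1][1] += 1
        else:
            runs.append([n, 1])
    return [tuple(r) for r in runs]

codes = mtf_encode("AAAACCCGT", "ACGT")
pairs = rle_encode(codes)
```

A Huffman coder would then assign short codewords to the frequent small values, completing the compression side of the pipeline.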
