Motivation: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across multiple whole-genome sequencing samples.

Results: We have built a compact, efficiently indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.

Availability And Implementation: Public access to the Terabase Search Engine database is available at http://tse.idies.jhu.edu.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6379032
DOI: http://dx.doi.org/10.1093/bioinformatics/bty657

Similar Publications

BWT construction and search at the terabase scale.

Bioinformatics

November 2024

Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States.

Motivation: The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions.

Macrolides are broad-spectrum antibiotics used to treat a range of infections. Resistance to macrolides is often conferred by mobile resistance genes encoding Erm methyltransferases or Mph phosphotransferases. New erm and mph genes keep being discovered in clinical settings, but their origins remain unknown, as does the type of macrolide resistance genes that will appear in the future.

Background: Metallo-β-lactamases (MBLs) are enzymes that use zinc-dependent hydrolysis to confer resistance to almost all available β-lactam antibiotics. They are hypothesized to originate from commensal and environmental bacteria, from where some have mobilized and transferred horizontally to pathogens. The current phylogeny of MBLs, however, is biased as it is founded largely on genes encountered in pathogenic bacteria.

Large-scale 16S gene assembly using metagenomics shotgun sequences.

Bioinformatics

May 2017

Bioinformatics Division, TNLIST, and Department of Computer Science and Technology, Tsinghua University, Beijing, China.

Motivation: Combining a 16S rRNA (16S) gene database with metagenomic shotgun sequences promises unbiased identification of known and novel microbes.

Results: To achieve this, we herein report reference-based ribosome assembly (RAMBL), a computational pipeline, which integrates taxonomic tree search and Dirichlet process clustering to reconstruct full-length 16S gene sequences from metagenomic sequencing data with high accuracy. By benchmarking against the synthetic and real shotgun sequences, we demonstrated that full-length 16S gene assemblies of RAMBL were a good proxy for known and putative microbes, including Candidate Phyla Radiation.
