Motivation: DNA sequencing archives have grown to enormous scales in recent years, and thousands of human genomes have already been sequenced. The size of these data sets has made searching the raw read data infeasible without high-performance data-query technology. Additionally, it is challenging to search a repository of short-read data using relational logic and to apply that logic across multiple whole-genome sequencing samples.
Results: We have built a compact, efficiently indexed database that contains the raw read data for over 250 human genomes, encompassing trillions of bases of DNA, and that allows users to search these data in real time. The Terabase Search Engine enables retrieval from this database of all the reads for any genomic location in a matter of seconds. Users can search using a range of positions or a specific sequence that is aligned to the genome on the fly.
Availability and Implementation: The Terabase Search Engine database is publicly accessible at http://tse.idies.jhu.edu.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Link | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6379032 | PMC |
http://dx.doi.org/10.1093/bioinformatics/bty657 | DOI |
Bioinformatics
November 2024
Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, United States.
Motivation: The Burrows-Wheeler Transform (BWT) is a common component of full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices report only exact matches or simple extensions of them.
Microb Genom
January 2022
Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden.
Macrolides are broad-spectrum antibiotics used to treat a range of infections. Resistance to macrolides is often conferred by mobile resistance genes encoding Erm methyltransferases or Mph phosphotransferases. New erm and mph genes keep being discovered in clinical settings, but their origins remain unknown, as does the type of macrolide resistance genes that will appear in the future.
J Antimicrob Chemother
January 2021
Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden.
Background: Metallo-β-lactamases (MBLs) are enzymes that use zinc-dependent hydrolysis to confer resistance to almost all available β-lactam antibiotics. They are hypothesized to originate from commensal and environmental bacteria, from where some have mobilized and transferred horizontally to pathogens. The current phylogeny of MBLs, however, is biased as it is founded largely on genes encountered in pathogenic bacteria.
Bioinformatics
February 2019
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA.
Bioinformatics
May 2017
Bioinformatics Division, TNLIST, and Department of Computer Science and Technology, Tsinghua University, Beijing, China.
Motivation: Combining a 16S rRNA (16S) gene database with metagenomic shotgun sequences promises unbiased identification of known and novel microbes.
Results: To achieve this, we herein report reference-based ribosome assembly (RAMBL), a computational pipeline, which integrates taxonomic tree search and Dirichlet process clustering to reconstruct full-length 16S gene sequences from metagenomic sequencing data with high accuracy. By benchmarking against the synthetic and real shotgun sequences, we demonstrated that full-length 16S gene assemblies of RAMBL were a good proxy for known and putative microbes, including Candidate Phyla Radiation.