Fast search of thousands of short-read sequencing experiments.

Nat Biotechnol

Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.

Published: March 2016

The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.
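The abstract does not spell out how an SBT is organised, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binary tree in which each leaf holds a Bloom filter of the k-mers of one experiment, each internal node holds the union of its children's filters, and a query descends only into subtrees whose filter contains at least a fraction `theta` of the query's k-mers. The names `K`, `M`, `NUM_HASHES`, `leaf`, `merge` and `query`, the toy reads, and all parameter values are hypothetical and chosen for illustration.

```python
import hashlib

K = 5            # k-mer length (assumed toy value; real tools use larger k)
M = 1 << 16      # bits per Bloom filter (assumed)
NUM_HASHES = 2   # hash functions per filter (assumed)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer):
    """Derive NUM_HASHES bit positions for a k-mer from salted SHA-256."""
    out = []
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % M)
    return out


class Node:
    """Tree node; `bits` is a Python int used as a bit vector."""

    def __init__(self, bits, name=None, left=None, right=None):
        self.bits, self.name, self.left, self.right = bits, name, left, right


def leaf(name, reads):
    """Build a leaf whose filter contains every k-mer of one experiment."""
    bits = 0
    for read in reads:
        for km in kmers(read):
            for p in positions(km):
                bits |= 1 << p
    return Node(bits, name=name)


def merge(a, b):
    """Build an internal node whose filter is the union of its children."""
    return Node(a.bits | b.bits, left=a, right=b)


def contains(node, kmer):
    """Bloom membership test: may give false positives, never false negatives."""
    return all((node.bits >> p) & 1 for p in positions(kmer))


def query(node, seq, theta=0.8):
    """Return names of experiments whose filters hold >= theta of seq's k-mers."""
    qk = kmers(seq)
    if not qk:
        return []
    hits = sum(contains(node, km) for km in qk)
    if hits < theta * len(qk):
        return []          # prune: nothing below this node can match
    if node.name is not None:
        return [node.name]
    return query(node.left, seq, theta) + query(node.right, seq, theta)


# Toy usage with made-up reads for three hypothetical experiments.
exp_a = leaf("expA", ["ACGTACGTTAGCCTAGGATC"])
exp_b = leaf("expB", ["TTTTGGGGCCCCAAAATTTT"])
exp_c = leaf("expC", ["GATTACAGATTACAGATTAC"])
root = merge(merge(exp_a, exp_b), exp_c)
matches = query(root, "GTTAGCCTAGG")  # substring of the expA read
```

Because a Bloom filter never reports a stored k-mer as absent, `matches` always includes `expA` here; other experiments could appear only through false positives. Pruning whole subtrees at internal nodes is what lets a search of this kind avoid inspecting every experiment individually.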

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353
DOI: http://dx.doi.org/10.1038/nbt.3442

Similar Publications

Tandem repeats are a highly polymorphic class of genomic variation that play causal roles in rare diseases but are notoriously difficult to sequence using short-read techniques. Most previous studies profiling tandem repeats genome-wide have reduced the description of each locus to the singular value of the length of the entire repetitive locus. Here we introduce a comprehensive database of 3.

Unveiling Tissue-Specific RNA Landscapes in Mouse Organs During Fasting and Feeding Using Nanopore Direct RNA Sequencing.

Adv Sci (Weinh)

December 2024

Cardiovascular Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, 20892, USA.

Understanding tissue-specific RNA landscapes is essential for uncovering the functional mechanisms of key organs in mammals. However, current knowledge remains limited, as short-read RNA sequencing, the predominant method for assessing gene expression, depends on incomplete gene annotations and struggles to resolve the diverse transcripts produced by genes. To address these limitations, an integrative approach combining nanopore direct RNA sequencing (DRS), ATAC-Seq, and short-read RNA-seq is used.

The naked mole-rat (NMR) is a eusocial subterranean rodent with a highly unusual set of physiological traits that has attracted great interest amongst the scientific community. However, the genetic basis of most of these traits has not been elucidated. To facilitate our understanding of the molecular mechanisms underlying NMR physiology and behaviour, we generated a long-read chromosomal-level genome assembly of the NMR.

Screening great ape museum specimens for DNA viruses.

Sci Rep

November 2024

Department of Evolutionary Anthropology, University of Vienna, Djerassiplatz 1, 1030, Vienna, Austria.

Natural history museum collections harbour a record of wild species from the past centuries, providing a unique opportunity to study animals as well as their infectious agents. Thousands of great ape specimens are kept in these collections, and could become an important resource for studying the evolution of DNA viruses. Their genetic material is likely to be preserved in dry museum specimens, as reported previously for monkeypox virus genomes from historical orangutan specimens.

The common bed bug, Cimex lectularius, is a globally distributed pest insect of medical, veterinary, and economic importance. Previous reference genome assemblies for this species were generated from short-read sequencing data, resulting in a ~650 Mb assembly composed of thousands of contigs. Here, we present a haplotype-resolved, chromosome-level reference genome, generated from an adult Harlen strain female specimen.