An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, Rob Patro

J Comput Biol

Department of Computer Science, University of Maryland, College Park, Maryland.

Published: April 2020

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure in numerous areas of genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent the data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into the thousands. We apply this encoding in two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved >11× better compression than Raman, Raman, and Rao (RRR).
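The idea of encoding each color class relative to a nearby one can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each color class is a bit vector stored as a Python integer, and that `candidate_edges` (a hypothetical input) lists pairs of color classes that might be close, standing in for pairs found on adjacent k-mers in the dbg. The greedy parent choice in `choose_parents` is a simplification picked for brevity; all function names are illustrative.

```python
from typing import Dict, List, Optional, Tuple


def popcount(x: int) -> int:
    """Number of set bits, i.e. the cost of storing x as a list of positions."""
    return bin(x).count("1")


def choose_parents(classes: List[int],
                   candidate_edges: List[Tuple[int, int]]) -> List[Optional[int]]:
    """Greedy simplification: each class picks its closest lower-indexed
    candidate neighbor, or no parent if the all-zero vector is at least as close."""
    neighbors: Dict[int, List[int]] = {i: [] for i in range(len(classes))}
    for u, v in candidate_edges:
        lo, hi = min(u, v), max(u, v)
        if lo != hi:
            neighbors[hi].append(lo)  # parents always have a smaller index, so no cycles
    parents: List[Optional[int]] = []
    for i, vec in enumerate(classes):
        best, best_cost = None, popcount(vec)
        for j in neighbors[i]:
            cost = popcount(vec ^ classes[j])  # Hamming distance between the two classes
            if cost < best_cost:
                best, best_cost = j, cost
        parents.append(best)
    return parents


def encode(classes: List[int], parents: List[Optional[int]]) -> List[List[int]]:
    """Store each class as the bit positions where it differs from its parent."""
    deltas = []
    for vec, p in zip(classes, parents):
        diff = vec ^ (classes[p] if p is not None else 0)
        deltas.append([b for b in range(diff.bit_length()) if (diff >> b) & 1])
    return deltas


def decode(i: Optional[int], parents: List[Optional[int]],
           deltas: List[List[int]]) -> int:
    """Rebuild class i exactly by XOR-ing the deltas on the path to the root."""
    vec = 0
    while i is not None:
        for b in deltas[i]:
            vec ^= 1 << b
        i = parents[i]
    return vec


# Four toy color classes over 8 samples; classes 0-2 are highly correlated.
classes = [0b11110000, 0b11110001, 0b11110011, 0b00001100]
candidate_edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical neighbor pairs
parents = choose_parents(classes, candidate_edges)
deltas = encode(classes, parents)
raw_cost = sum(popcount(c) for c in classes)
delta_cost = sum(len(d) for d in deltas)
```

In this toy input the correlated classes are stored as single-bit differences from their parents, while the unrelated class falls back to a direct encoding, and `decode` recovers every class exactly.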

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185321	PMC
http://dx.doi.org/10.1089/cmb.2019.0322	DOI Listing

Similar Publications

Machine learning reveals the dynamic importance of accessory sequences for outbreak clustering.

mBio

January 2025

Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada.

Chao Chun Liu, William W L Hsiao

Bacterial typing at whole-genome scales is now feasible owing to decreasing costs in high-throughput sequencing and the recent advances in computation. The unprecedented resolution of whole-genome typing is achieved by genotyping the variable segments of bacterial genomes that can fluctuate significantly in gene content. However, due to the transient and hypervariable nature of many accessory elements, the value of the added resolution in outbreak investigations remains disputed.

Fast and flexible minimizer digestion with digest.

bioRxiv

January 2025

Department of Computer Science, Johns Hopkins University, 3400 N Charles St, 21218, Maryland, USA.

Alan Zheng, Ishmeal Lee, Vikram S Shivakumar, Omar Y Ahmed, Ben Langmead

Minimizer digestion is an increasingly common component of bioinformatics tools, including tools for De Bruijn-Graph assembly and sequence classification. We describe a new open source tool and library to facilitate efficient digestion of genomic sequences. It can produce digests based on the related ideas of minimizers, modimizers or syncmers.

Anchorage Accurately Assembles Anchor-Flanked Synthetic Long Reads.

Leibniz Int Proc Inform

August 2024

Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA; Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.

Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol.

A reinforcement learning approach to effective forecasting of pediatric hypoglycemia in diabetes I patients using an extended de Bruijn graph.

Sci Rep

December 2024

Computer Science Department, Indiana University, Bloomington, IN, USA.

Mert Onur Cakiroglu, Hasan Kurban, Lilia Aljihmani, Khalid Qaraqe, Goran Petrovski

Pediatric diabetes I is an endemic and an especially difficult disease; indeed, at this point, there does not exist a cure, but only careful management that relies on anticipating hypoglycemia. The changing physiology of children producing unique blood glucose signatures, coupled with inconsistent activities, e.g.

Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat.

Commun Biol

December 2024

College of Biology, Hunan University, Changsha, China.

Yuansheng Liu, Yichen Li, Enlian Chen, Jialu Xu, Wenhai Zhang

Error self-correction is crucial for analyzing long-read sequencing data, but existing methods often struggle with noisy data or are tailored to technologies like PacBio HiFi. There is a gap in methods optimized for Nanopore R10 simplex reads, which typically have error rates below 2%. We introduce DeChat, a novel approach designed specifically for these reads.
