Building large updatable colored de Bruijn graphs via merging.

Martin D Muggli, Bahar Alipanahi, Christina Boucher

Bioinformatics

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA.

Published: July 2019

Motivation: There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets, which require both scalability and the ability to update the construction.

Results: We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph for each using an FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to competing methods (Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT) and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to a dataset of this size or do not allow for additions without reconstruction.
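The partition, build, and merge idea can be illustrated with a minimal Python sketch. It is not the VariMerge algorithm: VariMerge merges succinct FM-index based representations, whereas this sketch uses plain dictionaries mapping each k-mer to its set of colors (sample indices), and it only shows what a merged graph should contain. The function names `build_colored_dbg` and `merge_colored_dbg`, the `color_offset` parameter, and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def build_colored_dbg(samples, k, color_offset=0):
    """Map each k-mer to the set of colors (sample indices) that contain it.

    `samples` is a list of samples; each sample is a list of sequences.
    Colors are numbered from `color_offset` so batches can be merged later.
    """
    graph = defaultdict(set)
    for color, sequences in enumerate(samples, start=color_offset):
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                graph[seq[i:i + k]].add(color)
    return dict(graph)


def merge_colored_dbg(left, right):
    """Merge two graphs: union the k-mers and union the colors of shared k-mers.

    Assumes the two graphs use the same k and disjoint color ranges.
    """
    merged = {kmer: set(colors) for kmer, colors in left.items()}
    for kmer, colors in right.items():
        merged.setdefault(kmer, set()).update(colors)
    return merged


# Hypothetical toy data: an existing batch of two samples and one new sample.
k = 4
batch_a = [["ACGTAC"], ["CGTACG"]]  # colors 0 and 1
batch_b = [["GTACGT"]]              # color 2, added later

graph_a = build_colored_dbg(batch_a, k)
graph_b = build_colored_dbg(batch_b, k, color_offset=len(batch_a))
merged = merge_colored_dbg(graph_a, graph_b)

# Merging the partial graphs gives the same result as building from scratch.
full = build_colored_dbg(batch_a + batch_b, k)
print(merged == full)
print(sorted(merged["GTAC"]))  # [0, 1, 2]
```

In this toy setting, adding new data means building a small graph for the new batch and merging it into the existing one, rather than reprocessing every sample. The dictionary representation is far larger than a succinct one, so it conveys the semantics of the merge only, not its space behavior.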

Availability And Implementation: VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under GPLv3 license.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Full text:
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612864
DOI: http://dx.doi.org/10.1093/bioinformatics/btz350

Similar Publications

Graphite: painting genomes using a colored de Bruijn graph.

NAR Genom Bioinform

September 2024

Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands.

Rick Beeloo, Aldert L Zomer, Sebastian Deorowicz, Bas E Dutilh

The recent growth of microbial sequence data allows comparisons at unprecedented scales, enabling the tracking of strains, mobile genetic elements, or genes. Querying a genome against a large reference database can easily yield thousands of matches that are tedious to interpret and pose computational challenges. We developed Graphite, which uses a colored de Bruijn graph (cDBG) to paint query genomes, selecting the local best matches along the full query length.

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

J Comput Biol

October 2024

Department of Computer Science, University of Maryland, College Park, Maryland, USA.

Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro

We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer.

Protein-phenolic interactions and reactions: Discrepancies, challenges, and opportunities.

Compr Rev Food Sci Food Saf

September 2024

Institute of Food Technology and Food Chemistry, Department of Food Chemistry and Analysis, Technische Universität Berlin, Berlin, Germany.

Helena Kieserling, Wouter J C de Bruijn, Julia Keppler, Jack Yang, Sorel Tchewonpi Sagu

Although noncovalent interactions and covalent reactions between phenolic compounds and proteins have been investigated across diverse scientific disciplines, a comprehensive understanding and identification of their products remain elusive. This review will initially outline the chemical framework and, subsequently, delve into unresolved or debated chemical and functional food-related implications, as well as forthcoming challenges in this topic. The primary objective is to elucidate the multiple aspects of protein-phenolic interactions and reactions, along with the underlying overwhelming dynamics and possibilities of follow-up reactions and potential crosslinking between proteins and phenolic compounds.

Pangenome-spanning epistasis and coselection analysis via de Bruijn graphs.

Genome Res

August 2024

Department of Biostatistics, University of Oslo, 0372 Blindern, Norway.

Juri Kuronen, Samuel T Horsfield, Anna K Pöntinen, Sudaraka Mallawaarachchi, Sergio Arredondo-Alonso

Studies of bacterial adaptation and evolution are hampered by the difficulty of measuring traits such as virulence, drug resistance, and transmissibility in large populations. In contrast, it is now feasible to obtain high-quality complete assemblies of many bacterial genomes thanks to scalable high-accuracy long-read sequencing technologies. To exploit this opportunity, we introduce a phenotype- and alignment-free method for discovering coselected and epistatically interacting genomic variation from genome assemblies covering both core and accessory parts of genomes.

Where the patterns are: repetition-aware compression for colored de Bruijn graphs.

bioRxiv

July 2024

Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro

We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer.
