Motivation: Several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, are routinely updated with new data. To analyze such datasets, memory-efficient methods for constructing and storing the colored de Bruijn graph have been developed. Yet a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem matters for large public datasets, which need both scalable construction and the ability to update the graph as new data arrive.
Results: We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph for each partition using an FM-index-based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge that we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show that it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to competing methods (Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT) and illustrate that VariMerge is the only method capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to a dataset of this size or do not allow additions without reconstruction.
Availability and Implementation: VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under the GPLv3 license.
Supplementary Information: Supplementary data are available at Bioinformatics online.
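The partition, build, and merge workflow described in the Results can be illustrated with a minimal Python sketch. This is not the VariMerge implementation: it replaces the succinct FM-index-based representation with plain dictionaries, and the function names `build_colored_dbg` and `merge_colored_dbgs`, the `first_color` parameter, and the toy sequences are all hypothetical. It only shows the idea that graphs built on separate partitions can be merged into one, and that new data can be added by merging rather than rebuilding.

```python
from collections import defaultdict


def build_colored_dbg(sequences, k, first_color=0):
    """Map each k-mer to the set of colors (sample ids) that contain it.

    Assumption: one color per input sequence, numbered from first_color.
    A plain dict stands in for the succinct representation in the paper.
    """
    graph = defaultdict(set)
    for color, seq in enumerate(sequences, start=first_color):
        for i in range(len(seq) - k + 1):
            graph[seq[i:i + k]].add(color)
    return dict(graph)


def merge_colored_dbgs(left, right):
    """Merge two colored graphs: union the k-mers and their color sets."""
    merged = {kmer: set(colors) for kmer, colors in left.items()}
    for kmer, colors in right.items():
        merged.setdefault(kmer, set()).update(colors)
    return merged


if __name__ == "__main__":
    k = 3
    samples = ["ACGTAC", "CGTACG", "TTACGT"]  # toy data, not from the paper

    # Build one graph per partition, keeping color ids globally unique.
    part_a = build_colored_dbg(samples[:2], k, first_color=0)
    part_b = build_colored_dbg(samples[2:], k, first_color=2)
    merged = merge_colored_dbgs(part_a, part_b)

    # Update with a new sample by merging, without rebuilding the rest.
    update = build_colored_dbg(["GTACGA"], k, first_color=3)
    merged = merge_colored_dbgs(merged, update)

    for kmer in sorted(merged):
        print(kmer, sorted(merged[kmer]))
```

In this sketch the merged result is the same as a graph built from all samples at once. The dictionary merge is trivial here; per the abstract, the difficulty VariMerge addresses is performing the merge succinctly, which this example does not attempt.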
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612864 | PMC |
http://dx.doi.org/10.1093/bioinformatics/btz350 | DOI Listing |
NAR Genom Bioinform
September 2024
Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands.
The recent growth of microbial sequence data allows comparisons at unprecedented scales, enabling the tracking of strains, mobile genetic elements, or genes. Querying a genome against a large reference database can easily yield thousands of matches that are tedious to interpret and pose computational challenges. We developed Graphite, which uses a colored de Bruijn graph (cDBG) to paint query genomes, selecting the local best matches along the full query length.
J Comput Biol
October 2024
Department of Computer Science, University of Maryland, College Park, Maryland, USA.
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer.
Compr Rev Food Sci Food Saf
September 2024
Institute of Food Technology and Food Chemistry, Department of Food Chemistry and Analysis, Technische Universität Berlin, Berlin, Germany.
Although noncovalent interactions and covalent reactions between phenolic compounds and proteins have been investigated across diverse scientific disciplines, a comprehensive understanding and identification of their products remain elusive. This review will initially outline the chemical framework and, subsequently, delve into unresolved or debated chemical and functional food-related implications, as well as forthcoming challenges in this topic. The primary objective is to elucidate the multiple aspects of protein-phenolic interactions and reactions, along with the underlying overwhelming dynamics and possibilities of follow-up reactions and potential crosslinking between proteins and phenolic compounds.
Genome Res
August 2024
Department of Biostatistics, University of Oslo, 0372 Blindern, Norway.
Studies of bacterial adaptation and evolution are hampered by the difficulty of measuring traits such as virulence, drug resistance, and transmissibility in large populations. In contrast, it is now feasible to obtain high-quality complete assemblies of many bacterial genomes thanks to scalable high-accuracy long-read sequencing technologies. To exploit this opportunity, we introduce a phenotype- and alignment-free method for discovering coselected and epistatically interacting genomic variation from genome assemblies covering both core and accessory parts of genomes.