Motivation: The de Bruijn graph is one of the fundamental data structures for the analysis of high-throughput sequencing data. To be applicable to population-scale studies, the graph must be built and stored in a space- and time-efficient manner. In addition, because population studies change continually, it has become essential to be able to update the graph after construction, e.g. to add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there has been little work on building the graph in a manner that is both efficient and mutable. Hence, most space-efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes.
Results: In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with competing methods and demonstrate that DynamicBOSS is the only method that both supports addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods either cannot be constructed on large-scale datasets (e.g. FDBG) or cannot support both addition and deletion (e.g. BiFrost).
Availability And Implementation: DynamicBOSS is publicly available at https://github.com/baharpan/dynboss.
Supplementary Information: Supplementary data are available at Bioinformatics online.
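To make the graph operations named in the abstract concrete, the following is a minimal sketch of a mutable de Bruijn graph in which nodes are (k-1)-mers and edges are k-mers. It is not DynamicBOSS and it is not succinct: it stores k-mers in a plain Python set, which is exactly the kind of memory-hungry representation that succinct structures aim to avoid. The class name `NaiveDynamicDeBruijn` and all of its methods are hypothetical and chosen only for illustration.

```python
class NaiveDynamicDeBruijn:
    """Illustrative (non-succinct) mutable de Bruijn graph.

    Assumption: edge-centric definition, where each k-mer is an edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix.
    """

    ALPHABET = "ACGT"

    def __init__(self, k):
        if k < 2:
            raise ValueError("k must be at least 2")
        self.k = k
        self.kmers = set()  # each stored k-mer is one edge

    def add_sequence(self, seq):
        """Add every k-mer of a sequence as an edge."""
        for i in range(len(seq) - self.k + 1):
            self.kmers.add(seq[i:i + self.k])

    def add_edge(self, kmer):
        """Add a single edge; no rebuild of the graph is needed."""
        if len(kmer) != self.k:
            raise ValueError("edge must be a k-mer of length k")
        self.kmers.add(kmer)

    def remove_edge(self, kmer):
        """Remove a single edge if present."""
        self.kmers.discard(kmer)

    def nodes(self):
        """Nodes are the (k-1)-mer prefixes and suffixes of stored edges."""
        return {km[:-1] for km in self.kmers} | {km[1:] for km in self.kmers}

    def successors(self, node):
        """Nodes reachable from `node` by one outgoing edge, sorted."""
        return sorted(
            node[1:] + c for c in self.ALPHABET if node + c in self.kmers
        )


# Usage: build from a read, then update in place.
g = NaiveDynamicDeBruijn(3)
g.add_sequence("ACGTA")   # edges: ACG, CGT, GTA
g.remove_edge("CGT")      # deletion without reconstruction
g.add_edge("CGA")         # addition without reconstruction
```

The set-based design keeps additions and deletions trivial at the cost of storing every k-mer explicitly; the point of a succinct dynamic structure is to offer the same operations in far less space.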
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8337006 | PMC |
http://dx.doi.org/10.1093/bioinformatics/btaa546 | DOI Listing |
Leibniz Int Proc Inform
August 2024
Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA; Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol.
Sci Rep
December 2024
Computer Science Department, Indiana University, Bloomington, IN, USA.
Pediatric diabetes I is an endemic and an especially difficult disease; indeed, at this point, there does not exist a cure, but only careful management that relies on anticipating hypoglycemia. The changing physiology of children producing unique blood glucose signatures, coupled with inconsistent activities, e.g.
Commun Biol
December 2024
College of Biology, Hunan University, Changsha, China.
Error self-correction is crucial for analyzing long-read sequencing data, but existing methods often struggle with noisy data or are tailored to technologies like PacBio HiFi. There is a gap in methods optimized for Nanopore R10 simplex reads, which typically have error rates below 2%. We introduce DeChat, a novel approach designed specifically for these reads.
Microorganisms
October 2024
Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS B3H 3C3, Canada.
Antimicrobial resistance (AMR) is an escalating global health threat, often driven by the horizontal gene transfer (HGT) of resistance genes. Detecting AMR genes and understanding their genomic context within bacterial populations is crucial for mitigating the spread of resistance. In this study, we evaluate the performance of three sequence alignment tools (Bandage, SPAligner, and GraphAligner) in identifying AMR gene sequences from assembly and de Bruijn graphs, which are commonly used in microbial genome assembly.
Brief Bioinform
November 2024
College of Science, Dalian Jiaotong University, 794 Huanghe Road, Dalian 116028, China.
Noncoding RNA refers to RNA that does not encode proteins. The lncRNA and miRNA it contains play crucial regulatory roles in organisms, and their aberrant expression is closely related to various diseases. Traditional experimental methods for validating the interactions of these RNAs have limitations, and existing prediction models exhibit relatively limited functionality, relying on isolated feature extraction and performing poorly in handling various types of small sample tasks.