Motivation: Counting the frequencies of k-mers in read libraries is often a first step in the analysis of high-throughput sequencing data. Infrequent k-mers are assumed to be a result of sequencing errors. The frequent k-mers constitute a reduced but error-free representation of the experiment, which can inform read error correction or serve as the input to de novo assembly methods. Ideally, the memory requirement for counting should be linear in the number of frequent k-mers and not in the, typically much larger, total number of k-mers in the read library.
Results: We present a novel method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human. Our method is designed to minimize cache misses by using a pattern-blocked Bloom filter to remove infrequent k-mers from consideration, in combination with a novel sort-and-compact scheme, instead of a hash table, for the actual counting. Although this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant of the method can resort to a counting Bloom filter for even larger savings in memory, at the expense of false-negative rates in addition to the false-positive rates common to all Bloom filter-based approaches. A comparison with the state of the art shows reduced memory requirements and running times.
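The two-stage idea described above (a Bloom filter that absorbs the first sighting of each k-mer, followed by sorting and compacting the remaining candidates) can be illustrated with a minimal Python sketch. This is not the Turtle implementation: it assumes a plain Bloom filter rather than a pattern-blocked one, sorts once at the end rather than compacting incrementally, ignores reverse complements, and every name in it (`BloomFilter`, `add_and_check`, `frequent_kmers`, `min_count`, `num_bits`, `num_hashes`) is hypothetical.

```python
import hashlib
from itertools import groupby


class BloomFilter:
    """Plain (not pattern-blocked) Bloom filter; a simplifying assumption."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_and_check(self, item):
        """Insert item; return True if it already appeared to be present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def frequent_kmers(reads, k, min_count=2, num_bits=1 << 20, num_hashes=4):
    """Count k-mers that occur at least min_count times (min_count >= 2)."""
    bloom = BloomFilter(num_bits, num_hashes)
    candidates = []
    for read in reads:
        for kmer in kmers(read, k):
            # The first sighting only sets bits in the filter; later
            # sightings (and rare false positives) become candidates.
            if bloom.add_and_check(kmer):
                candidates.append(kmer)

    # Sort, then compact runs of equal k-mers into (k-mer, count) pairs.
    candidates.sort()
    counts = {}
    for kmer, run in groupby(candidates):
        total = sum(1 for _ in run) + 1  # +1 for the sighting the filter absorbed
        if total >= min_count:
            counts[kmer] = total
    return counts


if __name__ == "__main__":
    print(frequent_kmers(["ACGTACGT", "ACGTAC"], k=4))
```

In this sketch a k-mer seen only once never reaches the candidate list unless the filter reports a false positive, so the memory used for counting grows with the number of repeated k-mers rather than with the total number of k-mers.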
Availability And Implementation: The tools are freely available for download at http://bioinformatics.rutgers.edu/Software/Turtle and http://figshare.com/articles/Turtle/791582.
DOI: http://dx.doi.org/10.1093/bioinformatics/btu132
Methods Mol Biol
September 2024
Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, USA.
Three-dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, topologically associating domains (TADs), and A/B compartments, play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution.
Plants (Basel)
July 2024
Department of Science in Smart Agriculture Systems, Chungnam National University, Daejeon 34134, Republic of Korea.
The Brassicaceae family is distinguished by its inclusion of high-value crops such as cabbage, broccoli, mustard, and wasabi, all noted for their glucosinolates. The family contains many polyploid species, shaped by numerous whole-genome duplications, independent genome doublings, and hybridization events. The evolutionary trajectory of the family is marked by enhanced diversification and lineage splitting after paleo- and meso-polyploidization, with discernible remnants of whole-genome duplications within their genomes.
ArXiv
March 2024
Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA.
Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution.
PLoS One
January 2024
Department of Computer Science, Université du Québec à Montréal, Montréal, Québec, Canada.
Machine learning has been shown to be effective at identifying distinctive genomic signatures among viral sequences. These signatures are defined as pervasive motifs in the viral genome that allow discrimination between species or variants. In the context of SARS-CoV-2, the identification of these signatures can assist in taxonomic and phylogenetic studies, improve the recognition and definition of emerging variants, and aid in the characterization of functional properties of polymorphic gene products.
Plant Genome
December 2023
Département de phytologie, Université Laval, Québec, QC, Canada.
Genome-wide association studies (GWAS) are powerful statistical methods that detect associations between genotype and phenotype at genome scale. Despite their power, GWAS frequently fail to pinpoint the causal variant or the gene controlling a given trait in crop species. Assessing genetic variants other than single-nucleotide polymorphisms (SNPs) could alleviate this problem.