Background: Metagenomics holds great promise for deepening our knowledge of key bacterially driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. To alleviate the constraints that assembly can impose on downstream analyses, and/or to increase the fraction of raw reads assembled via targeted assemblies that rely on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new "assembly-free" binning protocol.
Results: We describe a scalable multi-tiered binning algorithm that combines frequency and compositional features to cluster unassembled reads, and demonstrate i) significant runtime performance gains of the developed modules over state-of-the-art software, obtained through parallelization and the efficient use of large lock-free concurrent hash maps, ii) its relevance for clustering unassembled reads from high-complexity samples (e.g., harboring 700 distinct genomes), iii) its relevance to experimental setups involving multiple samples, through a use case consisting of the "de novo" identification of sequences from a target genome (e.g., a pathogenic strain) segregating at low levels in a cohort of 50 complex microbiomes (harboring 100 distinct genomes each), in the background of closely related strains and in the absence of reference genomes, iv) its ability to correctly identify clusters of sequences from the E. coli O104:H4 genome as the most strongly correlated with infection status in 53 microbiomes sampled from the 2011 STEC outbreak in Germany, and to accurately cluster contigs of this pathogenic strain from a cross-assembly of these 53 microbiomes.
Conclusions: We present a set of sequence clustering ("binning") modules and their application to biomarker (e.g., genomes of pathogenic organisms) discovery from large synthetic and real metagenomics datasets. The modules were initially designed for the "assembly-free" analysis of individual metagenomic samples; we demonstrate their extension to setups involving multiple samples by using the "alignment-free" d2S statistic to relate clusters across samples, and illustrate how the clustering modules can also be leveraged for de novo "pre-assembly" tasks by segregating sequences into biologically meaningful partitions.
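As an illustration of the alignment-free comparison mentioned in the conclusions, the following is a minimal sketch of a d2S-style dissimilarity between two sequences. It is not the authors' implementation: the function names, the default word length `k`, and the zero-order (independent nucleotide) background model used to center the k-mer counts are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count k-mers made only of A/C/G/T (words containing other symbols are skipped)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


def centered_counts(seq, k):
    """Observed k-mer counts minus their expectation under an assumed
    zero-order background (product of single-nucleotide frequencies).
    Assumes the sequence contains at least one A/C/G/T character."""
    seq = seq.upper()
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    n_bases = sum(seq.count(b) for b in ALPHABET)
    freq = {b: seq.count(b) / n_bases for b in ALPHABET}
    centered = {}
    for word in map("".join, product(ALPHABET, repeat=k)):
        p = 1.0
        for b in word:
            p *= freq[b]
        centered[word] = counts[word] - total * p
    return centered


def d2s_dissimilarity(x, y, k=3):
    """d2S-style dissimilarity in [0, 1]; 0 means identical centered k-mer profiles."""
    cx, cy = centered_counts(x, k), centered_counts(y, k)
    num = sx = sy = 0.0
    for word in cx:
        a, b = cx[word], cy[word]
        norm = sqrt(a * a + b * b)
        if norm == 0.0:
            continue  # word carries no signal in either sequence
        num += a * b / norm
        sx += a * a / norm
        sy += b * b / norm
    if sx == 0.0 or sy == 0.0:
        return 0.5  # no usable signal: treat the pair as uncorrelated
    return 0.5 * (1.0 - num / (sqrt(sx) * sqrt(sy)))
```

In a multi-sample setting, a measure of this kind could be computed between the pooled reads of two clusters from different samples, with small values suggesting that the clusters derive from the same or closely related genomes. The choice of `k` and of the background model affects the result and would need tuning on real data.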
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4992282 | PMC |
http://dx.doi.org/10.1186/s12859-016-1186-3 | DOI Listing |
J Comput Biol
December 2024
Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences.
BMC Genomics
November 2024
College of Life Sciences, Shaanxi Normal University, Xi'an, China.
Background: In evolutionary biology, identifying and quantifying inter-lineage genome size variation and elucidating the underlying causes of that variation have long been goals. Repetitive elements (REs) have been proposed and confirmed as being among the most important contributors to genome size variation. However, the evolutionary implications of genome size variation and RE dynamics are not well understood.
mSystems
November 2024
Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University, Düsseldorf, Germany.
BMC Genomics
January 2024
Faculty of Biology, Technische Universität Dresden, D-01069, Dresden, Germany.
Background: Despite the many cheap and fast ways to generate genomic data, accurate genome assembly is still a problem, with repeats in particular being vastly underrepresented and often misassembled. As short reads at low coverage are already sufficient to represent the repeat landscape of any given genome, many read-clustering algorithms have been proposed that provide repeat identification and classification. But how can trustworthy, reliable and representative repeat consensuses be derived from unassembled genomes?
Results: Here, we combine methods from repeat identification and genome assembly to derive these robust consensuses.
Med Biol Eng Comput
October 2023
Georgia State University, Atlanta, GA, USA.
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus. It will continue to grow geometrically for SARS-CoV-2 and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making.