A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes.

BMC Bioinformatics

Commissariat à l'Energie Atomique et aux Energies Alternatives, Direction de la Recherche Fondamentale, Institut de Génomique, Genoscope, Evry, Essonne, 91057, France.

Published: August 2016

Background: Metagenomics holds great promise for deepening our knowledge of key bacteria-driven processes, but metagenome assembly remains problematic: it typically introduces representation biases and discards significant amounts of non-redundant sequence information. To alleviate the constraints that assembly can impose on downstream analyses, and/or to increase the fraction of raw reads assembled via targeted assemblies that rely on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new "assembly-free" binning protocol.

Results: We describe a scalable multi-tiered binning algorithm that combines frequency and compositional features to cluster unassembled reads. We demonstrate (i) significant runtime gains of the developed modules over state-of-the-art software, obtained through parallelization and the efficient use of large lock-free concurrent hash maps; (ii) its relevance for clustering unassembled reads from high-complexity samples (e.g., harboring 700 distinct genomes); (iii) its relevance to experimental setups involving multiple samples, through a use case consisting of the "de novo" identification of sequences from a target genome (e.g., a pathogenic strain) segregating at low levels in a cohort of 50 complex microbiomes (harboring 100 distinct genomes each), against a background of closely related strains and in the absence of reference genomes; and (iv) its ability to correctly identify clusters of sequences from the E. coli O104:H4 genome as the most strongly correlated with infection status in 53 microbiomes sampled from the 2011 STEC outbreak in Germany, and to accurately cluster contigs of this pathogenic strain from a cross-assembly of these 53 microbiomes.

Conclusions: We present a set of sequence clustering ("binning") modules and their application to the discovery of biomarkers (e.g., genomes of pathogenic organisms) from large synthetic and real metagenomics datasets. Although the modules were initially designed for the "assembly-free" analysis of individual metagenomic samples, we demonstrate their extension to setups involving multiple samples by using the "alignment-free" d2S statistic to relate clusters across samples. We also illustrate how the clustering modules can be leveraged for de novo "pre-assembly" tasks by segregating sequences into biologically meaningful partitions.
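The abstract names the alignment-free d2S statistic without defining it. As a rough illustration only, the sketch below computes a d2S-style dissimilarity between two sequences from k-mer counts centered on their expected values. It assumes the commonly published form of the statistic and an i.i.d. nucleotide background model estimated from each sequence; the function names, the default k, and the background model are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    return counts


def centered_counts(seq, k):
    """Subtract from each k-mer count its expectation under an i.i.d. base model.

    Assumption: base frequencies are estimated from the sequence itself.
    """
    seq = seq.upper()
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    base_totals = {b: seq.count(b) for b in ALPHABET}
    n_bases = sum(base_totals.values())
    centered = {}
    for kmer in map("".join, product(ALPHABET, repeat=k)):
        prob = 1.0
        for base in kmer:
            prob *= base_totals[base] / n_bases if n_bases else 0.0
        centered[kmer] = counts[kmer] - total * prob
    return centered


def d2s_dissimilarity(seq_a, seq_b, k=3):
    """Return a d2S-style dissimilarity in [0, 1]; 0 means identical k-mer profiles."""
    x = centered_counts(seq_a, k)
    y = centered_counts(seq_b, k)
    num = norm_x = norm_y = 0.0
    for kmer in x:
        denom = sqrt(x[kmer] ** 2 + y[kmer] ** 2)
        if denom == 0.0:
            continue  # k-mer carries no signal in either sequence
        num += x[kmer] * y[kmer] / denom
        norm_x += x[kmer] ** 2 / denom
        norm_y += y[kmer] ** 2 / denom
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # no usable signal: treat as uncorrelated
    return 0.5 * (1.0 - num / (sqrt(norm_x) * sqrt(norm_y)))
```

In a multi-sample setting, one could concatenate or pool the reads of each cluster and compare clusters pairwise with `d2s_dissimilarity`, linking those whose dissimilarity falls below a chosen threshold; the threshold and pooling strategy would need to be tuned and are not specified in the abstract.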


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4992282
DOI: http://dx.doi.org/10.1186/s12859-016-1186-3


