Binning unassembled short reads based on k-mer abundance covariance using sparse coding.

Gigascience

Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Université Paris-Saclay, 2 rue Gaston Crémieux, 91057 Evry, France.

Published: April 2020

Background: Sequence-binning techniques enable the recovery of an increasing number of genomes from complex microbial metagenomes and typically require prior metagenome assembly, incurring the computational cost and drawbacks of the latter, e.g., biases against low-abundance genomes and inability to conveniently assemble multi-terabyte datasets.

Results: We present here a scalable pre-assembly binning scheme (i.e., operating on unassembled short reads) enabling latent genome recovery by leveraging sparse dictionary learning and elastic-net regularization, and its use to recover hundreds of metagenome-assembled genomes, including very low-abundance genomes, from a joint analysis of microbiomes from the LifeLines DEEP population cohort (n = 1,135, >10^10 reads).

Conclusion: We showed that sparse coding techniques can be leveraged to carry out read-level binning at large scale and that, despite lower genome reconstruction yields compared to assembly-based approaches, bin-first strategies can complement the more widely used assembly-first protocols by targeting distinct genome segregation profiles. Read enrichment levels across 6 orders of magnitude in relative abundance were observed, indicating that the method has the power to recover genomes consistently segregating at low levels.
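To make the sparse coding idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed, already-known dictionary of abundance profiles (the published method learns the dictionary; that step is omitted here) and solves an elastic-net-regularized coding problem for a single k-mer by coordinate descent. The function names, penalty values, and toy data are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the L1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_code(signal, atoms, l1=0.1, l2=0.1, n_iter=200):
    """Coordinate descent for the elastic-net coding problem

        min_a 0.5*||signal - sum_j a[j]*atoms[j]||^2 + l1*||a||_1 + 0.5*l2*||a||^2

    signal: abundance of one k-mer across samples (list of floats)
    atoms:  abundance profiles, one per latent genome (lists of the same length)
    """
    codes = [0.0] * len(atoms)
    residual = list(signal)  # signal minus current reconstruction
    for _ in range(n_iter):
        for j, atom in enumerate(atoms):
            norm_sq = sum(v * v for v in atom)
            # Correlation of atom j with the residual that excludes atom j itself
            rho = sum(d * r for d, r in zip(atom, residual)) + norm_sq * codes[j]
            new_code = soft_threshold(rho, l1) / (norm_sq + l2)
            delta = new_code - codes[j]
            if delta != 0.0:
                residual = [r - delta * d for r, d in zip(residual, atom)]
                codes[j] = new_code
    return codes


# Toy data (hypothetical): two genome abundance profiles across four samples
atoms = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 3.0]]
# A k-mer whose abundance covaries with the first profile only
kmer_abundance = [2.0, 0.0, 4.0, 0.0]

codes = elastic_net_code(kmer_abundance, atoms)
# Assign the k-mer to the profile with the largest coefficient magnitude
assigned_bin = max(range(len(codes)), key=lambda j: abs(codes[j]))
```

In this toy case the L1 term drives the coefficient of the unrelated profile to exactly zero, while the L2 term slightly shrinks the remaining coefficient, so the k-mer is assigned to the first profile.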

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7099633
DOI: http://dx.doi.org/10.1093/gigascience/giaa028

Similar Publications

Skmer approach improves species discrimination in taxonomically problematic genus (Theaceae).

Plant Divers

November 2024

CAS Key Laboratory for Plant Diversity and Biogeography of East Asia, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming 650201, Yunnan, PR China.

Genome skimming has dramatically extended DNA barcoding from short DNA fragments to next generation barcodes in plants. However, conserved DNA barcoding markers, including complete plastid genome and nuclear ribosomal DNA (nrDNA) sequences, are inadequate for accurate species identification. Skmer, a recently proposed approach that estimates genetic distances among species based on unassembled genome skims, has been proposed to effectively improve species discrimination rate.

The naked mole-rat (NMR; ) is a eusocial subterranean rodent with a highly unusual set of physiological traits that has attracted great interest amongst the scientific community. However, the genetic basis of most of these traits has not been elucidated. To facilitate our understanding of the molecular mechanisms underlying NMR physiology and behaviour, we generated a long-read chromosomal-level genome assembly of the NMR.

Background: In evolutionary biology, identifying and quantifying inter-lineage genome size variation and elucidating the underlying causes of that variation have long been goals. Repetitive elements (REs) have been proposed and confirmed as being among the most important contributors to genome size variation. However, the evolutionary implications of genome size variation and RE dynamics are not well understood.

Article Synopsis
  • NanoCore is a user-friendly tool designed for genomic surveillance using Nanopore sequencing, enabling quick analysis of pathogen transmission in healthcare settings.
  • It calculates and visualizes core-genome multilocus sequence typing distances directly from Nanopore reads and is compatible with Illumina data.
  • NanoCore demonstrates efficiency and accuracy in comparing bacterial strains, providing results similar to established methods, and can be installed easily as free software.

Background: Despite the many cheap and fast ways to generate genomic data, good and exact genome assembly is still a problem, with especially the repeats being vastly underrepresented and often misassembled. As short reads in low coverage are already sufficient to represent the repeat landscape of any given genome, many read cluster algorithms were brought forward that provide repeat identification and classification. But how can trustworthy, reliable and representative repeat consensuses be derived from unassembled genomes?

Results: Here, we combine methods from repeat identification and genome assembly to derive these robust consensuses.
