A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the massive amounts of metagenomic data, that is, all DNA present in an environmental sample. A major obstacle in metagenomics is the difficulty of classifying accurately the short reads that current sequencing technology yields. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach is scalable to identify any fragment from hundreds of genomes. It also performs quite well at the strain, species, and genus levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis demonstrates that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
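The idea of scoring a fragment against per-genome N-mer frequency profiles can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the add-one smoothing, the uniform prior over genomes, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def nmer_counts(seq, n):
    """Count overlapping N-mers, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


def train(genomes, n):
    """Build one log-probability N-mer profile per genome.

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    N-mers from producing log(0).
    """
    vocab = ["".join(p) for p in product(ALPHABET, repeat=n)]
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        total = sum(counts.values()) + len(vocab)
        profiles[name] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return profiles


def classify(fragment, profiles, n):
    """Return the best-scoring genome and all log-likelihood scores.

    A uniform prior over genomes is assumed, so the prior term is omitted.
    """
    frag_counts = nmer_counts(fragment, n)
    scores = {
        name: sum(c * logp[w] for w, c in frag_counts.items())
        for name, logp in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Toy "genomes" (hypothetical; real profiles would come from full sequences).
toy_genomes = {
    "genome_A": "ATATATATATATATATATAT",
    "genome_B": "GCGCGCGCGCGCGCGCGCGC",
}
toy_profiles = train(toy_genomes, n=3)
best, scores = classify("ATATATA", toy_profiles, n=3)
print(best, scores)
```

Because each genome receives its own real-valued log-likelihood, exact ties between top candidates are unlikely in practice, which is the kind of property the abstract contrasts with BLAST's tied top scores.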
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777009 | PMC |
http://dx.doi.org/10.1155/2008/205969 | DOI Listing |
Bioinform Adv
July 2023
Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333-Quitandinha, 25651-076, Rio de Janeiro, Brazil.
Summary: The (, )-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared -mer and (, )-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (, )-mer frequency features are related to the highest performance metrics and often statistically outperformed the -mers.
J Comput Biol
October 2014
1 Department of Epidemiology & Biostatistics, Drexel University, Philadelphia, Pennsylvania.
Characterizing the empirical distribution of the frequency of n-mers is a vital step in understanding the entire genome. This allows researchers to examine how complex the genome really is, and to move beyond simple, traditional modeling frameworks that are often biased in the presence of abundant and/or extremely rare words. We hypothesize that models based on the negative binomial distribution and its zero-inflated counterpart will characterize the n-mer distributions of genomes better than the Poisson.
J Phys Condens Matter
August 2011
Instituto de Física, Universidade Federal de Alagoas, Maceió-AL 57072-970, Brazil.
In this paper we study the propagation of acoustic waves in a one-dimensional medium with a short range correlated elasticity distribution. In order to generate local correlations we consider a disordered binary distribution in which the effective elastic constants can take on only two values, η(A) and η(B). We add an additional constraint that the η(A) values appear only in finite segments of length n.
J Bioinform Comput Biol
December 2010
School of Electrical and Computer Engineering, Georgia Institute of Technology, 210 Technology Circle, Savannah, GA 31407, USA.
Metagenomics is an emerging field in which the power of genomic analysis is applied to an entire microbial community, bypassing the need to isolate and culture individual microbial species. Assembly of metagenomic DNA fragments is very much like the overlap-layout-consensus procedure for assembling isolated genomes, but is augmented by an additional binning step to differentiate scaffolds, contigs and unassembled reads into various taxonomic groups. In this paper, we employed n-mer oligonucleotide frequencies as the features and developed a hierarchical classifier (PCAHIER) for binning short (≤ 1,000 bp) metagenomic fragments.
BMC Bioinformatics
July 2010
Department of Biochemistry and Molecular Biology Ernesto Quagliariello, University of Bari, Bari, Italy.
Background: We study the usage of specific peptide platforms in protein composition. Using the pentapeptide as a unit of length, we find that in the universal proteome many pentapeptides are heavily repeated (even thousands of times), whereas some are quite rare, and a small number do not appear at all. To understand the physico-chemical-biological basis underlying peptide usage at the proteomic level, in this study we analyse the energetic costs for the synthesis of rare and never-expressed versus frequent pentapeptides.