Metagenome fragment classification using N-mer frequency profiles.

Adv Bioinformatics

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA.

Published: July 2011

A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the large volumes of metagenomic data, that is, all DNA present in an environmental sample. A major obstacle in metagenomics is the difficulty of classifying accurately the short reads that current sequencing technology yields. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach scales to identifying any fragment from hundreds of genomes. It also performs well at the strain, species, and genus levels and achieves strain resolution even though it classifies ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis demonstrates that species-level accuracy reaches 90% for highly represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
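The general idea of classifying a fragment with N-mer frequency profiles and a naive Bayes classifier can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy genomes, the function names, the uniform prior over genomes, and the add-one pseudocount are all assumptions, since the abstract does not state the value of N or the smoothing used.

```python
import math
from collections import Counter


def nmer_counts(seq, n):
    """Count every overlapping N-mer in a sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_profiles(genomes, n):
    """Build one N-mer profile per genome: (counts, total number of N-mers)."""
    profiles = {}
    for name, seq in genomes.items():
        counts = nmer_counts(seq, n)
        profiles[name] = (counts, sum(counts.values()))
    return profiles


def classify(fragment, profiles, n, pseudocount=1.0):
    """Return the genome with the highest log-likelihood for the fragment.

    Assumes a uniform prior over genomes and additive smoothing over the
    4**n possible DNA N-mers (both are illustrative choices).
    """
    vocab = 4 ** n
    frag_counts = nmer_counts(fragment, n)
    best_name, best_score = None, -math.inf
    for name, (counts, total) in profiles.items():
        score = sum(
            c * math.log((counts[w] + pseudocount) / (total + pseudocount * vocab))
            for w, c in frag_counts.items()
        )
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical toy "genomes"; real profiles would come from full sequences.
genomes = {
    "toy_AT_rich": "ATATTTAATATAAATTTATATTAA",
    "toy_GC_rich": "GCGCCGGGCCGCGGCCCGCGGGCC",
}
profiles = train_profiles(genomes, n=3)
print(classify("TTAATAT", profiles, n=3)[0])
```

Working in log space avoids numerical underflow when many small N-mer probabilities are multiplied together.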

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2777009
DOI: http://dx.doi.org/10.1155/2008/205969

Similar Publications

(, )-mer-a simple statistical feature for sequence classification.

Bioinform Adv

July 2023

Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333-Quitandinha, 25651-076, Rio de Janeiro, Brazil.

Summary: The (, )-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared -mer and (, )-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (, )-mer frequency features are related to the highest performance metrics and often statistically outperformed the -mers.

Characterizing the empirical distribution of n-mer frequencies is a vital step in understanding the entire genome. It allows researchers to examine how complex the genome really is and to move beyond simple, traditional modeling frameworks that are often biased in the presence of abundant and/or extremely rare words. We hypothesize that models based on the negative binomial distribution and its zero-inflated counterpart will characterize the n-mer distributions of genomes better than the Poisson distribution.
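One quick way to see why a Poisson model may fit n-mer counts poorly is to compare the mean and variance of the counts: a Poisson distribution has variance equal to its mean, whereas a negative binomial allows the variance to exceed it. The sketch below is only an illustrative check, not the procedure of the publication above. The toy sequence and function names are assumptions, and the dispersion value is a simple method-of-moments estimate, r = mean² / (variance − mean).

```python
from collections import Counter
from itertools import product
from statistics import mean, pvariance


def nmer_count_vector(seq, n, alphabet="ACGT"):
    """Counts for every possible n-mer, including words that never occur."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return [counts["".join(word)] for word in product(alphabet, repeat=n)]


def dispersion_summary(counts):
    """Return (mean, variance, method-of-moments negative binomial size r).

    r is None when variance <= mean, i.e. when there is no overdispersion
    for a negative binomial to capture.
    """
    m = mean(counts)
    v = pvariance(counts)
    r = m * m / (v - m) if v > m else None
    return m, v, r


# Hypothetical toy sequence with a few very common 2-mers and many absent ones.
toy_seq = "ACGT" * 10 + "A" * 10
vector = nmer_count_vector(toy_seq, n=2)
m, v, r = dispersion_summary(vector)
print(f"mean={m:.3f} variance={v:.3f} r={r}")
```

A variance far above the mean, together with many zero counts, is the kind of pattern that motivates negative binomial and zero-inflated models.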

Extended acoustic modes in random systems with n-mer short range correlations.

J Phys Condens Matter

August 2011

Instituto de Física, Universidade Federal de Alagoas, Maceió-AL 57072-970, Brazil.

In this paper we study the propagation of acoustic waves in a one-dimensional medium with a short range correlated elasticity distribution. In order to generate local correlations we consider a disordered binary distribution in which the effective elastic constants can take on only two values, η(A) and η(B). We add an additional constraint that the η(A) values appear only in finite segments of length n.
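A disordered binary chain of this kind can be generated with a short sketch. This is an assumption-laden illustration rather than the construction used in the paper: the function names, the numeric values for the two elastic constants, and the rule that each random draw places either one block of n consecutive η(A) sites or a single η(B) site are all hypothetical choices. Adjacent blocks are allowed, so η(A) runs have lengths that are multiples of n.

```python
import random
from itertools import groupby


def nmer_chain(length, n, p_a, eta_a=1.0, eta_b=2.0, seed=None):
    """Build a binary chain in which eta_a only appears in blocks of n sites.

    At each step, a block of n eta_a sites is placed with probability p_a
    (if it still fits); otherwise a single eta_b site is placed.
    """
    rng = random.Random(seed)
    chain = []
    while len(chain) < length:
        if len(chain) + n <= length and rng.random() < p_a:
            chain.extend([eta_a] * n)
        else:
            chain.append(eta_b)
    return chain


def run_lengths(chain, value):
    """Lengths of the maximal runs of `value` in the chain."""
    return [len(list(group)) for key, group in groupby(chain) if key == value]


chain = nmer_chain(length=50, n=3, p_a=0.4, seed=1)
print(run_lengths(chain, 1.0))
```

Checking the run lengths is a simple way to confirm that the short range correlation was imposed as intended.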

Short prokaryotic DNA fragment binning using a hierarchical classifier based on linear discriminant analysis and principal component analysis.

J Bioinform Comput Biol

December 2010

School of Electrical and Computer Engineering, Georgia Institute of Technology, 210 Technology Circle, Savannah, GA 31407, USA.

Metagenomics is an emerging field in which the power of genomic analysis is applied to an entire microbial community, bypassing the need to isolate and culture individual microbial species. Assembly of metagenomic DNA fragments is very much like the overlap-layout-consensus procedure for assembling isolated genomes, but is augmented by an additional binning step that sorts scaffolds, contigs, and unassembled reads into taxonomic groups. In this paper, we employed n-mer oligonucleotide frequencies as the features and developed a hierarchical classifier (PCAHIER) for binning short (≤ 1,000 bp) metagenomic fragments.

Background: We study the usage of specific peptide platforms in protein composition. Using the pentapeptide as a unit of length, we find that in the universal proteome many pentapeptides are heavily repeated (even thousands of times), whereas some are quite rare, and a small number do not appear at all. To understand the physico-chemical-biological basis underlying peptide usage at the proteomic level, in this study we analyse the energetic costs for the synthesis of rare and never-expressed versus frequent pentapeptides.
