Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences.

J Bioinform Comput Biol

Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite BAUM 423, Pittsburgh, PA 15206-3701, USA.

Published: December 2012

Genome sequences contain a number of patterns that have biomedical significance, and repetitive sequences of various kinds are a primary component of most of them. We extended the suffix-array-based Biological Language Modeling Toolkit to compute n-gram frequencies, as well as n-gram language-model-based perplexity, in windows over the whole genome sequence in order to find biologically relevant patterns. We present the suite of tools and their application to the analysis of the whole human genome sequence.
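
To make the two quantities named in the abstract concrete, the following is a minimal Python sketch of n-gram counting and windowed perplexity over a nucleotide string. It is not the toolkit's implementation: the toolkit is suffix-array based, whereas this sketch uses a plain dictionary counter, and add-one smoothing, the four-letter alphabet, the window and step sizes, and the function names (ngram_counts, train, window_perplexity, scan) are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(seq, n):
    """Count every overlapping n-gram in seq (dictionary-based, not a suffix array)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train(seq, n):
    """Return n-gram counts and the matching (n-1)-symbol context counts."""
    counts_n = ngram_counts(seq, n)
    counts_ctx = Counter()
    for gram, c in counts_n.items():
        counts_ctx[gram[:-1]] += c  # context = n-gram without its last symbol
    return counts_n, counts_ctx


def window_perplexity(window, n, counts_n, counts_ctx, alphabet_size=4):
    """Perplexity of one window under an add-one-smoothed n-gram model (assumed smoothing)."""
    if len(window) < n:
        raise ValueError("window must be at least n symbols long")
    log_prob = 0.0
    predicted = 0
    for i in range(n - 1, len(window)):
        gram = window[i - n + 1:i + 1]
        context = gram[:-1]
        p = (counts_n[gram] + 1) / (counts_ctx[context] + alphabet_size)
        log_prob += math.log2(p)
        predicted += 1
    # Perplexity is 2 raised to the average negative log2-probability per symbol.
    return 2 ** (-log_prob / predicted)


def scan(seq, n=3, window=20, step=10):
    """Slide a window over seq and return (start, perplexity) pairs."""
    counts_n, counts_ctx = train(seq, n)
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        results.append((start, window_perplexity(chunk, n, counts_n, counts_ctx)))
    return results


if __name__ == "__main__":
    # Toy sequence: a varied prefix followed by a dinucleotide repeat.
    toy = "ACGTTGCAAGCTTAGGCTAACGGATCCAGT" + "AT" * 15
    for start, ppl in scan(toy):
        print(start, round(ppl, 3))
```

In this sketch the model is estimated from the same sequence it scores, so windows that fall inside a repeat are highly predictable and receive low perplexity, while windows over more varied sequence score higher. That contrast is the kind of signal a windowed scan can surface.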

Source
http://dx.doi.org/10.1142/S0219720012500163
