High-recall protein entity recognition using a dictionary.

Bioinformatics

Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Published: June 2005

Summary: Protein name extraction is an important step in mining biological literature. We describe two new methods for this task: semiCRFs and dictionary HMMs. SemiCRFs are a recently proposed extension to conditional random fields (CRFs) that enables more effective use of dictionary information as features. Dictionary HMMs are a technique in which a dictionary is converted to a large HMM that recognizes phrases from the dictionary, as well as variations of these phrases. Standard training methods for HMMs can be used to learn which variants should be recognized. We compared the performance of our new approaches with that of Maximum Entropy (MaxEnt) and normal CRFs on three datasets; all four methods improved on the best published results for two of the datasets. CRFs and semiCRFs achieved the highest overall performance according to the widely used F-measure, while the dictionary HMMs performed best at finding entities that actually appear in the dictionary, which is the measure of most interest in our intended application.
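The two evaluation views mentioned in the summary (overall F-measure versus success at finding entities that appear in the dictionary) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the exact-match span criterion, the lowercase string lookup, and all function names, tokens, and dictionary entries below are assumptions made for illustration only.

```python
def precision_recall_f1(predicted, gold):
    """Span-level precision, recall, and F1 using exact span matches (an assumption)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def dictionary_recall(predicted, gold, tokens, dictionary):
    """Recall restricted to gold spans whose text appears in the dictionary."""
    predicted = set(predicted)
    # Keep only gold entities whose lowercased text is a dictionary entry.
    in_dictionary = [
        span for span in set(gold)
        if " ".join(tokens[span[0]:span[1]]).lower() in dictionary
    ]
    if not in_dictionary:
        return 0.0
    found = sum(1 for span in in_dictionary if span in predicted)
    return found / len(in_dictionary)


# Hypothetical example sentence; spans are (start, end) token offsets, end exclusive.
tokens = "The p53 protein binds MDM2 and heat shock protein 70".split()
gold = [(1, 2), (4, 5), (6, 10)]       # p53, MDM2, heat shock protein 70
predicted = [(1, 2), (6, 10), (3, 4)]  # two correct spans, one spurious ("binds")
dictionary = {"p53", "heat shock protein 70"}  # MDM2 is deliberately absent

precision, recall, f1 = precision_recall_f1(predicted, gold)
dict_recall = dictionary_recall(predicted, gold, tokens, dictionary)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} dictionary recall={dict_recall:.3f}")
```

In this toy case the tagger misses MDM2, which lowers overall recall and F1, yet it finds every entity that is present in the dictionary. A method can therefore rank differently under the two measures, which is the distinction the summary draws.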

Availability: Dictionary HMMs were implemented in Java. The algorithms are available through the MINORTHIRD information extraction package at http://minorthird.sourceforge.net

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2857312
DOI: http://dx.doi.org/10.1093/bioinformatics/bti1006

Similar Publications

Domain parsing, or the detection of signals of protein structural domains from sequence data, is a complex and difficult problem. If carried out reliably it would be a powerful interpretive and predictive tool for genomic and proteomic studies. We report on a novel approach to domain parsing using consensus techniques based on Hidden Markov Models (HMMs) and BLAST searches built from a training set of 1471 continuous structural domains from the Dali Domain Dictionary (DDD).

Recognition of human genes by stochastic parsing.

Pac Symp Biocomput

October 1998

Genome Informatics Group, Electrotechnical Laboratories, Tsukuba, Japan.

A gene-finding system, GeneDecoder, based on a parsing technique using a stochastic grammar and a dictionary of genetic words is introduced. The structure of human genes is expressed by a stochastic grammar and a dictionary, whose components are the genetic words consisting of genetic phonemes, built as hidden Markov models (HMMs). The HMMs represent the nucleotide acid bases, the codons, and the amino acids.

Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences.
