Motivation: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches for extracting information about biological events focus on limited-scale studies and/or abstracts, and the extracted data lack context and are rarely available to support further research.
Results: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization.
Manual curation has long been used for extracting key information found within the primary literature for input into biological databases. The human immunodeficiency virus type 1 (HIV-1), human protein interaction database (HHPID), for example, contains 2589 manually extracted interactions, linked to 14,312 mentions in 3090 articles. The advancement of text-mining (TM) techniques has offered a possibility to rapidly retrieve such data from large volumes of text to a high degree of accuracy.
Background: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set.
Background: The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions.
Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text.
Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.
Nephronophthisis is a heterogenetic autosomal recessive disorder associated with multiple developmental abnormalities, including cystic kidney disease and retinal degeneration. Retinal dystrophies, in particular the X-linked forms, are believed to represent a distinct group of hereditary diseases; however, their genetic complexity and overlap with other syndromic diseases is increasingly apparent. In this study, we report that depletion of retinitis pigmentosa GTPase regulator (RPGR) during zebrafish embryogenesis causes developmental changes indistinguishable from the abnormalities caused by the depletion of nephrocystin-5 or nephrocystin-6.
BMC Bioinformatics
February 2010
Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles.
Results: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions.
Nephronophthisis (NPHP) is an autosomal recessive cystic kidney disease, caused by mutations of at least nine different genes. Several extrarenal manifestations characterize this disorder, including cerebellar defects, situs inversus and retinitis pigmentosa. While the clinical manifestations vary significantly in NPHP, mutations of NPHP5 and NPHP6 are always associated with progressive blindness.