Stud Health Technol Inform
April 2011
Terminologies that lack semantic connectivity hamper effective search in biomedical fact databases and document retrieval systems. Here we focus on the integration of two such isolated resources: the term lists from the protein fact database UNIPROT and the indexing vocabulary MESH from the bibliographic database MEDLINE. The generated semantic ties result from string matching and term set inclusion.
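The abstract does not spell out how string matching and term set inclusion are applied, so the following is only a minimal sketch of one possible reading: two concepts are linked when they share a normalized term, and the link counts as an inclusion when every term of the source concept also appears among the terms of the target concept. The function names, the normalization rule, the relation labels and the toy identifiers are all assumptions made for illustration, not the authors' procedure.

```python
import re

def normalize(term: str) -> str:
    """Lowercase, turn non-alphanumerics into spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", term.lower()).split())

def link_concepts(source, target):
    """Return a list of (source_id, target_id, relation) tuples.

    relation is "inclusion" when every normalized source term also occurs
    among the target's terms, and "match" when only some terms are shared.
    Both labels are assumptions, not terminology taken from the paper.
    """
    norm_target = {tid: {normalize(t) for t in terms}
                   for tid, terms in target.items()}
    links = []
    for sid, terms in source.items():
        s_norm = {normalize(t) for t in terms}
        for tid, t_norm in norm_target.items():
            if not (s_norm & t_norm):
                continue  # no shared string, no link
            relation = "inclusion" if s_norm <= t_norm else "match"
            links.append((sid, tid, relation))
    return links

# Hypothetical toy data: the identifiers are invented for this example.
uniprot_terms = {
    "U1": {"Interleukin-2", "IL-2"},
    "U2": {"Tumor necrosis factor", "Cachectin"},
}
mesh_terms = {
    "M1": {"Interleukin 2", "IL 2", "T-Cell Growth Factor"},
    "M2": {"Tumor Necrosis Factor"},
}

links = link_concepts(uniprot_terms, mesh_terms)
```

In this toy run, U1 is linked to M1 by inclusion because both of its terms occur in M1 after normalization, while U2 is linked to M2 only by a shared string, since "Cachectin" has no counterpart there.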
Motivation: The high level of polymorphism associated with the major histocompatibility complex (MHC) poses a challenge to organizing associated bioinformatic data, particularly in the area of hematopoietic stem cell transplantation. Thus, this area of research has great potential to profit from the ongoing development of biomedical ontologies, which offer structure and definition to MHC-data related communication and portability issues.
Results: We introduce the design considerations, methodological foundations and implementational issues underlying MaHCO, an ontology which represents the alleles and encoded molecules of the major histocompatibility complex.
Motivation: The recognition and normalization of textual mentions of gene and protein names is both particularly important and challenging. The task is important because these names denote the crucial conceptual entities in biomedicine. It remains challenging because of widespread gene name ambiguities within species, across species, with common English words and with medical sublanguage terms.
Stud Health Technol Inform
November 2007
Natural language processing of real-world documents requires a preprocessor to perform several low-level tasks prior to linguistic analysis, such as splitting a piece of text into its constituent sentences and splitting each sentence into its constituent tokens. While these tasks are often considered unsophisticated clerical work, in the life sciences domain they pose enormous problems due to complex naming conventions. In this paper, we first introduce an annotation framework for sentence and token splitting underlying a newly constructed sentence- and token-tagged biomedical text corpus.
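To illustrate why sentence and token splitting is harder in this domain than it looks, here is a small sketch using generic regular-expression heuristics. It is not the annotation framework or the splitter described in the paper. The example sentence and all three rules are assumptions chosen only for demonstration.

```python
import re

# Invented example text with an abbreviation, a species name and gene symbols.
text = ("IL-2 (approx. 15 kDa) binds IL-2R. "
        "Expression was induced by E. coli LPS.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
# This also splits after "approx." and after "E.".
naive = re.split(r"(?<=[.!?])\s+", text)

# Slightly stricter rule: additionally require an uppercase letter to follow.
# It handles this text, but would miss a sentence starting with a lowercase
# gene symbol such as "p53".
stricter = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

# Naive tokenization: runs of word characters, or single punctuation marks.
# The symbol "IL-2R" is broken into three tokens.
tokens = re.findall(r"\w+|[^\w\s]", "IL-2R")
```

On this input the naive rule yields four fragments instead of two sentences, and the tokenizer breaks a single protein name apart, which is the kind of error a domain-specific splitter has to avoid.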
There is a growing need for the general-purpose description of the basic conceptual entities in the life sciences. Up until now, upper level models have mainly been purpose-driven, such as the GENIA ontology, originally devised as a vocabulary for corpus annotation. As an alternative, we here present BioTop, a description-logic-based top level ontology for molecular biology, which we consider an ontologically conscious redesign of the GENIA ontology.
The ever-increasing amount of textual information in biomedicine calls for effective procedures for automatic terminology extraction which assist biomedical researchers and professionals in gathering and organizing terminological knowledge encoded in text documents. In this study, we propose a new, linguistically grounded measure for automatically identifying multi-word terms from the biomedical literature. Our approach is based on the limited paradigmatic modifiability of terms and is tested on bigram, trigram and quadgram noun phrases extracted from a 104-million-word text corpus comprised of Medline abstracts.
Stud Health Technol Inform
June 2005
We compare the performance of two part-of-speech taggers trained on a German newspaper corpus for mixed types of medical documents. TnT, a tagger based on a statistical language model, outperforms Brill's rule-based tagger and, when supplied with additional lexicon resources, matches state-of-the-art performance figures (close to 97% accuracy) on the medical corpus. We explain this unexpected result by focusing on the statistically significant part-of-speech type overlap between the newspaper training set and the medical test set.