Corpus refactoring: a feasibility study.

Helen L Johnson William A Baumgartner Martin Krallinger K Bretonnel Cohen Lawrence Hunter

J Biomed Discov Collab

Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO, USA.

Published: September 2007

Background: Most biomedical corpora have not been used outside of the lab that created them, despite the fact that the availability of the gold-standard evaluation data that they provide is one of the rate-limiting factors for the progress of biomedical text mining. Data suggest that one major factor affecting the use of a corpus outside of its home laboratory is the format in which it is distributed. This paper tests the hypothesis that corpus refactoring - changing the format of a corpus without altering its semantics - is a feasible goal, namely that it can be accomplished with a semi-automatable process and in a time-effcient way. We used simple text processing methods and limited human validation to convert the Protein Design Group corpus into two new formats: WordFreak and embedded XML. We tracked the total time expended and the success rates of the automated steps.

Results: The refactored corpus is available for download at the BioNLP SourceForge website http://bionlp.sourceforge.net. The total time expended was just over three person-weeks, consisting of about 102 hours of programming time (much of which is one-time development cost) and 20 hours of manual validation of automatic outputs. Additionally, the steps required to refactor any corpus are presented.

Conclusion: We conclude that refactoring of publicly available corpora is a technically and economically feasible method for increasing the usage of data already available for evaluating biomedical language processing systems.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2072937	PMC
http://dx.doi.org/10.1186/1747-5333-2-4	DOI Listing

Publication Analysis

Top Keywords

corpus refactoring

total time

time expended

corpus

refactoring feasibility

feasibility study

study background

background biomedical

biomedical corpora

corpora lab

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!

A PHP Error was encountered