A reappraisal of sentence and token splitting for life sciences documents.

Stud Health Technol Inform

Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany.

Published: November 2007

Natural language processing of real-world documents requires several low-level tasks, such as splitting a piece of text into its constituent sentences and splitting each sentence into its constituent tokens, to be performed by a preprocessor prior to linguistic analysis. While these tasks are often considered unsophisticated clerical work, in the life sciences domain they pose enormous problems due to complex naming conventions. In this paper, we first introduce an annotation framework for sentence and token splitting that underlies a newly constructed sentence- and token-tagged biomedical text corpus. This corpus serves as a training environment and test bed for machine-learning-based sentence and token splitters using Conditional Random Fields (CRFs). Our evaluation experiments reveal that CRFs with a rich feature set substantially increase sentence and token detection performance.
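As a rough illustration of why sentence splitting is harder than it looks in this domain, the sketch below contrasts a naive punctuation-based splitter with the kind of per-boundary features a learned splitter could use. It is not the authors' implementation: the sample sentence, the function names, and the feature set are all assumptions made for this example, and no CRF is actually trained here.

```python
import re

# Hypothetical sample text: "E. coli" contains a period that is not a sentence end.
SAMPLE = "The protein was isolated from E. coli cells. It binds IL-2 in vitro."


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def boundary_features(text, i):
    """Describe the context of a candidate boundary character at text[i].

    The features are illustrative assumptions, not the paper's feature set.
    A sequence model such as a CRF could consume dictionaries like this one.
    """
    left = text[:i].split()
    right = text[i + 1:].split()
    prev_tok = left[-1] if left else ""
    next_tok = right[0] if right else ""
    return {
        "prev_token": prev_tok.lower(),
        "prev_is_short": len(prev_tok) <= 3,
        "next_is_capitalized": next_tok[:1].isupper(),
        "next_starts_with_digit": next_tok[:1].isdigit(),
        "is_last_char": i == len(text) - 1,
    }


if __name__ == "__main__":
    # The naive splitter breaks the organism name apart.
    for piece in naive_sentence_split(SAMPLE):
        print(repr(piece))
    # Features for every period in the sample text.
    for match in re.finditer(r"\.", SAMPLE):
        print(match.start(), boundary_features(SAMPLE, match.start()))
```

The naive splitter cuts the sample after "E.", whereas the features for that period (a short preceding token and a lowercase following token) differ from those of the genuine sentence end after "cells", which is the kind of signal a trained model can exploit.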


