Motivation: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.

Results: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), for both deep learning and dictionary-based methods.

Availability and Implementation: All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10281857
DOI: http://dx.doi.org/10.1093/bioinformatics/btad369
