Background: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for three model organisms: Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of the answer keys used for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene lists that the model organism databases provide for the full journal articles. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined manually by annotators who were given guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by carefully assessing interannotator agreement and by pooling participant answers to improve the quality of the final testing dataset.
Results: Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement), but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add a further check on the test data; this allowed us to find additional errors, especially in Mouse. This led to a 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the Mouse answer key.
Conclusion: We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying the genes in a paper, since they contain only a fraction of the genes mentioned in the full text (25% for Fly, 36% for Mouse). We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster than interannotator analysis and allowed us to identify more conflicting genes.
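The two quality checks discussed above, three-way interannotator agreement and answer pooling, can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas used, so everything here is an assumption: agreement is taken as the share of gene identifiers named by any annotator that all three annotators named, and pooling flags identifiers for manual review when the participant systems and the answer key disagree. The function names, the `min_votes` threshold, and the identifiers `g1` to `g5` are hypothetical.

```python
from collections import Counter


def three_way_agreement(a, b, c):
    """Assumed definition: identifiers chosen by all three annotators,
    divided by identifiers chosen by at least one annotator."""
    union = a | b | c
    if not union:
        return 1.0  # nothing annotated, so nothing to disagree about
    return len(a & b & c) / len(union)


def pool_conflicts(gold, system_answers, min_votes=2):
    """Flag candidates for manual review (hypothetical pooling rule).

    possible_missing:  identifiers returned by at least `min_votes`
                       systems but absent from the answer key.
    possible_spurious: identifiers in the answer key that no system
                       returned.
    """
    # Count each identifier at most once per system.
    votes = Counter(g for answers in system_answers for g in set(answers))
    possible_missing = {g for g, n in votes.items()
                        if n >= min_votes and g not in gold}
    possible_spurious = {g for g in gold if votes[g] == 0}
    return possible_missing, possible_spurious


if __name__ == "__main__":
    annotators = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g1", "g2", "g4"}]
    print(three_way_agreement(*annotators))

    gold = {"g1", "g2"}
    systems = [{"g1", "g3"}, {"g1", "g3"}, {"g3", "g5"}]
    print(pool_conflicts(gold, systems))
```

In this sketch a flagged identifier is only a candidate for review; a human annotator would still decide whether the answer key should change.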
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1869005 | PMC |
| http://dx.doi.org/10.1186/1471-2105-6-S1-S12 | DOI Listing |