Automated annotation of scientific texts for ML-based keyphrase extraction and validation.

Oluwamayowa O Amusat, Harshad Hegde, Christopher J Mungall, Anna Giannakou, Neil P Byers, Dan Gunter, Kjiersten Fagnan, Lavanya Ramakrishnan

Database (Oxford)

Scientific Data Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, United States.

Published: September 2024

Advanced omics technologies produce large amounts of data daily, but the data often lack the necessary metadata, making it difficult for researchers to access and use them effectively.
Machine learning (ML) techniques are emerging as solutions for automatically annotating these datasets, but the process of text labeling to validate this metadata remains manual and time-consuming, highlighting the need for automation.
This paper presents two new automated text labeling approaches aimed at improving metadata validation in environmental genomics, utilizing relationships between data sources and controlled vocabularies to enhance the efficiency and effectiveness of metadata extraction.

Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the essential metadata required for researchers to find, curate, and search them effectively. The lack of metadata poses a significant challenge in the utilization of these data sets. Machine learning (ML)-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific data sets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming and not always feasible; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining data sets. In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information that is only available for select documents within a corpus to validate ML models, which can then be used to describe the remaining documents in the corpus. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches in the context of environmental genomics research for ML-generated metadata validation. 
Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
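
As an illustration of the second technique and of the label-matching comparison, the following minimal Python sketch assigns labels to a text by looking up terms from a controlled vocabulary, then computes the share of those labels that also appear among ML-extracted keywords. It is not the authors' implementation: the vocabulary, its `VOCAB:` identifiers, the example sentence, the keyword list, and the function names `assign_labels` and `match_rate` are all hypothetical, and the abstract does not specify how the matching percentage is computed, so dividing by the number of assigned labels is an assumption.

```python
import re

# Hypothetical mini-vocabulary (term -> placeholder identifier).
# A real controlled vocabulary or ontology would be much larger.
VOCABULARY = {
    "soil": "VOCAB:0001",
    "rhizosphere": "VOCAB:0002",
    "marine sediment": "VOCAB:0003",
    "metagenome": "VOCAB:0004",
}


def assign_labels(text, vocabulary):
    """Return the vocabulary terms that occur in the text as whole words or phrases."""
    lowered = text.lower()
    labels = set()
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, lowered):
            labels.add(term)
    return labels


def match_rate(labels, ml_keywords):
    """Fraction of assigned labels that also appear among the ML keywords.

    Assumption: the denominator is the number of assigned labels.
    """
    if not labels:
        return 0.0
    normalized = {keyword.lower() for keyword in ml_keywords}
    return len(labels & normalized) / len(labels)


# Made-up example text and made-up output of a keyword extractor.
document = "We sequenced the metagenome of rhizosphere soil from switchgrass plots."
ml_keywords = ["rhizosphere", "switchgrass", "sequencing"]

labels = assign_labels(document, VOCABULARY)
rate = match_rate(labels, ml_keywords)
print(sorted(labels), round(rate, 2))
```

Exact string matching keeps the sketch short; a practical system would likely also need synonym handling and term normalization, which are omitted here.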

DOI: http://dx.doi.org/10.1093/database/baae093

Publication Analysis

Top Keywords

data sets

text labeling

environmental genomics

metadata

automated text

ml-generated metadata

unlabeled texts

documents corpus

data

text
