Reusing massive collections of publicly available biomedical data can significantly impact knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, hindering the findability and further reuse of the data. To combat this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating biomedical unstructured metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to deal with unseen-yet-relevant words related to each disease and tissue term being predicted from the input text, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental types or sources. Code, data, and trained models are available at https://github.com/krishnanlab/txt2onto2.0.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1093/bib/bbae652 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!