Utilizing image and caption information for biomedical document classification.

Pengyuan Li, Xiangying Jiang, Gongbo Zhang, Juan Trelles Trabucco, Daniela Raciti, Cynthia Smith, Martin Ringwald, G. Elisabeta Marai, Cecilia Arighi, Hagit Shatkay

Bioinformatics

Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA.

Published: July 2021

Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.

Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
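The two integration methods described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how a Figure-word is built from subfigure labels, so the sketch assumes a Figure-word is the sorted set of subfigure class labels in one figure. The label names, the vocabulary, the embedding values, and the meta-classifier weights are all hypothetical, and in practice the meta-classifier would be trained on the outputs of the base classifiers rather than given fixed weights.

```python
from collections import Counter
import math


def figure_word(subfigure_labels):
    """Assumed definition: one figure becomes the sorted set of its subfigure class labels."""
    return "+".join(sorted(set(subfigure_labels)))


def figure_word_vector(figures, vocabulary):
    """Count how often each Figure-word in the vocabulary occurs in a document's figures."""
    counts = Counter(figure_word(labels) for labels in figures)
    return [float(counts[word]) for word in vocabulary]


def concatenate(figure_vec, caption_vec, abstract_vec):
    """Integration method 1: join all features into a single larger document vector."""
    return list(figure_vec) + list(caption_vec) + list(abstract_vec)


def meta_classify(base_probs, weights, bias):
    """Integration method 2 (prediction step only): a logistic meta-classifier
    over the relevance probabilities produced by one base classifier per source.
    The weights and bias are hypothetical stand-ins for learned parameters."""
    z = bias + sum(w * p for w, p in zip(weights, base_probs))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical document with three figures, each a list of subfigure class labels.
figures = [["gel", "microscopy"], ["microscopy", "gel", "gel"], ["graph"]]
vocabulary = ["gel+microscopy", "graph", "microscopy"]

fig_vec = figure_word_vector(figures, vocabulary)

# Placeholder embeddings standing in for caption and title-and-abstract vectors.
doc_vec = concatenate(fig_vec, [0.1, 0.2], [0.3, 0.4, 0.5])

# Base-classifier probabilities for Figure-words, captions, titles-and-abstracts.
score = meta_classify([0.9, 0.6, 0.7], weights=[1.0, 1.0, 1.0], bias=-1.5)
```

The concatenated vector feeds a single classifier, whereas the meta-classification route keeps one classifier per information source and combines only their outputs.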

Availability and Implementation: Source code and the list of PMIDs of the publications in our datasets are available upon request.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8346654
DOI: http://dx.doi.org/10.1093/bioinformatics/btab331
