A framework for semisupervised feature generation and its applications in biomedical literature mining.

IEEE/ACM Trans Comput Biol Bioinform

College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.

Published: June 2011

Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs that are extremely sparse in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve F-scores of 89.1 and 64.5 and a normalized utility of 60.1 on the three benchmark data sets.
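The generalization step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how coupling degree is measured, so the sketch assumes pointwise mutual information (PMI) over document-level co-occurrence, and the function name `coupling_features`, the toy documents, and the choice of a 0.0 score for unseen pairs are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def coupling_features(unlabeled_docs, edfs, cdfs):
    """Map each EDF to a vector of coupling scores, one per CDF.

    Assumption: coupling degree is PMI of document-level co-occurrence.
    unlabeled_docs is an iterable of token collections, one per example.
    """
    edfs, cdfs = set(edfs), set(cdfs)
    n_docs = 0
    edf_count, cdf_count, pair_count = Counter(), Counter(), Counter()
    for tokens in unlabeled_docs:
        tokens = set(tokens)
        n_docs += 1
        present_edfs = edfs & tokens
        present_cdfs = cdfs & tokens
        edf_count.update(present_edfs)
        cdf_count.update(present_cdfs)
        # Count every (EDF, CDF) pair that appears in the same document.
        pair_count.update(product(present_edfs, present_cdfs))

    features = {}
    for e in edfs:
        vector = {}
        for c in sorted(cdfs):
            joint = pair_count[(e, c)]
            if joint == 0:
                # Assumption: pairs never seen together get a neutral score.
                vector[c] = 0.0
            else:
                vector[c] = math.log(joint * n_docs / (edf_count[e] * cdf_count[c]))
        features[e] = vector
    return features


# Toy unlabeled corpus (hypothetical): sparse entity names act as EDFs,
# frequent context words act as CDFs.
docs = [
    {"BRCA1", "gene", "mutation"},
    {"BRCA1", "gene"},
    {"aspirin", "drug"},
    {"aspirin", "drug", "dose"},
]
features = coupling_features(docs, edfs={"BRCA1", "aspirin"}, cdfs={"gene", "drug"})
print(features["BRCA1"])
```

In this toy setup, a word that is rare in labeled data is replaced by a short dense vector of its associations with class-indicative words, which a supervised learner could consume in place of, or alongside, the original sparse feature.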

Source
http://dx.doi.org/10.1109/TCBB.2010.99

Similar Publications

Decoding the m6A epitranscriptomic landscape for biotechnological applications using a direct RNA sequencing approach.

Nat Commun

January 2025

National-Local Joint Engineering Laboratory of Druggability and New Drug Evaluation, National Engineering Research Center for New Drug and Druggability (cultivation), Guangdong Province Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-Sen University, Guangzhou, 510006, China.

Epitranscriptomic modifications, particularly N6-methyladenosine (m6A), are crucial regulators of gene expression, influencing processes such as RNA stability, splicing, and translation. Traditional computational methods for detecting m6A from Nanopore direct RNA sequencing (DRS) data are constrained by their reliance on experimentally validated labels, often resulting in the underestimation of modification sites. Here, we introduce pum6a, an innovative attention-based framework that integrates positive and unlabeled multi-instance learning (MIL) to address the challenges of incomplete labeling and missing read-level annotations.

Objective: Extracting PICO elements (Participants, Intervention, Comparison, and Outcomes) from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities at a fine granularity.

Community isolation of patients with communicable infectious diseases limits the spread of pathogens, but our understanding of isolated patients' needs and challenges is incomplete. Rwanda deployed a digital health service nationally to help public health clinicians remotely monitor and support SARS-CoV-2 cases via their mobile phones using daily interactive short message service (SMS) check-ins. We aimed to assess the texting patterns and communicated topics to better understand patient experiences.

SD-LayerNet: Robust and label-efficient retinal layer segmentation via anatomical priors.

Comput Methods Programs Biomed

January 2025

Christian Doppler Laboratory for Artificial Intelligence in Retina, Department of Ophthalmology and Optometry, Medical University of Vienna, Vienna, Austria; Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria.

Background and Objectives: Automated, anatomically coherent retinal layer segmentation in optical coherence tomography (OCT) is one of the most important components of retinal disease management. However, current methods rely on large amounts of labeled data, which can be difficult and expensive to obtain. In addition, these systems often propose anatomically impossible results, which undermines their clinical reliability.

Motivation: Ensuring connectivity and preventing fractures in tubular object segmentation are critical for downstream analyses. Despite advancements in deep neural networks (DNNs) that have significantly improved tubular object segmentation, existing methods still face limitations. They often rely heavily on precise annotations, hindering their scalability to large-scale unlabeled image datasets.
