Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is that EDFs that are extremely sparse in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled data can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB of unlabeled PubMed abstracts. The experimental results on BioCreative 2, the AIMED corpus, and the TREC 2005 Genomics Track show that 1) FCG can make good use of the sparse features ignored by supervised learning; 2) it improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the three tasks; and 3) our methods achieve 89.1 F-score, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.
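The idea of generalizing an EDF through its coupling with CDFs in unlabeled data can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how coupling degree is measured, so pointwise mutual information (PMI) over document-level co-occurrence is used here as an assumed stand-in, and the function name `coupling_features`, the toy documents, and the choice of tokens as features are all hypothetical.

```python
import math
from collections import Counter


def coupling_features(unlabeled_docs, edfs, cdfs):
    """Map each EDF to {CDF: PMI}, estimated from document-level co-occurrence.

    PMI is an assumed coupling measure, not necessarily the one used in FCG.
    Pairs that never co-occur are simply omitted from the result.
    """
    n_docs = len(unlabeled_docs)
    edf_count, cdf_count, joint_count = Counter(), Counter(), Counter()

    for doc in unlabeled_docs:
        tokens = set(doc)
        present_edfs = tokens & edfs
        present_cdfs = tokens & cdfs
        edf_count.update(present_edfs)
        cdf_count.update(present_cdfs)
        joint_count.update((e, c) for e in present_edfs for c in present_cdfs)

    features = {}
    for (edf, cdf), n_joint in joint_count.items():
        # PMI = log( P(edf, cdf) / (P(edf) * P(cdf)) )
        pmi = math.log(n_joint * n_docs / (edf_count[edf] * cdf_count[cdf]))
        features.setdefault(edf, {})[cdf] = pmi
    return features


# Toy unlabeled corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["brca1", "gene", "expression"],
    ["brca1", "gene", "mutation"],
    ["aspirin", "dose", "tablet"],
    ["tp53", "gene", "protein"],
]
features = coupling_features(
    docs,
    edfs={"brca1", "tp53", "aspirin"},  # sparse, example-specific features
    cdfs={"gene", "dose"},              # features indicative of a class
)
```

In this toy corpus, rare tokens such as `brca1` and `tp53` both receive a positive score against the CDF `gene`, so a downstream classifier could treat them alike even if only one of them appears in the labeled data.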
DOI: http://dx.doi.org/10.1109/TCBB.2010.99
Nat Commun
January 2025
National-Local Joint Engineering Laboratory of Druggability and New Drug Evaluation, National Engineering Research Center for New Drug and Druggability (cultivation), Guangdong Province Key Laboratory of New Drug Design and Evaluation, School of Pharmaceutical Sciences, Sun Yat-Sen University, Guangzhou, 510006, China.
Epitranscriptomic modifications, particularly N6-methyladenosine (m6A), are crucial regulators of gene expression, influencing processes such as RNA stability, splicing, and translation. Traditional computational methods for detecting m6A from Nanopore direct RNA sequencing (DRS) data are constrained by their reliance on experimentally validated labels, often resulting in the underestimation of modification sites. Here, we introduce pum6a, an innovative attention-based framework that integrates positive and unlabeled multi-instance learning (MIL) to address the challenges of incomplete labeling and missing read-level annotations.
J Am Med Inform Assoc
January 2025
Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States.
Objective: Extracting PICO elements (Participants, Intervention, Comparison, and Outcomes) from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities at fine granularity.
PLOS Digit Health
January 2025
Rwanda Ministry of Health, Kigali, Rwanda.
Community isolation of patients with communicable infectious diseases limits the spread of pathogens, but our understanding of isolated patients' needs and challenges is incomplete. Rwanda deployed a digital health service nationally to help public health clinicians remotely monitor and support SARS-CoV-2 cases via their mobile phones using daily interactive short message service (SMS) check-ins. We aimed to assess the texting patterns and communicated topics to better understand patient experiences.
Comput Methods Programs Biomed
January 2025
Christian Doppler Laboratory for Artificial Intelligence in Retina, Department of Ophthalmology and Optometry, Medical University of Vienna, Vienna, Austria; Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria.
Background and Objectives: Automated, anatomically coherent retinal layer segmentation in optical coherence tomography (OCT) is one of the most important components of retinal disease management. However, current methods rely on large amounts of labeled data, which can be difficult and expensive to obtain. In addition, these systems often propose anatomically impossible results, which undermines their clinical reliability.
Bioinformatics
January 2025
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China.
Motivation: Ensuring connectivity and preventing fractures in tubular object segmentation are critical for downstream analyses. Despite advancements in deep neural networks (DNNs) that have significantly improved tubular object segmentation, existing methods still face limitations. They often rely heavily on precise annotations, hindering their scalability to large-scale unlabeled image datasets.