Substring selection for biomedical document classification.

Bioinformatics

Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA.

Published: September 2006

Motivation: Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative stems are used as attributes in classification. Owing to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and can also remove informative stems. This can reduce accuracy, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes.
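The idea of replacing stems with discriminative substrings can be illustrated with a minimal sketch. The abstract does not say how substrings are enumerated or scored, so everything here is an assumption for illustration only: the function names, the substring length bounds, and the use of information gain as the measure of discriminative power.

```python
import math
from collections import Counter

def substrings(text, min_len=3, max_len=6):
    """Return the set of all character substrings with length in [min_len, max_len]."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(text) - n + 1)}

def entropy(pos, neg):
    """Binary entropy (in bits) of a split with `pos` positive and `neg` negative documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def select_substrings(docs, labels, k=10, min_len=3, max_len=6):
    """Rank substrings by information gain (an assumed criterion) and keep the top k."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    base = entropy(n_pos, n_neg)

    # Document frequency of each substring within each class.
    pos_df, neg_df = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (pos_df if label else neg_df).update(substrings(doc, min_len, max_len))

    scores = {}
    for s in set(pos_df) | set(neg_df):
        a, b = pos_df[s], neg_df[s]      # documents containing s
        c, d = n_pos - a, n_neg - b      # documents lacking s
        conditional = ((a + b) / n) * entropy(a, b) + ((c + d) / n) * entropy(c, d)
        scores[s] = base - conditional

    # Highest gain first; ties are broken alphabetically for reproducibility.
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]

if __name__ == "__main__":
    # Toy documents, invented for this example.
    docs = ["phosphorylation of serine residues",
            "phosphorylated at tyrosine",
            "gene expression profiling",
            "expression in liver tissue"]
    labels = [1, 1, 0, 0]
    print(select_substrings(docs, labels, k=5))
```

In this toy setting a substring such as "phos" is shared by "phosphorylation" and "phosphorylated" without any stemming rule, which is the kind of attribute the approach is meant to capture.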

Results: The approach was tested on five annotated sets of abstracts from iProLINK that report experimental evidence for five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [area under the ROC curve (AUC) in the range 0.92-0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
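For readers unfamiliar with the evaluation, the following sketch shows one way a Naive Bayes classifier over binary presence/absence attributes can be scored and how AUC can be computed from the resulting scores. The abstract does not specify the Naive Bayes variant or the evaluation protocol, so the Bernoulli model, the Laplace smoothing, and all function names are assumptions. It also assumes both classes occur in the training data.

```python
import math

def train_bernoulli_nb(X, y):
    """Fit a Bernoulli Naive Bayes model.

    X is a list of sets (attributes present in each document); y holds 0/1 labels.
    """
    vocab = set().union(*X)
    model = {"vocab": vocab, "prior": {}, "p": {}}
    for c in (0, 1):
        docs_c = [x for x, label in zip(X, y) if label == c]
        model["prior"][c] = len(docs_c) / len(y)
        # Laplace-smoothed probability that an attribute is present in class c.
        model["p"][c] = {a: (sum(a in x for x in docs_c) + 1) / (len(docs_c) + 2)
                         for a in vocab}
    return model

def score(model, x):
    """Return the log-odds that document x (a set of attributes) is positive."""
    log_post = {}
    for c in (0, 1):
        lp = math.log(model["prior"][c])
        for a in model["vocab"]:
            p = model["p"][c][a]
            lp += math.log(p if a in x else 1.0 - p)
        log_post[c] = lp
    return log_post[1] - log_post[0]

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Toy attribute sets, invented for this example.
    X = [{"pho"}, {"pho", "lat"}, {"exp"}, {"exp", "res"}]
    y = [1, 1, 0, 0]
    model = train_bernoulli_nb(X, y)
    print(auc([score(model, x) for x in X], y))
```

The pairwise AUC definition used here is quadratic in the number of documents, which is acceptable for small labeled sets of the kind the abstract describes.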

Source
http://dx.doi.org/10.1093/bioinformatics/btl350
