Effect of incremental feature enrichment on healthcare text classification system: A machine learning paradigm.

Comput Methods Programs Biomed

Advanced Knowledge Engineering Center, Global Biomedical Technologies, Inc., Roseville, CA, USA.

Published: April 2019

Article Abstract

Background and Objective: Healthcare tweets are particularly challenging to classify because of their sparse layout and limited character count. In contrast to previous methods based on the "bag of words" (BOW) model, this study identifies an enrichment protocol and examines how semantically different aspects of feature selection, namely BOW (feature F0), term frequency-inverse document frequency (TF-IDF, feature F1), and latent semantic indexing (LSI, feature F2), improve overall performance when applied sequentially with a classifier.
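The first enrichment step, from BOW counts (F0) to TF-IDF weights (F1), can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the sample tweets, the function names, and the smoothed IDF variant are all assumptions, and the LSI step (F2) is omitted because it requires a matrix factorization.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """F0: raw term counts per document over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    return vocab, [[c[term] for term in vocab] for c in counts]


def tfidf_vectors(bow):
    """F1: reweight BOW counts by a smoothed inverse document frequency."""
    n_docs = len(bow)
    n_terms = len(bow[0]) if bow else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(n_terms)]
    # Assumed smoothing variant: idf = ln((1 + N) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in bow]


# Hypothetical tweets, invented purely for illustration.
tweets = [
    "bad cough and nausea today",
    "cough cough all night",
    "stomach ache and nausea",
]
vocab, f0 = bow_vectors(tweets)
f1 = tfidf_vectors(f0)
```

In this sketch a term that occurs in fewer documents (such as "today") receives a larger weight than an equally frequent term that occurs in more documents (such as "cough"), which is the extra information F1 adds over F0.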

Methods: To study this enrichment concept, our machine learning (ML) model was tested on two diverse data sets: (i) D1: disease data with tweets related to conjunctivitis, diarrhea, stomach ache, cough and nausea, and (ii) D2: the WebKB4 dataset, using three kinds of classifiers: (a) C1: support vector machine with radial basis function (SVMR), (b) C2: multi-layer perceptron (MLP), and (c) C3: random forest (RF). A partition protocol (K10) was adopted, with different performance metrics, to evaluate the ML system.
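The partition protocol can be illustrated with a short sketch, assuming that K10 refers to 10-fold cross-validation, which the abstract does not spell out. The function name, the sample count, and the seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical data set of 50 samples split into 10 folds.
splits = list(k_fold_indices(n_samples=50, k=10))
```

Under this assumption, each round trains on roughly 90% of the samples and tests on the remaining 10%, and the reported metric would be averaged over the 10 rounds.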

Results: Using the combination of F1, C1, D1, and K10, ML accuracy was 94%, while with F2, C1, D1, and K10, ML accuracy was 97%. With incremental feature enrichment from F0 to F2 under the K10 protocol, F1 improved over F0 by 4.98% on the Disease dataset, while F2 improved over F0 by 11.78% on the WebKB4 dataset. We demonstrated generalization over memorization in our ML design. The system was tested for stability and reliability.

Conclusions: We conclude that semantically different aspects of feature selection, when applied sequentially, lead to improved ML accuracy for healthcare data sets. We validated the system on non-healthcare data sets.

Source
http://dx.doi.org/10.1016/j.cmpb.2019.01.011

Publication Analysis

Top Keywords

data sets: 12
incremental feature: 8
feature enrichment: 8
machine learning: 8
semantically aspects: 8
aspects feature: 8
feature selection: 8
webkb4 dataset: 8
k10 accuracy: 8
feature: 6

Similar Publications

Microfluidic and Computational Tools for Neurodegeneration Studies.

Annu Rev Chem Biomol Eng

January 2025

Department of Chemical & Biomolecular Engineering, North Carolina State University, Raleigh, North Carolina, USA.

Understanding the molecular, cellular, and physiological components of neurodegenerative diseases (NDs) is paramount for developing accurate diagnostics and efficacious therapies. However, the complexity of ND pathology and the limitations associated with conventional analytical methods undermine research. Fortunately, microfluidic technology can facilitate discoveries through improved biomarker quantification, brain organoid culture, and small animal model manipulation.

Similarities and differences in waste composition over time and space determined by multivariate distance analyses.

PLoS One

January 2025

Waste Data and Analysis Center, Department of Technology & Society, Stony Brook University, Stony Brook, New York, United States of America.

The composition of solid waste affects technology choices and policy decisions regarding its management. Analyses of waste composition studies are almost always made on a parameter-by-parameter basis. Multivariate distance techniques can create holistic determinations of similarities and differences and were applied here to enhance a series of waste composition comparisons.

The emerging crop Camelina sativa (L.) Crantz (camelina) is a Brassicaceae oilseed with a rapidly growing reputation for the deployment of advanced lipid biotechnology and metabolic engineering. Camelina is recognised by agronomists for its traits including yield, oil/protein content, drought tolerance, limited input requirements, plasticity and resilience.

Transferability of Single- and Cross-Tissue Transcriptome Imputation Models Across Ancestry Groups.

Genet Epidemiol

January 2025

Centre for Genetics and Genomics Versus Arthritis, Centre for Musculoskeletal Research, Division of Musculoskeletal and Dermatological Sciences, The University of Manchester, Manchester, UK.

Transcriptome-wide association studies (TWAS) investigate the links between genetically regulated gene expression and complex traits. TWAS involves imputing gene expression using expression quantitative trait loci (eQTL) as predictors and testing the association between the imputed expression and the trait. The effectiveness of TWAS depends on the accuracy of these imputation models, which require genotype and gene expression data from the same samples.

Piloting a minimum data set for older people living in care homes in England: a developmental study.

Age Ageing

January 2025

Centre for Research in Public Health and Community Care (CRIPACC), University of Hertfordshire, College Lane, Hatfield, UK.

Background: We developed a prototype minimum data set (MDS) for English care homes, assessing feasibility of extracting data directly from digital care records (DCRs) with linkage to health and social care data.

Methods: Through stakeholder development workshops, literature reviews, surveys and public consultation, we developed an aspirational MDS. We identified ways to extract this from existing sources, including DCRs and routine health and social care datasets.
