This study aimed to reduce reliance on large training datasets in support vector machine (SVM)-based clinical text analysis by categorizing keyword features. An enhanced Mayo smoking status detection pipeline was deployed. We used a corpus of 709 annotated patient narratives. The pipeline was optimized for local data entry practice and lexicon. SVM classifier retraining used a grouped keyword approach for better efficiency. Accuracy, precision, and F-measure of the unaltered and optimized pipelines were evaluated using k-fold cross-validation. Initial accuracy of the clinical Text Analysis and Knowledge Extraction System (cTAKES) package was 0.69. Localization and keyword grouping improved system accuracy to 0.90 and 0.92, respectively. F-measures for current and past smoker classes improved from 0.43 to 0.81 and 0.71 to 0.91, respectively. Non-smoker and unknown-class F-measures were 0.96 and 0.98, respectively. Keyword grouping had no negative effect on performance and decreased training time. Grouping keywords is a practical method to reduce training corpus size.
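The keyword grouping idea and the per-class F-measure reported above can be illustrated with a minimal sketch. The group names and word lists below are hypothetical (the abstract does not give the study's lexicon), and the sketch covers only feature construction and scoring, not the SVM classifier or the cTAKES pipeline itself.

```python
from collections import Counter
import re

# Hypothetical keyword groups; the study's actual lexicon is not given in the abstract.
KEYWORD_GROUPS = {
    "SMOKE_TERM": {"smoke", "smokes", "smoker", "smoking", "tobacco", "cigarette", "cigarettes"},
    "PAST_CUE": {"quit", "former", "ex", "stopped"},
    "NEGATION": {"no", "never", "denies", "non"},
}

# Reverse lookup: each keyword maps to the single group it belongs to.
TOKEN_TO_GROUP = {tok: group for group, toks in KEYWORD_GROUPS.items() for tok in toks}


def grouped_features(text):
    """Count keyword groups rather than individual keywords.

    Collapsing many surface forms into a few group features shrinks the
    feature space, which is why fewer training examples may be needed.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(TOKEN_TO_GROUP[t] for t in tokens if t in TOKEN_TO_GROUP)


def f_measure(gold, predicted, label):
    """Balanced F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(grouped_features("Patient quit smoking in 2010; denies tobacco use."))
    gold = ["current", "past", "past", "non"]
    pred = ["current", "past", "current", "non"]
    print(round(f_measure(gold, pred, "past"), 3))
```

In a full pipeline, the group counts would be turned into a fixed-length vector and passed to the classifier, with the F-measure computed on each held-out fold during k-fold cross-validation.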
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912731 | PMC |
| http://dx.doi.org/10.1136/amiajnl-2013-002090 | DOI Listing |
BMC Med Inform Decis Mak
January 2025
Great Ormond Street Institute of Child Health, University College London, London, UK.
Introduction: Unsupervised feature learning methods inspired by natural language processing (NLP) models are capable of constructing patient-specific features from longitudinal Electronic Health Records (EHR).
Design: We applied document embedding algorithms to real-world paediatric intensive care (PICU) EHR data to extract patient-specific features from 1853 patients' PICU journeys using 647 unique lab tests and medication events. We evaluated the clinical utility of the patient features via a K-means clustering analysis.
Comput Methods Programs Biomed
January 2025
Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy.
Background And Objectives: Several computational pipelines for biomedical data have been proposed to stratify patients and to predict their prognosis through survival analysis. However, these analyses are usually performed independently, without integrating the information derived from each of them. Clustering of survival data is an underexplored problem, and current approaches are limited for biomedical applications, whose data are usually heterogeneous and multimodal, with poor scalability for high-dimensionality.
PLoS One
January 2025
Trinity Centre for Biomedical Engineering, Trinity College Dublin, Dublin, Ireland.
Electroencephalographic signals are obtained by amplifying and recording the brain's spontaneous biological potential using electrodes positioned on the scalp. While proven to help find changes in brain activity with a high temporal resolution, such signals are contaminated by non-stationary and frequent artefacts. A plethora of noise reduction techniques have been developed, achieving remarkable performance.
mSystems
January 2025
Department of Gastroenterology, Hepatology and Infectious Diseases, Otto-von-Guericke University Magdeburg, Magdeburg, Germany.
Microbiome analysis has become a crucial tool for basic and translational research due to its potential for translation into clinical practice. However, there is ongoing controversy regarding the comparability of different bioinformatic analysis platforms and a lack of recognized standards, which might have an impact on the translational potential of results. This study investigates how the performance of different microbiome analysis platforms impacts the final results of mucosal microbiome signatures.
Genome Med
January 2025
Department of Epidemiology of Microbial Disease, Yale School of Public Health, 60 College Street, New Haven, CT, USA.
Background: Mixed infection with multiple strains of the same pathogen in a single host can present clinical and analytical challenges. Whole genome sequence (WGS) data can identify signals of multiple strains in samples, though the precision of previous methods can be improved. Here, we present MixInfect2, a new tool to accurately detect mixed samples from Mycobacterium tuberculosis short-read WGS data.