Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

Kevin De Angeli, Shang Gao, Ioana Danciu, Eric B Durbin, Xiao-Cheng Wu, Antoinette Stroup, Jennifer Doherty, Stephen Schwartz, Charles Wiggins, Mark Damesyn, Linda Coyle, Lynne Penberthy, Georgia D Tourassi, Hong-Jun Yoon

J Biomed Inform

Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37830, USA.

Published: January 2022

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
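The abstract's contrast between macro and micro F1 can be illustrated with a minimal sketch. The data below are hypothetical toy labels, not results from the paper, and `f1_scores` is an illustrative helper rather than the authors' evaluation code. Micro F1 pools all predictions, so common classes dominate it; macro F1 averages the per-class F1 scores, so every rare class counts as much as a common one.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: 95 reports of a common type, 5 of a rare type.
y_true = ["common"] * 95 + ["rare"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["common"] * 100

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# prints: micro F1 = 0.950, macro F1 = 0.487
```

In this toy case the classifier never identifies the rare class, yet micro F1 stays at 0.95 while macro F1 drops below 0.5. This is why a method can lead on one metric and trail on the other, as the abstract reports for the two ensemble approaches.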

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9274264
DOI: http://dx.doi.org/10.1016/j.jbi.2021.103957

Top Keywords: class imbalance; cancer types; classification rare; rare cancer; pathology reports; imbalance out-of-distribution; out-of-distribution datasets; datasets improving; improving robustness; robustness textcnn
Similar Publications

Enhanced ResNet-50 for garbage classification: Feature fusion and depth-separable convolutions.

PLoS One

January 2025

School of Business Economics, European Union University, Montreux, Switzerland.

Lingbo Li, Runpu Wang, Miaojie Zou, Fusen Guo, Yuheng Ren

As people's material living standards continue to improve, the types and quantities of household garbage they generate rapidly increase. Therefore, it is urgent to develop a reasonable and effective method for garbage classification. This is important for resource recycling and environmental improvement and contributes to the sustainable development of production and the economy.

The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study.

Stat Med

February 2025

Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands.

Alex Carriero, Kim Luijken, Anne de Hond, Karel G M Moons, Ben van Calster

Introduction: Risk prediction models are increasingly used in healthcare to aid in clinical decision-making. In most clinical contexts, model calibration (i.e.

Challenges and compromises: Predicting unbound antibody structures with deep learning.

Curr Opin Struct Biol

January 2025

Oxford Protein Informatics Group, Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom.

Alexander Greenshields-Watson, Odysseas Vavourakis, Fabian C Spoendlin, Matteo Cagiada, Charlotte M Deane

Therapeutic antibodies are manufactured, stored and administered in the free state; this makes understanding the unbound form key to designing and improving development pipelines. Prediction of unbound antibodies is challenging, specifically modelling of the CDRH3 loop, where inaccuracies are potentially worse due to a bias in structural data towards antibody-antigen complexes. This class imbalance provides a challenge for deep learning models trained on this data, potentially limiting generalisation to unbound forms.

DGMSCL: A dynamic graph mixed supervised contrastive learning approach for class imbalanced multivariate time series classification.

Neural Netw

January 2025

School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, 430070, Hubei, China.

Lipeng Qian, Qiong Zuo, Dahu Li, Hong Zhu

In the Imbalanced Multivariate Time Series Classification (ImMTSC) task, minority-class instances typically correspond to critical events, such as system faults in power grids or abnormal health occurrences in medical monitoring. Despite being rare and random, these events are highly significant. The dynamic spatial-temporal relationships between minority-class instances and other instances make them more prone to interference from neighboring instances during classification.

Severity grading of hypertensive retinopathy using hybrid deep learning architecture.

Comput Methods Programs Biomed

January 2025

Regional Institute of Ophthalmology, Indira Gandhi Institute of Medical Sciences, Patna, 800025, Bihar, India.

Supriya Suman, Anil Kumar Tiwari, Shreya Sachan, Kuldeep Singh, Seema Meena

Background and Objectives: Hypertensive retinopathy (HR) is a retinal manifestation of persistently elevated blood pressure. Severity grading of HR is essential for patient risk stratification, effective management, progression monitoring, timely intervention, and minimizing the risk of vision impairment. Computer-aided diagnosis and artificial intelligence (AI) systems play vital roles in the diagnosis and grading of HR.
