Optimal sampling for positive only electronic health record data.

Biometrics

Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

Published: December 2023

Identifying a patient's disease or health status from electronic health records (EHR) is a common task in EHR-based research, and estimating a classification model typically requires a benchmark training set of patients whose phenotype statuses are known. However, assessing a patient's phenotype is costly and labor-intensive, so the EHR records used for training should be selected carefully. We propose a procedure that selects the best training subsample of a limited size for a classification model by minimizing its mean-squared phenotyping/classification error (MSE). When it is available, our approach incorporates "positive only" information: an approximation of the true disease status that produces no false alarms. The sampling procedure also applies when the chosen classification model is misspecified. We provide theoretical justification for its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real-data example, and it is often satisfactory under criteria other than MSE as well.
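To make the "positive only" idea concrete, here is a minimal Python sketch on simulated data. It is not the sampling procedure proposed in the paper: the data-generating model, the function names, and the uniform review probabilities are all assumptions chosen for illustration, and the target is a simple prevalence rather than a classification model's MSE. The sketch only shows that a proxy with no false alarms settles the status of every flagged record, and that a reviewed subsample of the remaining records can be reweighted by its inverse sampling probability.

```python
import numpy as np

def simulate_positive_only(n, sensitivity=0.6, seed=0):
    """Simulate a covariate x, true status y, and a positive-only proxy s."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    # Hypothetical true disease model, used only for this illustration.
    p = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * x)))
    y = rng.binomial(1, p)
    # The proxy flags only some true cases and never flags a non-case.
    s = y * rng.binomial(1, sensitivity, size=n)
    return x, y, s

def uniform_review_sample(s, budget, seed=0):
    """Pick `budget` unflagged records for manual review, uniformly at random.

    Uniform probabilities are a placeholder, not an optimal design.
    """
    rng = np.random.default_rng(seed)
    unflagged = np.flatnonzero(s == 0)
    idx = rng.choice(unflagged, size=budget, replace=False)
    pi = budget / unflagged.size  # inclusion probability of each unflagged record
    return idx, pi

def weighted_prevalence(y_reviewed, pi, s):
    """Combine known cases (s == 1) with a reweighted reviewed subsample."""
    known_cases = s.sum()
    est_unflagged_cases = y_reviewed.sum() / pi
    return (known_cases + est_unflagged_cases) / s.size

x, y, s = simulate_positive_only(20000)
idx, pi = uniform_review_sample(s, budget=2000)
estimate = weighted_prevalence(y[idx], pi, s)
print(f"estimated prevalence: {estimate:.3f}, true prevalence: {y.mean():.3f}")
```

In this toy setting the review budget is spent only on unflagged records, because the flagged ones are already known to be cases. Replacing the uniform probabilities with model-dependent ones is where an optimal design such as the one described in the abstract would enter.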

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10333453
DOI: http://dx.doi.org/10.1111/biom.13824