Purpose: Real-world data (RWD) derived from electronic health records (EHRs) are often used to understand population-level relationships between patient characteristics and cancer outcomes. Machine learning (ML) methods enable researchers to extract characteristics from unstructured clinical notes and are more cost-effective and scalable than manual expert abstraction. These extracted data are then used in epidemiologic or statistical models as if they were abstracted observations. Analytical results derived from extracted data in this way may differ from those obtained with abstracted data, and standard ML performance metrics do not directly indicate the magnitude of this difference.

Methods: In this paper, we define the task of postprediction inference: recovering, from an ML-extracted variable, estimation and inference similar to what would be obtained by abstracting the variable. We consider fitting a Cox proportional hazards model that uses a binary ML-extracted variable as a covariate and evaluate four approaches for postprediction inference in this setting. The first two approaches require only the ML-predicted probability, while the latter two additionally require a labeled (human-abstracted) validation data set.
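
To illustrate why postprediction inference is needed in this setting, the following minimal sketch simulates survival data and fits a one-covariate Cox model twice: once with the true ("abstracted") binary covariate and once with a noisy stand-in for the ML-extracted version. It is not one of the four approaches evaluated in the paper; it only shows the naive plug-in analysis they aim to improve on. Everything in it is an assumption made for illustration: the sample size, prevalence, hazard rates, true log hazard ratio, sensitivity and specificity of the extraction, and the hypothetical helper names `cox_log_hr` and `simulate`.

```python
import math
import random


def cox_log_hr(times, events, x, iters=25):
    """Newton-Raphson fit of a one-covariate Cox partial likelihood.

    Assumes no tied event times (true here because times are continuous).
    Returns the estimated log hazard ratio for covariate x.
    """
    # Walk from the latest to the earliest time so the running sums
    # always describe the current risk set.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    beta = 0.0
    for _ in range(iters):
        s0 = s1 = s2 = 0.0
        score = info = 0.0
        for i in order:
            w = math.exp(beta * x[i])
            s0 += w
            s1 += w * x[i]
            s2 += w * x[i] * x[i]
            if events[i]:
                mean = s1 / s0
                score += x[i] - mean
                info += s2 / s0 - mean * mean
        if info <= 0:
            break
        step = score / info
        beta += step
        if abs(step) < 1e-8:
            break
    return beta


def simulate(n=4000, true_log_hr=math.log(2.0), sens=0.85, spec=0.90, seed=0):
    """Simulate censored survival data with a true and a misclassified covariate.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    times, events, x_true, x_ml = [], [], [], []
    for _ in range(n):
        xi = 1 if rng.random() < 0.3 else 0
        t_event = rng.expovariate(0.1 * math.exp(true_log_hr * xi))
        t_censor = rng.expovariate(0.05)
        times.append(min(t_event, t_censor))
        events.append(t_event <= t_censor)
        # The "ML-extracted" label is correct with probability sens (if xi == 1)
        # or spec (if xi == 0); otherwise it is flipped.
        correct = rng.random() < (sens if xi == 1 else spec)
        x_true.append(xi)
        x_ml.append(xi if correct else 1 - xi)
    return times, events, x_true, x_ml


times, events, x_true, x_ml = simulate()
beta_abstracted = cox_log_hr(times, events, x_true)
beta_extracted = cox_log_hr(times, events, x_ml)
print(f"true log HR:               {math.log(2.0):.3f}")
print(f"abstracted covariate fit:  {beta_abstracted:.3f}")
print(f"extracted covariate fit:   {beta_extracted:.3f}")
```

Under these assumptions, the fit on the misclassified covariate is pulled toward zero relative to the fit on the true covariate. The size of that gap depends on prevalence and on the joint behavior of errors and outcomes, not on sensitivity and specificity alone.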

Results: Our results for both simulated data and EHR-derived RWD from a national cohort demonstrate that we can improve inference from ML-extracted variables by leveraging a limited amount of labeled data.

Conclusion: We describe and evaluate methods for fitting statistical models using ML-extracted variables subject to model error. We show that estimation and inference are generally valid when using extracted data from high-performing ML models. More complex methods that incorporate auxiliary labeled data provide further improvements.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10281422
DOI: http://dx.doi.org/10.1200/CCI.22.00174
