Predicting Causes of Data Quality Issues in a Clinical Data Research Network.

Ritu Khare, Byron J Ruth, Matthew Miller, Joshua Tucker, Levon H Utidjian, Hanieh Razzaghi, Nandan Patibandla, Evanette K Burrows, L Charles Bailey

AMIA Jt Summits Transl Sci Proc

Departments of Pediatrics and Biomedical & Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, 19104.

Published: May 2018

Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable, because they are caused by errors in the extract-transform-load (ETL) code. With no prior knowledge of issue causes, however, partner institutions spend significant time investigating issues that turn out to be either inherent data characteristics or false alarms. This work investigates whether the cause of an issue (ETL, Characteristic, or False alarm) can be predicted before any time is spent investigating it. We trained a classifier on the metadata from 10,281 real-world data quality issues and achieved a cause prediction F1-measure of up to 90%. Although tested initially on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results.
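As an illustration of the general idea only, the following minimal sketch trains a classifier on issue metadata and scores its predictions with a per-class F1-measure. The abstract does not name the classifier or the metadata fields used, so everything here is an assumption: a categorical naive Bayes model is swapped in for simplicity, and the fields `check_type` and `domain`, their values, and the six training rows are hypothetical and not taken from PEDSnet.

```python
import math
from collections import Counter, defaultdict

# Hypothetical issue metadata (invented for illustration, not PEDSnet data).
TRAINING = [
    ({"check_type": "missing_field", "domain": "visit"}, "ETL"),
    ({"check_type": "missing_field", "domain": "drug"}, "ETL"),
    ({"check_type": "low_count", "domain": "drug"}, "Characteristic"),
    ({"check_type": "low_count", "domain": "lab"}, "Characteristic"),
    ({"check_type": "outlier", "domain": "lab"}, "False alarm"),
    ({"check_type": "outlier", "domain": "visit"}, "False alarm"),
]


def train(rows):
    """Collect the counts needed by a categorical naive Bayes model."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (cause, field) -> value frequencies
    vocab = defaultdict(set)             # field -> values seen in training
    for features, cause in rows:
        class_counts[cause] += 1
        for field, value in features.items():
            value_counts[(cause, field)][value] += 1
            vocab[field].add(value)
    return class_counts, value_counts, vocab


def predict(model, features):
    """Return the cause with the highest log-posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for cause, n in class_counts.items():
        score = math.log(n / total)  # log prior of the cause
        for field, value in features.items():
            seen = value_counts[(cause, field)][value]
            # Add-one smoothing, with one extra slot for unseen values.
            score += math.log((seen + 1) / (n + len(vocab[field]) + 1))
        if score > best_score:
            best, best_score = cause, score
    return best


def f1(true_causes, predicted_causes, cause):
    """F1-measure for one cause: harmonic mean of precision and recall."""
    pairs = list(zip(true_causes, predicted_causes))
    tp = sum(t == cause and p == cause for t, p in pairs)
    fp = sum(t != cause and p == cause for t, p in pairs)
    fn = sum(t == cause and p != cause for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


model = train(TRAINING)
new_issue = {"check_type": "missing_field", "domain": "lab"}
predicted_cause = predict(model, new_issue)
```

In a workflow like the one described, a predicted cause of `ETL` would flag the issue as worth investigating first, while predicted characteristics and false alarms could be deprioritized. How predictions are actually acted on is not stated in the abstract.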

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961770
