The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.
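
The detect-then-replace flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expressions, the `toy_name_detector` stand-in for a learned model, the union-style merging in `ensemble`, and the fixed `SURROGATES` table are all assumptions made for illustration. A production system would presumably use trained models and sample varied, realistic surrogates.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Hypothetical rule-based patterns; real rules would cover far more formats.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
TITLE_NAME_RE = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+([A-Z][a-z]+)")


def rule_detector(text: str) -> List[Span]:
    """Rule-based component: flag phone numbers and dates by pattern."""
    spans = [(m.start(), m.end(), "PHONE") for m in PHONE_RE.finditer(text)]
    spans += [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]
    return spans


def toy_name_detector(text: str) -> List[Span]:
    """Stand-in for a learned model: flag a capitalized word after a title."""
    return [(m.start(1), m.end(1), "NAME") for m in TITLE_NAME_RE.finditer(text)]


def ensemble(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Take the union of all detectors' spans, merging any that overlap."""
    spans = sorted(s for detect in detectors for s in detect(text))
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            prev = merged[-1]
            # Overlap: extend the earlier span and keep its label.
            merged[-1] = (prev[0], max(prev[1], end), prev[2])
        else:
            merged.append((start, end, label))
    return merged


# Assumed fixed surrogates, chosen only to keep the example deterministic.
SURROGATES = {"PHONE": "555-010-0199", "DATE": "01/01/2000", "NAME": "Smith"}


def replace_with_surrogates(text: str, spans: List[Span]) -> str:
    """Replace spans right to left so earlier offsets stay valid."""
    out = text
    for start, end, label in sorted(spans, reverse=True):
        out = out[:start] + SURROGATES[label] + out[end:]
    return out


note = "Dr. Jones saw the patient on 3/14/2019; call 507-555-1234."
spans = ensemble(note, [rule_detector, toy_name_detector])
deidentified = replace_with_surrogates(note, spans)
print(deidentified)
```

Taking the union of detectors favors recall, which matches the emphasis on minimizing leaked identifiers. Replacing detections with surrogates rather than redaction markers means that an identifier the detectors miss is harder to tell apart from the fictional values around it.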

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8212138
DOI: http://dx.doi.org/10.1016/j.patter.2021.100255
