A study of deep learning methods for de-identification of clinical notes in cross-institute settings.

BMC Med Inform Decis Mak

Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Clinical and Translational Research Building 2004 Mowry Road, PO Box 100177, Gainesville, Florida, USA.

Published: December 2019

Background: De-identification is a critical technology for enabling the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested considerable effort in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies have often used training and test data collected from the same institution, and few have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions.

Methods: We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five different word embeddings trained on general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies for customizing the deep learning models using UF notes and resources.

Results: Word embeddings pre-trained on a general English corpus achieved better performance than embeddings trained on de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958, respectively) when applied to another corpus annotated at UF Health. Linguistic features could further improve de-identification performance in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.
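The difference between the strict and relaxed F1 scores reported above can be illustrated with a minimal sketch. It assumes that a strict match requires identical boundaries and PHI type, and that a relaxed match requires only the same type and overlapping character ranges; the abstract does not define either criterion, and official evaluation scripts may define relaxed matching differently. The span format, the function names, and the example data are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical span format: (start offset, end offset exclusive, PHI type).
Span = Tuple[int, int, str]


def strict_match(gold: Span, pred: Span) -> bool:
    """Assumed strict criterion: identical boundaries and identical type."""
    return gold == pred


def relaxed_match(gold: Span, pred: Span) -> bool:
    """Assumed relaxed criterion: same type and overlapping character ranges."""
    return gold[2] == pred[2] and gold[0] < pred[1] and pred[0] < gold[1]


def f1_score(gold: List[Span], pred: List[Span],
             match: Callable[[Span, Span], bool]) -> float:
    """Entity-level F1 under the given matching criterion."""
    # Predictions that match at least one gold span count toward precision.
    matched_pred = sum(1 for p in pred if any(match(g, p) for g in gold))
    # Gold spans that are matched by at least one prediction count toward recall.
    matched_gold = sum(1 for g in gold if any(match(g, p) for p in pred))
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: the DATE prediction overlaps the gold span but has a
# different start offset, so it counts only under the relaxed criterion.
gold_spans = [(0, 10, "NAME"), (20, 30, "DATE"), (40, 45, "PHONE")]
pred_spans = [(0, 10, "NAME"), (22, 30, "DATE"), (50, 55, "ID")]

strict_f1 = f1_score(gold_spans, pred_spans, strict_match)
relaxed_f1 = f1_score(gold_spans, pred_spans, relaxed_match)
```

In this made-up example only the NAME span matches exactly, while the DATE span additionally matches under the overlap criterion, so the relaxed score is higher than the strict score, which is the same ordering as in the reported results.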

Conclusions: It is necessary to customize de-identification models using local clinical text and other resources when they are applied in cross-institute settings. Fine-tuning is a potential solution for re-using pre-trained parameters and reducing the training time needed to customize deep learning-based de-identification models trained using a clinical corpus from a different institution.
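The mechanics of fine-tuning, that is, continuing training from previously learned parameters rather than from a fresh initialization, can be sketched with a toy model. This is not the study's architecture: a tiny logistic-regression token classifier stands in for the deep learning models, and the features, data, and hyperparameters are all hypothetical. The sketch makes no claim about training time or accuracy.

```python
import math
from typing import List, Tuple

# Hypothetical example format: (features, label), label 1 = PHI token.
# Features: [bias, is_capitalized, matches_local_pattern].
Example = Tuple[List[float], int]


def predict(w: List[float], x: List[float]) -> float:
    """Probability that a token is PHI under a logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w: List[float], data: List[Example]) -> float:
    """Mean negative log-likelihood of the labels."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = min(max(predict(w, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(w: List[float], data: List[Example],
          epochs: int, lr: float = 0.5) -> List[float]:
    """Full-batch gradient descent starting from the given weights."""
    w = list(w)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(data)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Toy "source institution" data: the local-pattern feature never fires.
source = [([1.0, 1.0, 0.0], 1), ([1.0, 0.0, 0.0], 0),
          ([1.0, 1.0, 0.0], 1), ([1.0, 0.0, 0.0], 0)]
# Toy "target institution" data: a local pattern also marks PHI.
target = [([1.0, 1.0, 0.0], 1), ([1.0, 0.0, 1.0], 1),
          ([1.0, 0.0, 0.0], 0), ([1.0, 1.0, 1.0], 1)]

pretrained = train([0.0, 0.0, 0.0], source, epochs=200)
# Fine-tuning: reuse the pretrained weights as the starting point.
fine_tuned = train(pretrained, target, epochs=20)

loss_before = log_loss(pretrained, target)
loss_after = log_loss(fine_tuned, target)
```

Here the pretrained weights carry no signal for the local-pattern feature, and the short fine-tuning run on target data adjusts them so that the loss on the target data decreases.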

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894104
DOI: http://dx.doi.org/10.1186/s12911-019-0935-4

Publication Analysis

Top Keywords

clinical text: 20
cross-institute settings: 16
deep learning: 12
clinical notes: 12
deep learning-based: 12
learning-based de-identification: 12
de-identification models: 12
de-identification: 10
clinical: 10
de-identification clinical: 8

Similar Publications

Prenatal metal(loid) exposure and preterm birth: a systematic review of the epidemiologic evidence.

J Expo Sci Environ Epidemiol

January 2025

Department of Environmental Sciences & Engineering, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.

Background: Preterm birth (PTB) is a common pregnancy complication associated with significant neonatal morbidity. Prenatal exposure to environmental chemicals, including toxic and/or essential metal(loid)s, may contribute to PTB risk.

Objective: We aimed to summarize the epidemiologic evidence of the associations among levels of arsenic (As), cadmium (Cd), chromium (Cr), copper (Cu), mercury (Hg), manganese (Mn), lead (Pb), and zinc (Zn) assessed during the prenatal period and PTB or gestational age at delivery; to assess the quality of the literature and strength of evidence for an effect for each metal; and to provide recommendations for future research.

Introduction: Non-adherence to tuberculosis (TB) treatment poses a significant challenge to effective TB management globally and is a major contributor to the emergence of multidrug-resistant TB. Although adherence to TB treatment has been widely studied, a comprehensive evaluation of the comparative levels of adherence in high- versus low-TB burden settings remains lacking. The objective of this systematic review and meta-analysis is to assess the levels of adherence to TB treatment in high-TB burden countries compared to low-burden countries.

Objective: Extracting named entities from clinical free-text presents unique challenges, particularly when dealing with discontinuous entities, that is, mentions that are separated by unrelated words. Traditional NER methods often struggle to accurately identify these entities, prompting the development of specialised computational solutions. This paper systematically reviews and presents the methodologies developed for Discontinuous Named Entity Recognition in clinical texts, highlighting their effectiveness and the challenges they face.

ARCH: Large-scale knowledge graph via aggregated narrative codified health records analysis.

J Biomed Inform

January 2025

Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, 02115, MA, USA; VA Boston Healthcare System, 150 S Huntington Ave, Boston, 02130, MA, USA.

Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes (NLP). The complexity of EHR presents challenges in feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features.

Purpose: A quarter of intensive care unit (ICU) patients develop post-traumatic stress disorder (PTSD) after discharge. These patients could benefit from early detection of PTSD. Therefore, we explored the accuracy of text mining with self-narratives to identify ICU patients and surviving relatives at risk of PTSD in a pilot study.
