De-identification of free text data containing personal health information: a scoping review of reviews.

Int J Popul Data Sci

Manitoba Centre for Health Policy, Department of Community Health Sciences, Rady Faculty of Health Sciences, University of Manitoba.

Published: February 2024

Introduction: Using data in research often requires that the data first be de-identified, particularly in the case of health data, which often include Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHII). There are established procedures for de-identifying structured data, but de-identifying clinical notes, electronic health records, and other records that include free text data is more complex. Several different ways to achieve this are documented in the literature. This scoping review identifies categories of de-identification methods that can be used for free text data.

Methods: We adopted an established scoping review methodology to examine review articles published up to May 9, 2022, in Ovid MEDLINE; Ovid Embase; Scopus; the ACM Digital Library; IEEE Xplore; and Compendex. Our research question was: What methods are used to de-identify free text data? Two independent reviewers conducted title and abstract screening and full-text article screening using the online review management tool Covidence.

Results: The initial literature search retrieved 3,312 articles, most of which focused primarily on structured data. Eighteen publications describing methods of de-identification of free text data met the inclusion criteria for our review. The majority of the included articles focused on removing categories of personal health information identified by the Health Insurance Portability and Accountability Act (HIPAA). The de-identification methods they described combined rule-based methods or machine learning with other strategies such as deep learning.

Conclusion: Our review identifies and categorises de-identification methods for free text data as rule-based methods, machine learning, deep learning and a combination of these and other approaches. Most of the articles we found in our search refer to de-identification methods that target some or all categories of PHII. Our review also highlights how de-identification systems for free text data have evolved over time and points to hybrid approaches as the most promising direction for the future.
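
To make the rule-based category concrete, the following is a minimal sketch in Python, not a method taken from the review or from any of the included articles. The three patterns, the PATTERNS table and the deidentify function are hypothetical and cover only a few simple identifier formats; production systems typically rely on much larger rule sets, dictionaries and, in hybrid designs, a trained model alongside the rules.

```python
import re

# Hypothetical, deliberately small rule set for illustration only.
# Each entry maps a category tag to a regular expression.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def deidentify(text: str) -> str:
    """Replace every match of each pattern with a bracketed category tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen on 03/14/2021. Call 204-555-0199 or email jdoe@example.org."
print(deidentify(note))
# prints: Seen on [DATE]. Call [PHONE] or email [EMAIL].
```

Rules like these are transparent and easy to audit, but they miss identifiers that do not follow a fixed format (names, for instance), which is one reason rules are often combined with learned models.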


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10898315
DOI: http://dx.doi.org/10.23889/ijpds.v8i1.2153


Similar Publications

Chest computed tomography (CT) is essential for diagnosing and monitoring thoracic aortic dilations and aneurysms, conditions that place patients at risk of complications such as aortic dissection and rupture. However, aortic measurements in chest CT radiology reports are often embedded in free-text formats, limiting their accessibility for clinical care, quality improvement and research purposes. In this study, we developed a multi-method pipeline to extract structured aortic measurements from radiology reports, and compared the performance of fine-tuned BERT-based models with instruction-tuned Llama large language models (LLMs).


Objectives: In the USA, some tobacco companies replaced the marketing phrase '100% natural additive-free tobacco' with 'tobacco ingredients: tobacco & water' (T&W) after receiving warnings from the US Food and Drug Administration. This study assesses how people interpret the now-restricted additive-free claims and newer T&W claims on Natural American Spirit (NAS) and L&M cigarette packs.

Methods: An online between-subjects experiment randomised 2526 US adults to view one of three packs: an NAS additive-free pack, an NAS T&W pack or an L&M T&W pack.


Background: Machine learning models can reduce the burden on doctors by converting medical records into International Classification of Diseases (ICD) codes in real time, thereby enhancing the efficiency of diagnosis and treatment. However, this approach faces challenges such as small datasets, diverse writing styles, unstructured records, and the need for semimanual preprocessing. Existing approaches, such as naive Bayes, Word2Vec, and convolutional neural networks, have limitations in handling missing values and understanding the context of medical texts, leading to a high error rate.


Nurse Experiences in an Electronic Health Record Transition: A Mixed Methods Analysis.

Comput Inform Nurs

January 2025

Author Affiliations: Center for the Study of Healthcare Innovation, Implementation & Policy, VA Greater Los Angeles Health Care (Dr Brunner and Ms Amano), CA; Michael E. DeBakey VA Medical Center (Dr Davila), Houston, TX; Department of Medicine-Health Services Research, Baylor College of Medicine (Dr Davila), Houston, TX; VA Ann Arbor Healthcare System (Dr Krein), MI; Division of General Medicine, Department of Internal Medicine, University of Michigan Medical School (Dr Krein), Ann Arbor; Office of Nursing Services, Veterans Health Administration (Dr Sullivan and Ms Church), Washington, DC; Center of Innovation for Veteran-Centered and Value-Driven Care, Seattle VA Medical Center (Dr Sayre), WA; University of Washington School of Public Health (Dr Sayre), Seattle; Center for Healthcare Organization and Implementation Research, VA Bedford Healthcare System (Dr Rinne), MA; and Division of Pulmonary and Critical Care Medicine, Department of Medicine, Geisel School of Medicine, Dartmouth University (Dr Rinne), MA.

Transitions from one EHR to another can be enormously disruptive to care. Nurses are the largest group of EHR users, but nurse experiences with EHR transitions have not been well documented. We sought to understand nurse experiences with an EHR transition at the US Department of Veterans Affairs.


Objective: To compare various methods for extracting daily dosage information from prescription signatures (sigs) and identify the best performers.

Materials and Methods: In this study, 5 daily dosage extraction methods were identified. Parsigs, RxSig, Sig2db, a large language model (LLM), and a bidirectional long short-term memory (BiLSTM) model were selected.

