Numerous studies have shown that a person's health status is closely related to their socioeconomic status. It is evident that incorporating socioeconomic data associated with a patient's geographic area of residence into clinical datasets will promote medical research. However, most socioeconomic variables are unique in combination and are affiliated with small geographical regions (e.g., census tracts) that are often associated with less than 20,000 people. Thus, sharing such tract-level data can violate the Safe Harbor implementation of de-identification under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). In this paper, we introduce a constraint-based k-means clustering approach to generate census tract-level socioeconomic data that is de-identification compliant. Our experimental analysis with data from the American Community Survey illustrates that the approach generates a protected dataset with high similarity to the unaltered values, and achieves a substantially better data utility than the HIPAA Safe Harbor recommendation of 3-digit ZIP code.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8861681 | PMC |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!