Validation Assessment of Privacy-Preserving Synthetic Electronic Health Record Data: Comparison of Original Versus Synthetic Data on Real-World COVID-19 Vaccine Effectiveness.

Echo Wang, Katrina Mott, Hongtao Zhang, Sivan Gazit, Gabriel Chodick, Mehmet Burcu

Pharmacoepidemiol Drug Saf

Epidemiology, Merck & Co., Inc., Rahway, New Jersey, USA.

Published: October 2024

Purpose: To assess the validity of privacy-preserving synthetic data by comparing results from synthetic versus original EHR data analysis.

Methods: A published retrospective cohort study on real-world effectiveness of COVID-19 vaccines by Maccabi Healthcare Services in Israel was replicated using synthetic data generated from the same source, and the results were compared between the synthetic and original datasets. The endpoints included COVID-19 infection, symptomatic COVID-19 infection, and hospitalization due to infection, and were also assessed in several demographic and clinical subgroups. In comparing synthetic versus original data estimates, several metrics were used: standardized mean differences (SMD), decision agreement, estimate agreement, confidence interval overlap, and the Wald test. Synthetic data were generated five times to assess the stability of results.
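The abstract names the comparison metrics without defining them. The following Python sketch shows one common way each could be computed, purely as an illustration. The definitions are assumptions drawn from general synthetic-data utility practice, not the authors' stated formulas: decision agreement is taken as both confidence intervals leading to the same conclusion relative to the null, estimate agreement as the synthetic point estimate falling inside the original confidence interval, and confidence interval overlap as the average fraction of each interval covered by their intersection. The Wald test here treats the two estimates as independent, which is a simplification. All function names are hypothetical.

```python
import math


def smd_binary(p_orig, p_syn):
    """Standardized mean difference for a binary characteristic (assumed formula)."""
    pooled_sd = math.sqrt((p_orig * (1 - p_orig) + p_syn * (1 - p_syn)) / 2)
    return 0.0 if pooled_sd == 0 else (p_syn - p_orig) / pooled_sd


def decision_agreement(ci_orig, ci_syn, null=0.0):
    """True if both intervals give the same conclusion relative to the null value."""
    def decision(ci):
        lo, hi = ci
        if lo > null:
            return 1    # significantly above the null
        if hi < null:
            return -1   # significantly below the null
        return 0        # interval includes the null
    return decision(ci_orig) == decision(ci_syn)


def estimate_agreement(est_syn, ci_orig):
    """True if the synthetic point estimate lies inside the original interval."""
    lo, hi = ci_orig
    return lo <= est_syn <= hi


def ci_overlap(ci_orig, ci_syn):
    """Average share of each interval covered by the intersection (0 to 1)."""
    lo_o, hi_o = ci_orig
    lo_s, hi_s = ci_syn
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    if inter <= 0:
        return 0.0
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


def wald_test(est_orig, se_orig, est_syn, se_syn):
    """Two-sided Wald test for a difference, assuming independent estimates."""
    z = (est_syn - est_orig) / math.sqrt(se_orig ** 2 + se_syn ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For ratio measures such as hazard ratios or odds ratios, these functions would typically be applied on the log scale, with `null=0.0` corresponding to a ratio of 1.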

Results: The distributions of demographic and clinical characteristics showed very small differences (< 0.01 SMD). In the comparison of vaccine effectiveness, assessed as relative risk reduction, between synthetic and original data, there was 100% decision agreement, 100% estimate agreement, and a high level of confidence interval overlap (88.7%-99.7%) in all five replicates across all subgroups. Similar findings were achieved in the assessment of vaccine effectiveness against symptomatic COVID-19 infection. In the comparison of hazard ratios for COVID-19-related hospitalization and odds ratios for symptomatic COVID-19 infection, the Wald tests suggested no significant difference between the respective effect estimates in all five replicates for all patient subgroups, but there were disagreements in the estimate and decision metrics in some subgroups and replicates.
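Vaccine effectiveness expressed as relative risk reduction is conventionally 1 minus the risk ratio of vaccinated to unvaccinated individuals. The sketch below computes a crude, unadjusted version with a log-scale (Katz-type) confidence interval. It is an illustrative assumption only: the abstract does not describe the estimation method used in the study, the function name is hypothetical, and the counts in the usage line are invented for demonstration and are not study data.

```python
import math


def vaccine_effectiveness(events_vax, n_vax, events_unvax, n_unvax, z=1.96):
    """Crude VE = 1 - RR with an approximate 95% CI built on the log risk ratio."""
    rr = (events_vax / n_vax) / (events_unvax / n_unvax)
    # Standard error of log(RR) for cumulative-incidence data
    se_log_rr = math.sqrt(
        1 / events_vax - 1 / n_vax + 1 / events_unvax - 1 / n_unvax
    )
    rr_lo = math.exp(math.log(rr) - z * se_log_rr)
    rr_hi = math.exp(math.log(rr) + z * se_log_rr)
    # The upper RR bound maps to the lower VE bound, and vice versa
    return 1 - rr, (1 - rr_hi, 1 - rr_lo)


# Hypothetical counts, for illustration only
ve, (ve_lo, ve_hi) = vaccine_effectiveness(10, 1000, 100, 1000)
print(f"VE = {ve:.1%} (95% CI {ve_lo:.1%} to {ve_hi:.1%})")
```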

Conclusions: Overall, comparison of synthetic versus original real-world data demonstrated good validity and reliability. Transparency about the process used to generate high-fidelity synthetic data and assurances of patient privacy are warranted.

DOI: http://dx.doi.org/10.1002/pds.70019

Top Keywords: synthetic data, synthetic versus, versus original, synthetic, privacy-preserving synthetic, comparing synthetic, data generated, covid-19 infection, demographic clinical, data

Similar Publications

Omics-driven onboarding of the carotenoid producing red yeast Xanthophyllomyces dendrorhous CBS 6938.

Appl Microbiol Biotechnol

December 2024

Life Sciences and Bioengineering Center, Department of Chemical Engineering, Worcester Polytechnic Institute, Worcester, MA, USA.

Emma E Tobin, Joseph H Collins, Celeste B Marsan, Gillian T Nadeau, Kim Mori

Transcriptomics is a powerful approach for functional genomics and systems biology, yet it can also be used for genetic part discovery. Here, we derive constitutive and light-regulated promoters directly from transcriptomics data of the basidiomycete red yeast Xanthophyllomyces dendrorhous CBS 6938 (anamorph Phaffia rhodozyma) and use these promoters with other genetic elements to create a modular synthetic biology parts collection for this organism. X.

Machine Learning Boosted Entropy-Engineered Synthesis of CuCo Nanometric Solid Solution Alloys for Near-100% Nitrate-to-Ammonia Selectivity.

ACS Appl Mater Interfaces

December 2024

Key Laboratory of Synthetic and Biological Colloids, Ministry of Education, School of Chemical and Material Engineering, Jiangnan University, 214122 Jiangsu, China.

Yao Hu, Bo Hu, Haihui Lan, Jiaxuan Gong, Renjing Hu

Nanometric solid solution alloys are utilized in a broad range of fields, including catalysis, energy storage, medical application, and sensor technology. Unfortunately, the synthesis of these alloys becomes increasingly challenging as the disparity between the metal elements grows, due to differences in atomic sizes, melting points, and chemical affinities. This study utilized a data-driven approach incorporating sample balancing enhancement techniques and multilayer perceptron (MLP) algorithms to improve the model's ability to handle imbalanced data, significantly boosting the efficiency of experimental parameter optimization.

Synthetic photoplethysmogram (PPG) signal generation using a genetic programming-based generative model.

J Med Eng Technol

December 2024

Department of Computer Engineering and Information Technology, Razi University, Kermanshah, Iran.

Fatemeh Ghasemi, Majid Sepahvand, Maytham N Meqdad, Fardin Abdali Mohammadi

Nowadays, photoplethysmograph (PPG) technology is being used more often in smart devices and mobile phones due to advancements in information and communication technology in the health field, particularly in monitoring cardiac activities. Developing generative models to generate synthetic PPG signals requires overcoming challenges like data diversity and limited data available for training deep learning models. This paper proposes a generative model by adopting a genetic programming (GP) approach to generate increasingly diversified and accurate data using an initial PPG signal sample.

Advanced metabolic Engineering strategies for the sustainable production of free fatty acids and their derivatives using yeast.

J Biol Eng

December 2024

Department of Chemical Engineering (BK21 FOUR Integrated Engineering), Kyung Hee University, Yongin-si, Gyeonggi-do, 17104, Republic of Korea.

Tisa Rani Saha, Nam Kyu Kang, Eun Yeol Lee

The biological production of lipids presents a sustainable method for generating fuels and chemicals. Recognized as safe and enhanced by advanced synthetic biology and metabolic engineering tools, yeasts are becoming versatile hosts for industrial applications. However, lipids accumulate predominantly as triacylglycerides in yeasts, which are suboptimal for industrial uses.

AEGAN-Pathifier: a data augmentation method to improve cancer classification for imbalanced gene expression data.

BMC Bioinformatics

December 2024

School of Computer Engineering, Jiangsu Ocean University, Lianyungang, 222005, China.

Qiaosheng Zhang, Yalong Wei, Jie Hou, Hongpeng Li, Zhaoman Zhong

Background: Cancer classification has consistently been a challenging problem, with the main difficulties being high-dimensional data and the collection of patient samples. Concretely, obtaining patient samples is a costly and resource-intensive process, and imbalances often exist between samples. Moreover, expression data is characterized by high dimensionality, small samples and high noise, which could easily lead to struggles such as dimensionality catastrophe and overfitting.
