Virtual population (VP) and synthetic data generation is an emerging area of data science that has recently gained attention. It has the potential to significantly affect the healthcare industry by providing a means to augment clinical research databases that have too few subjects. The current study presents a comparative analysis of five distinct approaches for creating virtual data populations from real patient data. The dataset used for the analyses comprised clinical data collected from patients scheduled for elective coronary artery bypass graft (CABG) surgery. The five computational techniques employed to augment this dataset were: (i) Tabular Preset, (ii) Gaussian Copula Model, (iii) a Generative Adversarial Network (GAN)-based deep learning data synthesizer (CTGAN), (iv) a variation of the CTGAN model (Copula GAN), and (v) a variational autoencoder (VAE)-based deep learning data synthesizer (TVAE). The techniques were assessed on their effectiveness in producing high-quality virtual data. For this purpose, dataset correlation matrices, cosine similarity, density histograms, and kernel density estimation were used to compare each attribute with its synthetic equivalent. Our findings demonstrate that the Gaussian Copula Model performs best at creating virtual data with consistent distributions (Kolmogorov-Smirnov (KS) and Chi-Squared (CS) test scores of 0.9 and 0.98, respectively) and correlation patterns (average cosine similarity of 0.95). Clinical Relevance - It has been shown that the use of a VP can increase the predictive performance of an ML model compared with using a smaller, non-augmented population.
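To make the evaluation criteria concrete, the following is a minimal sketch of two of the comparisons named above: a per-attribute Kolmogorov-Smirnov comparison and a cosine similarity between the correlation matrices of the real and synthetic tables. It is not the authors' pipeline. The toy columns (`age`, `marker`), the helper names, and the reading of the reported KS figure as a similarity score of the form 1 - D are all assumptions made for illustration; the abstract does not specify how its scores were computed.

```python
import math
import random


def ks_statistic(real, synth):
    """Two-sample KS statistic D: largest gap between the two empirical CDFs."""
    real, synth = sorted(real), sorted(synth)
    n, m = len(real), len(synth)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(real[i], synth[j])
        # Advance both samples past x, then compare the CDFs at x.
        while i < n and real[i] <= x:
            i += 1
        while j < m and synth[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a list of columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def correlation_cosine(real_cols, synth_cols):
    """Cosine similarity between the flattened correlation matrices."""
    u = [v for row in correlation_matrix(real_cols) for v in row]
    w = [v for row in correlation_matrix(synth_cols) for v in row]
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_u * norm_w)


def toy_table(seed, rows=500):
    """Hypothetical two-column table; stands in for real or synthetic data."""
    rng = random.Random(seed)
    age = [rng.gauss(65.0, 8.0) for _ in range(rows)]
    marker = [0.5 * a + rng.gauss(0.0, 3.0) for a in age]
    return [age, marker]


real_cols = toy_table(seed=0)
synth_cols = toy_table(seed=1)  # placeholder for a synthesizer's output

# Assumed score convention: 1 - D, so higher means closer distributions.
ks_scores = [1.0 - ks_statistic(r, s) for r, s in zip(real_cols, synth_cols)]
corr_similarity = correlation_cosine(real_cols, synth_cols)

print("KS-based scores per attribute:", ks_scores)
print("Correlation cosine similarity:", corr_similarity)
```

In this sketch, values near 1 for both quantities would indicate that the synthetic table reproduces the marginal distributions and the pairwise correlation structure of the real one. Categorical attributes, which the abstract evaluates with a Chi-Squared test, are not covered here.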
DOI: http://dx.doi.org/10.1109/EMBC40787.2023.10340194