Background: Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but generating synthetic tabular data (STD) that preserve the logical relationships present in the original data remains challenging. Filtering methods for SDG can lead to the loss of important information.
Objective: This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships.
Methods: The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) we used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single data set. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performance of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models.
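The 3 steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, `placeholder_generator` resamples rows and merely stands in for a trained conditional tabular or copula GAN, and partitioning on the joint values of the most strongly associated column pair is only one possible reading of the Cramer V criterion.

```python
import math
import random
from collections import Counter, defaultdict


def cramers_v(xs, ys):
    """Cramer V between two categorical columns (no bias correction)."""
    n = len(xs)
    row, col = Counter(xs), Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(row), len(col))
    if n == 0 or k < 2:
        return 0.0
    chi2 = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def most_associated_pair(rows, columns):
    """Return the column pair with the highest Cramer V and its value."""
    best, best_v = None, -1.0
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            v = cramers_v([r[a] for r in rows], [r[b] for r in rows])
            if v > best_v:
                best, best_v = (a, b), v
    return best, best_v


def divide(rows, key_columns):
    """Step 2 (divide): group rows by their values in key_columns."""
    subsets = defaultdict(list)
    for r in rows:
        subsets[tuple(r[c] for c in key_columns)].append(r)
    return dict(subsets)


def placeholder_generator(subset, n, rng):
    """ASSUMPTION: stand-in for a GAN trained on `subset`.
    It only resamples existing rows with replacement."""
    return [dict(rng.choice(subset)) for _ in range(n)]


def dc_generate(rows, key_columns, n_per_subset=None, seed=0):
    """Steps 2 and 3: generate per subset, then merge into one data set.
    A fixed n_per_subset yields equally sized (balanced) subsets."""
    rng = random.Random(seed)
    synthetic = []
    for subset in divide(rows, key_columns).values():
        n = len(subset) if n_per_subset is None else n_per_subset
        synthetic.extend(placeholder_generator(subset, n, rng))
    return synthetic
```

With rows stored as dictionaries, `dc_generate(rows, ["status"])` would correspond to a class-specific split on a hypothetical `status` column, while passing the pair returned by `most_associated_pair` corresponds to the Cramer V split as assumed here. Because each generator only ever sees one subset, value combinations that never occur together in the original data are less likely to be produced.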
Results: For all 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes), the synthetic data generated by our proposed model gave the better performance with each of the 4 classifiers (DT, RF, XGBoost, and LGBM). The CS- versus DC-based model performances were compared using mean area under the curve (AUC) values, each reported with its SD: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better than imbalanced synthetic data.
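For readers unfamiliar with the "mean AUC (SD)" format used above, the following sketch shows one way such a summary could be computed from repeated runs. It is an illustration only: the function names are hypothetical, and the use of the sample SD and of a percentage scale are assumptions, since the abstract does not state how the values were aggregated.

```python
import statistics


def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def summarize(aucs):
    """Mean and sample SD of per-run AUCs, on a percentage scale.
    ASSUMPTION: sample SD (n - 1 denominator) and values multiplied by 100."""
    pct = [100 * a for a in aucs]
    return statistics.mean(pct), statistics.stdev(pct)
```

Here each element passed to `summarize` would be the AUC of one classifier trained on one synthetic data set and evaluated on held-out real data.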
Conclusions: This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10709788 | PMC |
http://dx.doi.org/10.2196/47859 | DOI Listing |
NPJ Digit Med
January 2025
Biomedical Data Science Center, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland.
The use of synthetic data is a promising solution to facilitate the sharing and reuse of health-related data beyond its initial collection while addressing privacy concerns. However, there is still no consensus on a standardized approach for systematically evaluating the privacy and utility of synthetic data, impeding its broader adoption. In this work, we present a comprehensive review and systematization of current methods for evaluating synthetic health-related data, focusing on both privacy and utility aspects.
Nat Commun
January 2025
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA.
Recent barcoding technologies allow reconstructing lineage trees while capturing paired single-cell RNA-sequencing (scRNA-seq) data. Such datasets provide opportunities to compare gene expression memory maintenance through lineage branching and pinpoint critical genes in these processes. Here we develop Permutation, Optimization, and Representation learning based single Cell gene Expression and Lineage ANalysis (PORCELAN) to identify lineage-informative genes or subtrees where lineage and expression are tightly coupled.
Phys Med Biol
January 2025
School of Biomedical Engineering, ShanghaiTech University, No. 1 Zhongke Road, Pudong New Area, Shanghai, Shanghai, 201210, CHINA.
Objective: This study aims to propose a dual-domain network that not only reduces scatter artifacts but also retains structure details in CBCT.
Approach: The proposed network comprises a projection-domain sub-network and an image-domain sub-network. The projection-domain sub-network utilizes a division residual network to amplify the difference between scatter signals and imaging signals, facilitating the learning of scatter signals.
Chem Biodivers
January 2025
Saigon University, Institute of Environment-Energy Technology, 273 An Duong Vuong Street, Ho Chi Minh City, Ho Chi Minh City, VIET NAM.
The chemical investigation of the fruits of Garcinia schomburgkiana growing in Vietnam led to the isolation of a new anofinic acid derivative, 5-hydroxy-8-methoxyanofinic acid (1), a new xanthone, xanthoschome C (2), and a known synthetic phenolic analogue, 4-(2-hydroxybenzyl)-2-(4-hydroxybenzyl) phenol (3), along with seven known xanthones (4-10). The structures of all isolated compounds were determined using spectroscopic techniques (NMR and MS), in conjunction with comparison to existing literature data. All isolated compounds were assessed for their α-glucosidase inhibitory activity and showed significant inhibition, with IC50 values ranging from 12.
Nucleic Acids Res
January 2025
Ophthalmology, University of North Carolina, 130 Mason Farm Rd, Chapel Hill, NC 27517, USA.
Adeno-associated virus (AAV) inverted terminal repeats (ITRs) induce p53-dependent apoptosis in human embryonic stem cells (hESCs). To interrogate this phenomenon, a synthetic ITR (SynITR), harboring substitutions in putative p53 binding sites was generated and evaluated for vector production and gene delivery. Replication of SynITR flanked transgenic genome was similar compared to wild type (wt) ITR, with a modest increase in vector titers.