While often represented as static entities, gene networks are highly context-dependent. Here, we developed a multi-task learning strategy to yield context-specific representations of gene network dynamics. We assembled a corpus comprising ~103 million human single-cell transcriptomes from a broad range of tissues and diseases and performed a two-stage pretraining, first with non-malignant cells to generate a foundational model and then with continual learning on cancer cells to tune the model to the cancer domain.
The expansion of biobanks has significantly propelled genomic discoveries, yet the sheer scale of data within these repositories poses formidable computational hurdles, particularly in handling the extensive matrix operations required by prevailing statistical frameworks. In this work, we introduce computational optimizations to the SAIGE (Scalable and Accurate Implementation of Generalized Mixed Model) algorithm, notably employing a GPU-based distributed computing approach to tackle these challenges. We applied these optimizations to conduct a large-scale genome-wide association study (GWAS) across 2,068 phenotypes derived from electronic health records of 635,969 diverse participants from the Veterans Affairs (VA) Million Veteran Program (MVP).
Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma.
Genome-wide association studies (GWAS) have underrepresented individuals from non-European populations, impeding progress in characterizing the genetic architecture and consequences of health and disease traits. To address this, we present a population-stratified phenome-wide GWAS followed by a multi-population meta-analysis for 2,068 traits derived from electronic health records of 635,969 participants in the Million Veteran Program (MVP), a longitudinal cohort study of diverse U.S.
Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single-modality settings, such as whole-slide image tiles or RNA-Seq data.
Philos Trans A Math Phys Eng Sci
April 2017
Relevant to drivetrain bearing fatigue failures, we analyse non-steady wind turbine responses from interactions between energy-dominant daytime atmospheric turbulence eddies and the rotating blades of a GE 1.5 MW wind turbine using a unique dataset from a GE field experiment and computer simulation. Time-resolved local velocity data were collected at the leading and trailing edges of an instrumented blade together with generator power, revolutions per minute, pitch and yaw.