Gold standard genomic datasets severely under-represent non-European populations, leading to inequities and a limited understanding of human disease. Therapeutics and outcomes remain hidden because we lack insights that could be gained from analyzing ancestrally diverse genomic data. To address this significant gap, we present PhyloFrame, a machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating functional interaction networks and population genomics data with transcriptomic training data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. Validation in fourteen ancestrally diverse datasets demonstrates that PhyloFrame is better able to adjust for ancestry bias across all populations. The ability to provide accurate predictions for underrepresented groups, in particular, is substantially increased. Analysis of performance in the most diverse continental ancestry group, African, illustrates how phylogenetic distance from training data negatively impacts model performance, as well as PhyloFrame's capacity to mitigate these effects. These results demonstrate how equitable artificial intelligence (AI) approaches can mitigate ancestral bias in training data and contribute to equitable representation in medical research.
DOI: http://dx.doi.org/10.1038/s41467-025-57216-8
Nat Commun
March 2025
Department of Computer & Information Science & Engineering, University of Florida, 1889 Museum Rd, Gainesville, 32611, FL, USA.
Genetics
March 2025
Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90098, United States of America.
Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ARG may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ancestral recombination graph (ARG).
bioRxiv
February 2025
Department of Ecology & Evolutionary Biology, University of Toronto.
Spatial patterns of genetic relatedness among samples reflect the past movements of their ancestors. Our ability to untangle this history has the potential to improve dramatically given that we can now infer the ultimate description of genetic relatedness, the ancestral recombination graph (ARG). By extending spatial theory previously applied to trees, we generalize the common model of Brownian motion to full ARGs, thereby accounting for correlations in trees along a chromosome while efficiently computing likelihood-based estimates of dispersal rate and genetic ancestor locations, with associated uncertainties.
Plant Physiol Biochem
February 2025
Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China.
In plants, basic region/leucine zipper motif (bZIP) transcription factors (TFs) stand as pivotal regulators in a broad spectrum of developmental mechanisms and adaptive strategies against environmental pressures. However, the ancestral origins and the evolutionary progression of their functional diversity across plant species have yet to be thoroughly illuminated. This study delved into the ATB2 subgroup bZIP homologs, tracing them back to the ancestral charophyte lineage predating land plant emergence, and categorized them into four distinct phylogenetic clusters (Clades A to D).
Nat Commun
February 2025
Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.
Mammalian genomes are biased towards GC bases at third codon positions, likely due to a GC-biased ancestral genome and the selectively neutral recombination-related process of GC-biased gene conversion. The unwanted transcript hypothesis posits that this high GC content at synonymous sites may be beneficial for protecting against spurious transcripts, particularly in species with low effective population sizes. Utilising a 240 placental mammal genome alignment and single-base resolution conservation scores, we interpret sequence conservation at mammalian four-fold degenerate sites in this context and find evidence in support of the unwanted transcript hypothesis, including a strong GC bias, high conservation at sites relating to exon splicing, less human genetic variation at conserved four-fold degenerate sites, and conservation of sites important for epigenetic regulation of developmental genes.