Equitable machine learning counteracts ancestral bias in precision medicine.

Nat Commun

Department of Computer & Information Science & Engineering, University of Florida, 1889 Museum Rd, Gainesville, 32611, FL, USA.

Published: March 2025

Gold-standard genomic datasets severely underrepresent non-European populations, leading to inequities and a limited understanding of human disease. Therapeutics and outcomes remain hidden because we lack insights that could be gained from analyzing ancestrally diverse genomic data. To address this significant gap, we present PhyloFrame, a machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating functional interaction networks and population genomics data with transcriptomic training data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. Validation in fourteen ancestrally diverse datasets demonstrates that PhyloFrame is better able to adjust for ancestral bias across all populations. In particular, the ability to provide accurate predictions for underrepresented groups is substantially increased. Analysis of performance in the most diverse continental ancestry group, African, illustrates how phylogenetic distance from training data negatively impacts model performance, as well as PhyloFrame's capacity to mitigate these effects. These results demonstrate how equitable artificial intelligence (AI) approaches can mitigate ancestral bias in training data and contribute to equitable representation in medical research.

Source
http://dx.doi.org/10.1038/s41467-025-57216-8

Similar Publications

Evaluating ARG-estimation methods in the context of estimating population-mean polygenic score histories.

Genetics

March 2025

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90098, United States of America.

Scalable methods for estimating marginal coalescent trees across the genome present new opportunities for studying evolution and have generated considerable excitement, with new methods extending scalability to thousands of samples. Benchmarking of the available methods has revealed general tradeoffs between accuracy and scalability, but performance in downstream applications has not always been easily predictable from general performance measures, suggesting that specific features of the ARG may be important for specific downstream applications of estimated ARGs. To exemplify this point, we benchmark ARG estimation methods with respect to a specific set of methods for estimating the historical time course of a population-mean polygenic score (PGS) using the marginal coalescent trees encoded by the ancestral recombination graph (ARG).

Spatial patterns of genetic relatedness among samples reflect the past movements of their ancestors. Our ability to untangle this history has the potential to improve dramatically given that we can now infer the ultimate description of genetic relatedness, the ancestral recombination graph (ARG). By extending spatial theory previously applied to trees, we generalize the common model of Brownian motion to full ARGs, thereby accounting for correlations in trees along a chromosome while efficiently computing likelihood-based estimates of dispersal rate and genetic ancestor locations, with associated uncertainties.

Unraveling the evolution of the ATB2 subgroup basic leucine zipper transcription factors in plants and decoding the positive effects of BdibZIP44 and BdibZIP53 on heat stress in Brachypodium distachyon.

Plant Physiol Biochem

February 2025

Shanghai Key Laboratory of Plant Functional Genomics and Resources, Shanghai Chenshan Botanical Garden, Shanghai, 201602, China.

In plants, basic region/leucine zipper motif (bZIP) transcription factors (TFs) stand as pivotal regulators in a broad spectrum of developmental mechanisms and adaptive strategies against environmental pressures. However, the ancestral origins and the evolutionary progression of their functional diversity across plant species have yet to be thoroughly illuminated. This study delved into the ATB2 subgroup bZIP homologs, tracing them back to the ancestral charophyte lineage predating land plant emergence, and categorized them into four distinct phylogenetic clusters (Clades A to D).

Mammalian genomes are biased towards GC bases at third codon positions, likely due to a GC-biased ancestral genome and the selectively neutral recombination-related process of GC-biased gene conversion. The unwanted transcript hypothesis posits that this high GC content at synonymous sites may be beneficial for protecting against spurious transcripts, particularly in species with low effective population sizes. Utilising a 240 placental mammal genome alignment and single-base resolution conservation scores, we interpret sequence conservation at mammalian four-fold degenerate sites in this context and find evidence in support of the unwanted transcript hypothesis, including a strong GC bias, high conservation at sites relating to exon splicing, less human genetic variation at conserved four-fold degenerate sites, and conservation of sites important for epigenetic regulation of developmental genes.
