Methods for rapidly inferring the evolutionary history of species or populations from genome-wide data are improving, but computational constraints still limit what is feasible. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation, or LDA) to extract 'topic' frequencies from k-mers, which are derived from multilocus DNA sequences. These frequencies then serve as input to the program Contml in the PHYLIP package, which generates a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and on three biological datasets: (1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, (2) 5162 loci from 80 mammal species, and (3) raw, unaligned, non-orthologous PacBio sequences from 12 bird species. Results from both the empirical and simulated data suggest that the method is efficient and statistically robust. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure.
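The first step of the pipeline, turning multilocus sequences into k-mer "documents", can be illustrated with a minimal sketch. This is not the TopicContml code: the function names `kmers` and `kmer_counts`, the toy sequences, the choice of k = 3, and the gap handling (dropping `-` and skipping windows with ambiguous bases) are all assumptions made for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the overlapping k-mers of a DNA sequence.

    Assumption (not taken from the paper): gap characters are dropped
    first, and windows containing anything other than A, C, G, T are
    skipped.
    """
    cleaned = sequence.upper().replace("-", "")
    windows = (cleaned[i:i + k] for i in range(len(cleaned) - k + 1))
    return [w for w in windows if set(w) <= set("ACGT")]


def kmer_counts(loci, k):
    """Pool k-mer counts over all loci of one population or species."""
    counts = Counter()
    for sequence in loci:
        counts.update(kmers(sequence, k))
    return counts


# Hypothetical toy data: two populations, two short loci each.
samples = {
    "pop_A": ["ACGTAC", "GG-TTA"],
    "pop_B": ["ACGTTT", "GGNTTA"],
}

# One bag-of-k-mers "document" per population.
documents = {name: kmer_counts(loci, k=3) for name, loci in samples.items()}
```

In a full pipeline, count tables like `documents` would be passed to an LDA implementation, and the resulting per-sample topic frequencies would be written out as input for Contml; both of those steps are left out here.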
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11601389 | PMC |
| http://dx.doi.org/10.1101/2023.12.20.572577 | DOI Listing |