PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data.

BMC Genomics

INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Rua Alves Redol 9, Lisboa, 1000-029, Portugal.

Published: May 2022

Background: To better understand biodiversity, evolutionary biologists study phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted as phylogenetic trees, not only help explain evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems in building phylogenetic trees is the presence of missing biological data. More specifically, the risk of inferring wrong phylogenetic trees increases in proportion to the amount of missing values in the input data. Although methods have been proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints.

Results: We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data from its known entries. It offers a robust and configurable framework that combines multiple search strategies and machine learning, complemented by phylogenetic techniques, to infer lost phylogenetic distances more accurately. We evaluate the framework on three real-world datasets (two DNA-based sequence alignments and one containing amino acid data) and two additional instances with simulated DNA data. Moreover, we follow a design of experiments methodology to define the hyperparameter values of our algorithm, a more concise approach than the well-known exhaustive parameter search. With the percentage of missing data varied from 5% to 60%, PhyloMissForest generally outperforms the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results on simulated data also show improved imputations when dealing with large percentages of missing data.
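The evaluation protocol described above (hide a share of the known distances, impute them, compare against the truth) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how entries were masked or which error measure was used, so the symmetric random masking, the RMSE score, the toy matrix, and the function names `mask_distance_matrix`, `impute_with_mean`, and `rmse` are all hypothetical. The imputer is a plain mean-fill baseline standing in for the random forest, not PhyloMissForest itself.

```python
import random
from itertools import combinations


def mask_distance_matrix(dist, missing_fraction, seed=0):
    """Copy a symmetric distance matrix and hide a fraction of its
    off-diagonal pairs (set to None on both sides of the diagonal)."""
    n = len(dist)
    pairs = list(combinations(range(n), 2))  # upper-triangle index pairs
    rng = random.Random(seed)
    hidden = rng.sample(pairs, round(missing_fraction * len(pairs)))
    masked = [row[:] for row in dist]  # leave the input untouched
    for i, j in hidden:
        masked[i][j] = masked[j][i] = None
    return masked, hidden


def impute_with_mean(masked):
    """Placeholder imputer (NOT a random forest): fill every gap with the
    mean of the known off-diagonal distances."""
    n = len(masked)
    known = [masked[i][j] for i, j in combinations(range(n), 2)
             if masked[i][j] is not None]
    fill = sum(known) / len(known)
    return [[fill if v is None else v for v in row] for row in masked]


def rmse(true, imputed, hidden):
    """Root-mean-square error computed over the hidden pairs only."""
    if not hidden:
        return 0.0
    sq = sum((true[i][j] - imputed[i][j]) ** 2 for i, j in hidden)
    return (sq / len(hidden)) ** 0.5


# Toy 5-taxon distance matrix (made-up values, for illustration only).
D = [
    [0.0, 0.2, 0.5, 0.7, 0.9],
    [0.2, 0.0, 0.4, 0.6, 0.8],
    [0.5, 0.4, 0.0, 0.3, 0.6],
    [0.7, 0.6, 0.3, 0.0, 0.4],
    [0.9, 0.8, 0.6, 0.4, 0.0],
]

# Sweep missing-data levels within the 5% to 60% range quoted above.
for fraction in (0.05, 0.2, 0.4, 0.6):
    masked, hidden = mask_distance_matrix(D, fraction, seed=1)
    filled = impute_with_mean(masked)
    print(f"{fraction:.0%} missing: {len(hidden)} pairs hidden, "
          f"RMSE = {rmse(D, filled, hidden):.3f}")
```

Replacing `impute_with_mean` with a learned imputer would leave the rest of the harness unchanged, which is why masking and scoring are kept as separate functions here.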

Conclusions: By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art.

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9116704 (PMC)
http://dx.doi.org/10.1186/s12864-022-08540-6 (DOI)
