Prediction of missing sequences and branch lengths in phylogenomic data.

Bioinformatics

Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany; Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe 76131, Germany.

Published: May 2016

AI Article Synopsis

  • Missing data in large phylogenomic datasets can degrade phylogenetic inference; in particular, trees inferred from alignments with missing per-gene or per-partition sequences may exhibit excessively long branches.
  • The study presents algorithms that correct these long branch lengths and methods that impute missing sequence data. Trees inferred from alignments with predicted sequences had branch lengths one to two orders of magnitude more accurate than trees inferred from alignments with missing data, while RF distances between the trees were unaffected.
  • The implementation and supplementary materials related to the algorithms can be accessed through a dedicated GitHub repository and at Bioinformatics online.

Article Abstract

Motivation: Missing data in large-scale phylogenomic datasets negatively affect phylogenetic inference. One effect of alignments with missing per-gene or per-partition sequences is that the inferred phylogenies may exhibit extremely long branch lengths. We investigate whether statistically predicting missing sequences for organisms, using information from genes/partitions that do have data for these organisms, alleviates the problem and improves phylogenetic accuracy.
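To make the notion of a missing per-partition sequence concrete, the following is a minimal sketch, not taken from the paper or from ForeSeqs. It assumes a toy alignment stored as a dictionary of equal-length strings and a hypothetical helper `mask_partition` that blanks out one taxon in one gene/partition; the taxon names, sequences and partition boundaries are invented for illustration.

```python
def mask_partition(alignment, taxon, start, end, missing="-"):
    """Return a copy of `alignment` in which `taxon` has no data in columns [start, end).

    `alignment` maps taxon names to aligned sequences of equal length.
    This helper is hypothetical and only illustrates a "gappy" alignment.
    """
    seq = alignment[taxon]
    if not 0 <= start <= end <= len(seq):
        raise ValueError("partition boundaries fall outside the alignment")
    masked = dict(alignment)  # shallow copy; strings are immutable
    masked[taxon] = seq[:start] + missing * (end - start) + seq[end:]
    return masked


# Toy data (assumed): three taxa, two partitions of five columns each.
alignment = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTTCGTAC",
    "taxon_C": "ACGAACGTTC",
}
partitions = {"gene1": (0, 5), "gene2": (5, 10)}

# taxon_B loses all of its data for gene2, but keeps gene1.
gappy = mask_partition(alignment, "taxon_B", *partitions["gene2"])
```

In this toy setting, `taxon_B` still has data in `gene1`, which is the kind of information the abstract describes using to predict the sequence that is missing in `gene2`.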

Results: We present several algorithms for correcting excessively long branch lengths induced by missing data. We also present methods for predicting/imputing missing sequence data. We evaluate our algorithms by systematically removing sequence data from three empirical and 100 simulated alignments. We then compare the Maximum Likelihood trees inferred from the gappy alignments and from the alignments with predicted sequence data to the trees inferred from the original, complete datasets. The datasets with predicted sequences yielded branch lengths that were one to two orders of magnitude more accurate than those of the trees inferred from the alignments with missing data. However, prediction did not affect the Robinson-Foulds (RF) distances between the trees.
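The RF distance mentioned above counts the bipartitions (splits) that appear in one tree but not in the other, so it compares topologies only and ignores branch lengths. The sketch below illustrates that definition under simplifying assumptions: trees are given as nested tuples of leaf names rather than Newick files, both trees share the same taxa, and the function names are hypothetical. It is not the evaluation code used in the study.

```python
def leaf_set(tree):
    """Return the set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, all_taxa):
    """Collect the non-trivial bipartitions induced by the internal edges of `tree`."""
    splits = set()
    ref = min(all_taxa)  # reference taxon used to pick one canonical side per split

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (single leaf on either side) and the root itself.
        if 1 < len(below) < len(all_taxa) - 1:
            side = below if ref not in below else all_taxa - below
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Unnormalised Robinson-Foulds distance between two trees on the same taxa."""
    taxa = leaf_set(tree_a)
    if taxa != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(bipartitions(tree_a, taxa) ^ bipartitions(tree_b, taxa))


# Toy topologies (assumed): t1 and t2 disagree on the placement of B and C.
t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
distance_same = rf_distance(t1, t1)
distance_diff = rf_distance(t1, t2)
```

Because the measure depends only on which splits are present, two trees with identical topology have an RF distance of zero regardless of how much their branch lengths differ; branch-length accuracy therefore has to be assessed with a separate measure.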

Availability And Implementation: https://github.com/ddarriba/ForeSeqs

Contact: diego.darriba@h-its.org

Supplementary Information: Supplementary data are available at Bioinformatics online.


Source
http://dx.doi.org/10.1093/bioinformatics/btv768

