Phylogenomics with paralogs.

Proc Natl Acad Sci U S A

Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center of Bioinformatics, Leipzig University, D-04107 Leipzig, Germany; Max Planck Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany; Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute for Theoretical Chemistry, University of Vienna, A-1090 Vienna, Austria; Center for Non-Coding RNA in Technology and Health, University of Copenhagen, 1870 Frederiksberg C, Denmark; and Santa Fe Institute, Santa Fe, NM 87501.

Published: February 2015

Phylogenomics relies heavily on well-curated sequence data sets that comprise, for each gene, exclusively 1:1 orthologs. Paralogs are treated as a dangerous nuisance that has to be detected and removed. We show here that this severe restriction of the data sets is not necessary. Building upon recent advances in mathematical phylogenetics, we demonstrate that gene duplications convey meaningful phylogenetic information and allow the inference of plausible phylogenetic trees, provided orthologs and paralogs can be distinguished with a degree of certainty. Starting from tree-free estimates of orthology, cograph editing can sufficiently reduce the noise to find correct event-annotated gene trees. The information of gene trees can then directly be translated into constraints on the species trees. Although the resolution is very poor for individual gene families, we show that genome-wide data sets are sufficient to generate fully resolved phylogenetic trees, even in the presence of horizontal gene transfer.
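
The link between an orthology estimate and an event-annotated gene tree can be illustrated with a small sketch. It assumes the standard result from mathematical phylogenetics that an orthology relation is consistent with some gene tree exactly when its graph is a cograph (it contains no induced path on four vertices), in which case the cotree labels each inner node as a speciation or a duplication. The code below only tests and decomposes a given graph; it does not perform the cograph editing step mentioned in the abstract, and the function names `make_adj`, `components`, and `cotree` are hypothetical helpers, not taken from the paper.

```python
def make_adj(vertices, edges):
    """Build an adjacency dict {gene: set of orthologous genes} (hypothetical helper)."""
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    return adj


def components(vertices, adj):
    """Connected components of the subgraph induced by `vertices`."""
    vertices = set(vertices)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], {v}
        while stack:
            u = stack.pop()
            for w in adj[u] & vertices:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def cotree(vertices, adj):
    """Return the cotree as nested (label, children) tuples with genes as leaves.

    Returns None if the induced orthology graph is not a cograph, i.e. the
    estimate would need editing before it can be explained by a gene tree.
    """
    vertices = set(vertices)
    if len(vertices) == 1:
        return next(iter(vertices))
    comps = components(vertices, adj)
    if len(comps) > 1:
        # Disconnected: genes in different components are paralogs.
        label = "duplication"
    else:
        # Connected: split along the components of the complement graph.
        co_adj = {v: vertices - adj[v] - {v} for v in vertices}
        comps = components(vertices, co_adj)
        if len(comps) == 1:
            return None  # graph and complement both connected: induced P4 exists
        label = "speciation"
    children = [cotree(c, adj) for c in comps]
    if any(child is None for child in children):
        return None
    return (label, children)
```

For example, with one gene `x1` in one species that is orthologous to two paralogous copies `y1` and `y2` in another, `cotree` yields a speciation root whose children are `x1` and a duplication node over `y1` and `y2`. A path `a-b-c-d` of orthology edges is not a cograph, so the function returns `None`, signalling that the estimate is noisy and would need editing.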


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4343152
DOI: http://dx.doi.org/10.1073/pnas.1412770112

