Publications by authors named "Tandy Warnow"

We address the problem of how to estimate a phylogenetic network when given single-nucleotide polymorphisms (i.e., SNPs, or bi-allelic markers that have evolved under the infinite sites assumption).

Article Synopsis
  • Relationships among avian lineages remain unresolved due to factors like species diversity, phylogenetic methods, and selection of genomic regions.
  • An analysis of 363 bird species' genomes reveals a well-supported evolutionary tree but highlights significant discrepancies among certain groups.
  • Findings suggest that after the Cretaceous-Palaeogene extinction, birds experienced increased population size and diversification, which offers a new foundational understanding for future research in avian evolution.

Background: Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity.

Results: We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.

Background: Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., "gene tree heterogeneity").
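Gene tree heterogeneity can be made concrete by comparing the bipartitions (splits) of two trees on the same leaf set. The following is a minimal sketch, not taken from the article: trees are written as nested tuples of leaf names, and the helper names `bipartitions` and `rf_distance` are hypothetical. It counts the splits found in one tree but not the other (the Robinson-Foulds distance).

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaf_set(child)
    return result


def bipartitions(tree):
    """Return the non-trivial splits of a tree, viewed as unrooted.

    Each split is stored as the side that does NOT contain a fixed
    anchor leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset()
        for child in node:
            below |= visit(child)
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(below) <= len(all_leaves) - 2:
            side = all_leaves - below if anchor in below else below
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Illustrative (made-up) trees on four species.
species_tree = (("A", "B"), ("C", "D"))
concordant_gene_tree = (("A", "B"), ("C", "D"))
discordant_gene_tree = (("A", "C"), ("B", "D"))

print(rf_distance(species_tree, concordant_gene_tree))
print(rf_distance(species_tree, discordant_gene_tree))
```

A distance of zero means the gene tree has the same unrooted topology as the species tree; any positive value indicates discordance, whatever its biological cause (ILS, GDL, HGT, or estimation error).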

Motivation: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix.
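The concatenation shortcut mentioned above can be sketched as follows. This is an illustrative example only, not the authors' pipeline: each gene alignment is assumed to be a dictionary from taxon name to aligned sequence, the function name `concatenate_alignments` is hypothetical, and taxa missing from a gene are padded with gap characters.

```python
def concatenate_alignments(gene_alignments, gap="-"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Taxa absent from a gene receive a run of gap characters of that
    gene's alignment width.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError("each gene alignment needs sequences of equal length")
        (width,) = widths
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Made-up toy data: taxon B is missing from the second gene.
genes = [
    {"A": "ACGT", "B": "ACGA", "C": "AC-A"},
    {"A": "GG", "C": "GT"},
]
supermatrix = concatenate_alignments(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

Branch lengths estimated from such a supermatrix are in substitution units, but the concatenation itself ignores the heterogeneity across genes that the summary methods are designed to handle.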

Motivation: Despite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity remains an inadequately solved problem, especially when the input includes very short sequences (either as a result of sequencing technologies or of large deletions during evolution).

Results: We present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given 'backbone' alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble.

Summary: Multiple sequence alignment is a basic part of many bioinformatics pipelines, including phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and because the input includes unassembled reads or incompletely assembled sequences. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity: UPP was one of the first to achieve good accuracy, and WITCH is a recent method that improves on UPP's accuracy.

Summary: Phylogenetic placement is the problem of placing 'query' sequences into an existing tree (called a 'backbone tree'). One of the most accurate phylogenetic placement methods to date is the maximum likelihood-based method pplacer, using RAxML to estimate numeric parameters on the backbone tree and then adding the given query sequence to the edge that maximizes the probability that the resulting tree generates the query sequence. Unfortunately, this way of running pplacer fails to return valid outputs on many moderately large backbone trees and so is limited to backbone trees with at most ∼10 000 leaves.

Motivation: Genes evolve under processes such as gene duplication and loss (GDL), which makes gene family trees multi-copy, and incomplete lineage sorting (ILS); both processes produce gene trees that differ from the species tree. The estimation of species trees from sets of gene family trees is challenging, and the estimation of rooted species trees presents additional analytical challenges. Two of the methods developed for this problem are STRIDE, which roots species trees by considering GDL events, and Quintet Rooting (QR), which roots species trees by considering ILS.

Motivation: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble.
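The first step described above, separating the full-length sequences from the rest, can be sketched as below. This is an assumption-laden illustration rather than UPP's actual code: the function name `split_by_length` is hypothetical, and "full-length" is defined here as lying within a chosen fraction of the median length (the 25% default is an illustrative choice).

```python
from statistics import median


def split_by_length(sequences, tolerance=0.25):
    """Partition unaligned sequences into full-length and remaining sets.

    sequences: dict mapping sequence name -> unaligned sequence string.
    A sequence counts as full-length when its length is within
    `tolerance` (a fraction) of the median length. The threshold is an
    illustrative assumption, not a value stated in the abstract.
    """
    med = median(len(seq) for seq in sequences.values())
    full_length, remaining = {}, {}
    for name, seq in sequences.items():
        if abs(len(seq) - med) <= tolerance * med:
            full_length[name] = seq
        else:
            remaining[name] = seq
    return full_length, remaining


# Made-up toy input: three near-full-length sequences and one fragment.
toy = {
    "s1": "A" * 100,
    "s2": "A" * 98,
    "s3": "A" * 104,
    "frag": "A" * 30,
}
backbone, queries = split_by_length(toy)
print(sorted(backbone), sorted(queries))
```

In a pipeline of this shape, the backbone set would be aligned first and the query set added afterwards; the ensemble of HMMs itself is beyond the scope of this sketch.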

Article Synopsis
  • The increase in sequence data availability has led biologists to aim for accurate phylogeny estimations for very large datasets, even involving hundreds of thousands of sequences.
  • Constructing these extensive phylogenies involves complex analytical and computational challenges, especially with high quantities of sequences.
  • Recent advancements include innovative methods for multiple sequence alignment, estimating species trees from multi-locus datasets, and integrating new sequences into existing trees, paving the way for future improvements in this field.
Article Synopsis
  • MAGUS is a new multiple sequence alignment method that excels in accuracy for large datasets by using a divide-and-conquer approach.
  • It divides sequences into smaller sets, aligns them individually, and merges these alignments with a technique known as Graph Clustering Method (GCM), which is effective for tackling the NP-hard MWT-AM problem.
  • The study shows that GCM can significantly enhance alignment accuracy, and there are improvements to GCM that provide even better results, suggesting a promising direction for large-scale MSA strategies.

Motivation: Rooted species trees are a basic model with multiple applications throughout biology, including understanding adaptation, biodiversity, phylogeography and co-evolution. Because most species tree estimation methods produce unrooted trees, methods for rooting these trees have been developed. However, most rooting methods either rely on prior biological knowledge or assume that evolution is close to clock-like, which is not usually the case.

Phylogenetic placement, the problem of placing a "query" sequence into a precomputed phylogenetic "backbone" tree, is useful for constructing large trees, performing taxon identification of newly obtained sequences, and other applications. The most accurate current methods, such as pplacer and EPA-ng, are based on maximum likelihood and require that the query sequence be provided within a multiple sequence alignment that includes the leaf sequences in the backbone tree. This approach enables high accuracy but also makes these likelihood-based methods computationally intensive on large backbone trees, and can even lead to them failing when the backbone trees are very large (e.

Species tree inference is a basic step in biological discovery, but discordance between gene trees creates analytical challenges and large data sets create computational challenges. Although there is generally some information available about the species trees that could be used to speed up the estimation, only one species tree estimation method that addresses gene tree discordance (ASTRAL-J, a recent development in the ASTRAL family of methods) is able to use this information. Here we describe two new methods, NJst-J and FASTRAL-J, that can estimate the species tree, given a partial knowledge of the species tree in the form of a nonbinary unrooted constraint tree.

Article Synopsis
  • Life on Earth has evolved from simple beginnings to complex systems, with bacteria, archaea, and eukaryotes contributing through metabolic and morphological innovations.
  • The Earth BioGenome Project aims to sequence the genomes of all 2 million named eukaryotic species to create a comprehensive digital library of life, enabling deeper understanding of evolution and biodiversity.
  • Sequencing all eukaryotic species will provide essential data to address key questions in phylogenetics, ecology, and conservation, while also enhancing knowledge in agriculture, bioindustry, and medicine.

Deep neural networks (DNNs) have been recently proposed for quartet tree phylogeny estimation. Here, we present a study evaluating recently trained DNNs in comparison to a collection of standard phylogeny estimation methods on a heterogeneous collection of datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution. Our study shows that using DNNs with quartet amalgamation is less accurate than several standard phylogeny estimation methods we explore (e.

Article Synopsis
  • Multiple sequence alignment (MSA) is crucial in bioinformatics for tasks like phylogeny estimation and protein structure prediction, but it's challenging with fragmented sequences from next-generation sequencing.
  • The study highlights MAGUS, an MSA method that is robust to fragmentary data, and introduces an improved two-stage approach that integrates MAGUS with ensembles of Hidden Markov Models (eHMMs), enhancing alignment accuracy.
  • The combination of MAGUS and eHMMs outperforms the previous best method, UPP, in aligning highly fragmented datasets, and implementation resources for both methods are available online.
Article Synopsis
  • Species tree inference faces challenges due to gene tree heterogeneity caused by gene duplication and loss, making accurate estimation difficult.
  • Current methods addressing this issue often require significant time and memory resources.
  • The new approach, DISCO, decomposes multi-copy gene family trees into single copy trees, improving accuracy in species tree estimation while being more efficient than existing methods.
Article Synopsis
  • BAli-Phy is a Bayesian method used for co-estimating sequence alignments and phylogenetic trees, but it typically works best with smaller datasets of about 100 sequences due to its high computational demands.
  • The authors adapt BAli-Phy to use a fixed phylogenetic tree estimated from unaligned sequences, allowing it to achieve higher accuracy than existing methods like Prank and MAFFT, while also being capable of handling larger datasets of up to 1000 sequences.
  • The datasets used for this study are available at the specified URL, and additional supplementary information can be found online at Bioinformatics.

One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method".
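The "divide" step of such a pipeline can be illustrated with a deliberately simple sketch. Published divide-and-conquer methods typically choose subsets using a guide tree or similar structure; the version below is only an assumption for illustration and slides a fixed-size window over an ordered taxon list so that consecutive subsets share a fixed number of taxa. The function name `overlapping_subsets` is hypothetical.

```python
def overlapping_subsets(taxa, size, overlap):
    """Split an ordered list of taxa into windows that share `overlap` taxa.

    Each subset has at most `size` taxa, and each consecutive pair of
    subsets has exactly `overlap` taxa in common. The shared taxa are
    what a supertree method would later use to merge the subset trees.
    """
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    subsets = []
    start = 0
    while True:
        subsets.append(taxa[start:start + size])
        if start + size >= len(taxa):
            break
        start += step
    return subsets


# Made-up taxon names t0..t9, windows of four taxa overlapping by two.
taxa = [f"t{i}" for i in range(10)]
subsets = overlapping_subsets(taxa, size=4, overlap=2)
for subset in subsets:
    print(subset)
```

Tree estimation on each subset and the supertree merge are separate steps that this sketch does not attempt.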

Article Synopsis
  • Scientists need good reference genomes to study biology, diseases, and protect wildlife, but there are only a few for non-microbial species.
  • The Genome 10K (G10K) group worked for five years to improve the way they create these high-quality genomes and gathered information from 16 different animal species.
  • Their work showed that special long-read technology improves genome quality, fixed errors in old genome sequences, and discovered new things about genes and chromosomes, leading to a new project to create complete genomes for about 70,000 vertebrate species.

Computer science has experienced dramatic growth and diversification over the last twenty years. Towards a current understanding of the structure of this discipline, we analyze a large sample of the computer science literature from the DBLP database. For insight on the features of this cohort and the relationship within its components, we have constructed article level clusters based on either direct citations or co-citations, and reconciled them with major and minor subject categories in the All Science Journal Classification (ASJC).
