Motivation: Gene trees often differ from the species trees that contain them due to various factors, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL). Several highly accurate species tree estimation methods have been introduced to explicitly address ILS, including ASTRAL, a widely used statistically consistent method, and wQFM, a quartet amalgamation approach experimentally shown to be more accurate than ASTRAL. Two recent advancements, ASTRAL-Pro and DISCO, have emerged in phylogenomics to consider GDL.
A terrace in a phylogenetic tree space is a region where all trees contain the same set of subtrees, due to certain patterns of missing data among the taxa sampled, resulting in an identical optimality score for a given data set. This was first investigated in the context of phylogenetic tree estimation from sequence alignments using maximum likelihood (ML) and maximum parsimony (MP). It was later extended to the species tree inference problem from a collection of gene trees, where a set of equally optimal species trees was referred to as a "pseudo" species tree terrace which does not consider the topological proximity of the trees in terms of the induced subtrees resulting from certain patterns of missing data.
Motivation: Proteins are responsible for most biological functions, many of which require the interaction of more than one protein molecule. However, accurately predicting protein-protein interaction (PPI) sites (the interfacial residues of a protein that interact with other protein molecules) remains a challenge. The growing demand and cost associated with the reliable identification of PPI sites using conventional experimental methods call for computational tools for automated prediction and understanding of PPIs.
Rice genetic diversity is regulated by multiple genes and is largely dependent on various environmental factors. Uncovering the genetic variations associated with the diversity in rice populations is the key to breeding stable and high-yielding rice varieties. We performed genome-wide association studies (GWASs) on seven rice yield traits (grain length, grain width, grain weight, panicle length, leaf length, leaf width, and leaf angle) based on a population of 183 rice landraces of Bangladesh.
Motivation: Analyzing large-scale single-cell transcriptomic datasets generated using different technologies is challenging due to the presence of batch-specific systematic variations known as batch effects. Since biological and technological differences are often interspersed, detecting and accounting for batch effects in RNA-seq datasets are critical for effective data integration and interpretation. Low-dimensional embeddings, such as principal component analysis (PCA), are widely used in visual inspection and estimation of batch effects.
Motivation: With the recent breakthroughs in sequencing technology, phylogeny estimation at a larger scale has become a huge opportunity. For accurate estimation of large-scale phylogenies, substantial effort is being devoted to introducing new algorithms or upgrading current approaches. In this work, we endeavor to improve the Quartet Fiduccia and Mattheyses (QFM) algorithm to resolve phylogenetic trees of better quality with better running time.
Motivation: Protein structure provides insight into how proteins interact with one another as well as their functions in living organisms. Prediction of protein backbone torsion angles (φ and ψ) is a key sub-problem in predicting protein structures. However, reliable determination of backbone torsion angles using conventional experimental methods is slow and expensive.
The advent of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing.
Estimating species trees from multiple genes is complicated and challenging due to . One of the basic approaches to understanding differences between gene trees and species trees is gene duplication and loss events. Minimize Gene Duplication and Loss (MGDL) is a popular technique for inferring species trees from gene trees when the gene trees are discordant due to gene duplications and losses.
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference.
Phylogenetic identification of unknown sequences by placing them on a tree is routinely attempted in modern ecological studies. Such placements are often obtained from incomplete and noisy data, making it essential to augment the results with some notion of uncertainty. While the standard likelihood-based methods designed for placement naturally provide such measures of uncertainty, the newer and more scalable distance-based methods lack this crucial feature.
Unlabelled: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy.
Background: High-throughput experimental technologies are generating tremendous amounts of genomic data, offering valuable resources to answer important questions and extract biological insights. Storing this sheer amount of genomic data has become a major concern in bioinformatics. General purpose compression techniques (e.
Multiple sequence alignment (MSA) is a prerequisite for several analyses in bioinformatics, such as phylogeny estimation and protein structure prediction. PASTA (Practical Alignments using SATé and TrAnsitivity) is a state-of-the-art method for computing MSAs, well known for its accuracy and scalability. It iteratively co-estimates both an MSA and a maximum likelihood (ML) phylogenetic tree.
Motivation: Protein-protein interactions (PPIs) are central to most biological processes. However, reliable identification of PPI sites using conventional experimental methods is slow and expensive. Therefore, great efforts are being put into computational methods to identify PPI sites.
IEEE/ACM Trans Comput Biol Bioinform
April 2023
Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time- and memory-consuming. As the scale of sequencing data grows, fast yet accurate alignment-free methods are becoming increasingly important. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationships among genes and species.
Motivation: Species tree estimation from genes sampled from throughout the whole genome is complicated due to the gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes for this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantee under ILS.
Background: Genomic Islands (GIs) are clusters of genes that are mobilized through horizontal gene transfer. GIs play a pivotal role in bacterial evolution as a mechanism of diversification and adaptation to different niches. Therefore, identification and characterization of GIs in bacterial genomes are important for understanding bacterial evolution.
Time series gene expression data is widely used to study different dynamic biological processes. Although gene expression datasets share many of the characteristics of time series data from other domains, most of the analyses in this field do not fully leverage the time-ordered nature of the data and focus on clustering the genes based on their expression values. Other domains, such as financial stock and weather prediction, utilize time series data for forecasting purposes.
Multiple sequence alignment (MSA) is a preliminary task for estimating phylogenies. It is used for homology inference among the sequences of a set of species. Generally, the MSA task is handled as a single-objective optimization process.
Background: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data.
Motivation: Protein structures provide basic insight into how they can interact with other proteins, their functions and biological roles in an organism. Experimental methods (e.g.
Background: Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets.
Understanding cell differentiation, the process of generating distinct cell types, plays a pivotal role in developmental and evolutionary biology. Transcriptomic information and epigenetic marks are useful to elucidate hierarchical developmental relationships among cell types. Standard phylogenetic approaches such as maximum parsimony, maximum likelihood and neighbor joining have previously been applied to ChIP-Seq histone modification data to infer cell-type trees, showing how diverse types of cells are related.
Algorithms Mol Biol
January 2018
Motivation: Species tree estimation from gene trees can be complicated by gene duplication and loss, and "gene tree parsimony" (GTP) is one approach for estimating species trees from multiple gene trees. In its standard formulation, the objective is to find a species tree that minimizes the total number of gene duplications and losses with respect to the input set of gene trees. Although much is known about GTP, little is known about how to treat inputs containing some (i.