Publications by authors named "Erin Molloy"

Cancer progression is an evolutionary process driven by the selection of cells adapted to gain growth advantage. We present a formal study on the adaptation of gene expression in subclonal evolution. We model evolutionary changes in gene expression as stochastic Ornstein-Uhlenbeck processes, jointly leveraging the evolutionary history of subclones and single-cell expression data.
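
As a rough illustration of the kind of process named above, an Ornstein-Uhlenbeck process pulls a value toward an optimum while adding random noise. The sketch below simulates one such path with a simple Euler-Maruyama step. The function name `simulate_ou` and all parameter values are hypothetical choices for illustration; they are not taken from the study, and this is not the authors' model-fitting method.

```python
import math
import random


def simulate_ou(x0, theta, mu, sigma, dt, n_steps, seed=0):
    """Simulate dX = theta * (mu - X) dt + sigma dW with Euler-Maruyama steps.

    theta: strength of the pull toward the optimum mu (assumed value)
    sigma: noise scale (assumed value)
    """
    rng = random.Random(seed)  # fixed seed so the path is reproducible
    x = x0
    path = [x0]
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Deterministic drift toward mu plus a Gaussian increment scaled by sqrt(dt).
        x = x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * noise
        path.append(x)
    return path


# Hypothetical example: expression starts at 0 and drifts toward an optimum of 5.
path = simulate_ou(x0=0.0, theta=2.0, mu=5.0, sigma=0.5, dt=0.01, n_steps=2000)
```

With `sigma=0.0` the path decays deterministically to `mu`; with noise it fluctuates around `mu` once the initial transient has passed.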

Article Synopsis
  • The study presents detailed genomes of six ape species, achieving high accuracy and complete sequencing of all their chromosomes.
  • It addresses complex genomic regions, leading to enhanced understanding of evolutionary relationships among these species.
  • The findings will serve as a crucial resource for future research on human evolution and our closest ape relatives.

The advent of long-read sequencing of microbiomes necessitates the development of new taxonomic profilers tailored to long-read shotgun metagenomic datasets. Here, we introduce Lemur and Magnet, a pair of tools optimized for lightweight and accurate taxonomic profiling for long-read shotgun metagenomic datasets. Lemur is a marker-gene-based method that leverages an EM algorithm to reduce false positive calls while preserving true positives; Magnet is a whole-genome read-mapping-based method that provides detailed presence and absence calls for bacterial genomes.

Article Synopsis
  • Cancer progression is an evolutionary process where certain cells adapt to grow better than others, leading to diverse subclones.
  • The study modeled how gene expression changes during this subclonal evolution as stochastic Ornstein-Uhlenbeck processes, using data from mouse melanoma cells.
  • Findings showed that different sublines exhibited unique gene expression patterns; resistant sublines adapted genes related to invasion, while responsive sublines focused on proliferation, highlighting non-genetic aspects of cancer evolution.

Purpose: String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize.

Methods: In this paper, we present CAPS-SA, a simple and scalable parallel algorithm for constructing these string indexes. It is inspired by samplesort and uses an LCP-informed mergesort.
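
To make the two string indexes concrete, here is a minimal sequential sketch that builds a suffix array by naive sorting and then derives the LCP array with Kasai's algorithm. This is not CAPS-SA and is not parallel; the function names are illustrative, and the naive sort is only suitable for short strings.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """Kasai's algorithm.

    lcp[r] is the length of the longest common prefix of the suffixes
    at ranks r - 1 and r in the suffix array; lcp[0] is defined as 0.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive text positions
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


sa = suffix_array("banana")
lcp = lcp_array("banana", sa)
print(sa)   # [5, 3, 1, 0, 4, 2]
print(lcp)  # [0, 1, 3, 0, 0, 2]
```

For "banana", the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", which gives the two arrays printed above.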

Most current studies rely on short-read sequencing to detect somatic structural variation (SV) in cancer genomes. Long-read sequencing offers the advantage of better mappability and long-range phasing, which results in substantial improvements in germline SV detection. However, current long-read SV detection methods do not generalize well to the analysis of somatic SVs in tumor genomes with complex rearrangements, heterogeneity, and aneuploidy.

Efficiency of mosquito-borne disease transmission is dependent upon both the preference and fidelity of mosquitoes as they seek the blood of vertebrate hosts. While mosquitoes select their blood hosts through multi-modal integration of sensory cues, host-seeking is primarily an odor-guided behavior. Differences in mosquito responses to hosts and their odors have been demonstrated to have a genetic component, but the underlying genomic architecture of these responses has yet to be fully resolved.

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set.
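
The quartet score mentioned above can be illustrated by brute force: for every set of four taxa, check whether each gene tree groups them the same way the candidate species tree does. The sketch below is only a scoring illustration under stated assumptions (trees given as nested tuples of leaf names, every tree on the same leaf set). It enumerates all quartets directly and does not implement ASTRAL's constrained dynamic programming; all function names are hypothetical.

```python
from itertools import combinations


def leaf_set(tree):
    """Leaves below a node; a leaf is a string, an internal node is a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets below every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return []
    found = [leaf_set(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def quartet_topology(tree_clades, four):
    """Return the split {{a, b}, {c, d}} the tree induces on four taxa, or None."""
    four = frozenset(four)
    for clade in tree_clades:
        inside = clade & four
        if len(inside) == 2:  # this clade separates two taxa from the other two
            return frozenset([inside, four - inside])
    return None  # unresolved on these four taxa


def quartet_score(species_tree, gene_trees):
    """Count gene-tree quartets that agree with the species tree."""
    species_clades = clades(species_tree)
    gene_clades = [clades(g) for g in gene_trees]
    score = 0
    for four in combinations(sorted(leaf_set(species_tree)), 4):
        target = quartet_topology(species_clades, four)
        if target is None:
            continue
        score += sum(quartet_topology(gc, four) == target for gc in gene_clades)
    return score


species = (("A", "B"), ("C", ("D", "E")))
gene2 = (("A", "C"), ("B", ("D", "E")))
print(quartet_score(species, [species, gene2]))  # 8
```

Here the first gene tree matches the species tree on all 5 quartets, and the second agrees on 3 of them, for a total of 8.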

Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally.

Recent studies exploring the impact of methylation in tumor evolution suggest that although the methylation status of many CpG sites is preserved across distinct lineages, that of others is altered as the cancer progresses. Because changes in the methylation status of a CpG site may be retained in mitosis, they could be used to infer the progression history of a tumor via single-cell lineage tree reconstruction. In this work, we introduce the first principled distance-based computational method, Sgootr, for inferring a tumor's single-cell methylation lineage tree and for jointly identifying lineage-informative CpG sites that harbor changes in methylation status that are retained along the lineage.

Summary methods are widely used to estimate species trees from genome-scale data. However, they can fail to produce accurate species trees when the input gene trees are highly discordant because of estimation error and biological processes, such as incomplete lineage sorting. Here, we introduce TREE-QMC, a new summary method that offers accuracy and scalability under these challenging scenarios.

Background: The Salmonella enterica serovar Newport red onion outbreak of 2020 was the largest foodborne outbreak of Salmonella in over a decade. The epidemiological investigation suggested two farms as the likely source of contamination. However, single nucleotide polymorphism (SNP) analysis of the whole genome sequencing data showed that none of the Salmonella isolates collected from the farm regions were linked to the clinical isolates, which prevented the use of phylogenetics in source identification.

Inferring the structure of human populations from genetic variation data is a key task in population and medical genomic studies. Although a number of methods for population structure inference have been proposed, current methods are impractical to run on biobank-scale genomic datasets containing millions of individuals and genetic variants. We introduce SCOPE, a method for population structure inference that is orders of magnitude faster than existing methods while achieving comparable accuracy.

A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation.

Motivation: Admixture, the interbreeding between previously distinct populations, is a pervasive force in evolution. The evolutionary history of populations in the presence of admixture can be modeled by augmenting phylogenetic trees with additional nodes that represent admixture events. While enabling a more faithful representation of evolutionary history, admixture graphs present formidable inferential challenges, and there is an increasing need for methods that are accurate, fully automated and computationally efficient.

One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method".

Motivation: Metagenomics has revolutionized microbiome research by enabling researchers to characterize the composition of complex microbial communities. Taxonomic profiling is one of the critical steps in metagenomic analyses. Marker genes, which are single-copy and universally found across Bacteria and Archaea, can provide accurate estimates of taxon abundances in the sample.

Phylogenomics, the estimation of species trees from multilocus data sets, is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this article, we address the challenge of estimating the species tree under GDL.

Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss.

Incremental tree building (INC) is a new phylogeny estimation method that has been proven to be absolute fast converging under standard sequence evolution models. A variant of INC, called Constrained-INC, is designed for use in divide-and-conquer pipelines for phylogeny estimation, where a set of species is divided into disjoint subsets, trees are computed on the subsets using a selected base method, and the subset trees are then combined. We evaluate the accuracy of INC and Constrained-INC for gene tree and species tree estimation on simulated datasets, and compare them to similar pipelines using NJMerge (another method that merges disjoint trees).

Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone, and thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed.

Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology.

After publication of [1], the authors were informed by John A. Rhodes of a counterexample to Theorem 11 of [1].

Motivation: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes.

DNA sequence alignments have provided the majority of data for inferring phylogenetic relationships with both concatenation and coalescent methods. However, DNA sequences are susceptible to extensive homoplasy, especially for deep divergences in the Tree of Life. Retroelement insertions have emerged as a powerful alternative to sequences for deciphering evolutionary relationships because these data are nearly homoplasy-free.
