Publications by authors named "Dan Gusfield"

Background: The basic RNA secondary structure prediction problem or single sequence folding problem (SSF) was solved 35 years ago by a now well-known [Formula: see text]-time dynamic programming method. Recently three methodologies-Valiant, Four-Russians, and Sparsification-have been applied to speedup RNA secondary structure prediction. The sparsification method exploits two properties of the input: the number of subsequence Z with the endpoints belonging to the optimal folding set and the maximum number base-pairs L.

View Article and Find Full Text PDF

Evolutionary data has been traditionally modeled via phylogenetic trees; however, branching alone cannot model conflicting phylogenetic signals, so networks are used instead. Ancestral recombination graphs (ARGs) are used to model the evolution of incompatible sets of SNP data, allowing each site to mutate only once. The model often aims to minimize the number of recombinations.

View Article and Find Full Text PDF

Background: The secondary structure that maximizes the number of non-crossing matchings between complimentary bases of an RNA sequence of length n can be computed in O(n3) time using Nussinov's dynamic programming algorithm. The Four-Russians method is a technique that reduces the running time for certain dynamic programming algorithms by a multiplicative factor after a preprocessing step where solutions to all smaller subproblems of a fixed size are exhaustively enumerated and solved. Frid and Gusfield designed an O(n3logn) algorithm for RNA folding using the Four-Russians technique.

View Article and Find Full Text PDF

: In this paper, we study the problem of constructing perfect phylogenies for three-state characters. Our work builds on two recent results. The first result states that for three-state characters, the local condition of examining all subsets of three characters is sufficient to determine the global property of admitting a perfect phylogeny.

View Article and Find Full Text PDF

The multistate perfect phylogeny problem is a classic problem in computational biology. When no perfect phylogeny exists, it is of interest to find a set of characters to remove in order to obtain a perfect phylogeny in the remaining data. This is known as the character removal problem.

View Article and Find Full Text PDF

A tanglegram is a pair of trees on the same set of leaves with matching leaves in the two trees joined by an edge. Tanglegrams are widely used in biology--to compare evolutionary histories of host and parasite species and to analyze genes of species in the same geographical area. We consider optimization problems in tanglegram drawings.

View Article and Find Full Text PDF

The Multi-State Perfect Phylogeny Problem is an extension of the Binary Perfect Phylogeny Problem, allowing characters to take on more than two states. In this article, we consider three problems that extend the utility of the multi-state perfect phylogeny model: (1) the Missing Data (MD) Problem, where some entries in the input are missing and the question is whether (bounded) values for the missing data can be imputed so that the resulting data has a multi-state perfect phylogeny; (2) the Character-Removal (CR) Problem, where we want to minimize the number of characters to remove from the data so that the resulting data has a multi-state perfect phylogeny; and (3) the Missing-Data Character-Removal (MDCR) Problem, where the input has missing data and we want to impute values for the missing data to minimize the solution to the resulting Character-Removal Problem. We discuss Integer Linear Programming (ILP) solutions to these problems for the special case of three, four, and five permitted states per character, and we report on extensive empirical testing of these solutions.

View Article and Find Full Text PDF

Background: The problem of computationally predicting the secondary structure (or folding) of RNA molecules was first introduced more than thirty years ago and yet continues to be an area of active research and development. The basic RNA-folding problem of finding a maximum cardinality, non-crossing, matching of complimentary nucleotides in an RNA sequence of length n, has an O(n3)-time dynamic programming solution that is widely applied. It is known that an o(n3) worst-case time solution is possible, but the published and suggested methods are complex and have not been established to be practical.

View Article and Find Full Text PDF

Phylogenetic networks are models of evolution that go beyond trees, incorporating non-tree-like biological events such as recombination (or more generally reticulation), which occur either in a single species (meiotic recombination) or between species (reticulation due to lateral gene transfer and hybrid speciation). The central algorithmic problems are to reconstruct a plausible history of mutations and non-tree-like events, or to determine the minimum number of such events needed to derive a given set of binary sequences, allowing one mutation per site. Meiotic recombination, reticulation and recurrent mutation can cause conflict or incompatibility between pairs of sites (or characters) of the input.

View Article and Find Full Text PDF

Meiotic recombination is a fundamental biological event and one of the principal evolutionary forces responsible for shaping genetic variation within species. In addition to its fundamental role, recombination is central to several critical applied problems. The most important example is "association mapping" in populations, which is widely hoped to help find genes that influence genetic diseases (Carlson et al.

View Article and Find Full Text PDF

A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since, almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach to this issue has been to try to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes.

View Article and Find Full Text PDF

A current major focus in genomics is the large-scale collection of genotype data in populations in order to detect variations in the population. The variation data are sought in order to address fundamental and applied questions in genetics that concern the haplotypes in the population. Since almost all the collected data is in the form of genotypes, but the downstream genetics questions concern haplotypes, the standard approach to this issue has been to try to first infer haplotypes from the genotypes, and then answer the downstream questions using the inferred haplotypes.

View Article and Find Full Text PDF

Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002 (Gusfield, 2002), the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper, we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented.

View Article and Find Full Text PDF

A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. With the growth of genomic data, much of which does not fit ideal tree models, there is greater need to understand the algorithmics and combinatorics of phylogenetic networks [10, 11]. However, to date, very little has been published on this, with the notable exception of the paper by Wang et al.

View Article and Find Full Text PDF

Motivation: We are interested in studying the evolution of DNA single nucleotide polymorphism sequences which have undergone (meiotic) recombination. For a given set of sequences, computing the minimum number of recombinations needed to explain the sequences (with one mutation per site) is a standard question of interest, but it has been shown to be NP-hard, and previous algorithms that compute it exactly work either only on very small datasets or on problems with special structure.

Results: In this paper, we present efficient, practical methods for computing both upper and lower bounds on the minimum number of needed recombinations, and for constructing evolutionary histories that explain the input sequences.

View Article and Find Full Text PDF

The problem of inferring haplotype phase from a population of genotypes has received a lot of attention recently. This is partly due to the observation that there are many regions on human genomic DNA where genetic recombination is rare (Helmuth, 2001; Daly et al., 2001; Stephens et al.

View Article and Find Full Text PDF

A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. In a seminal paper, Wang et al.(1) studied the problem of constructing a phylogenetic network, allowing recombination between sequences, with the constraint that the resulting cycles must be disjoint.

View Article and Find Full Text PDF

Pedigree analysis is a central component of many current efforts to locate genes that contribute to diseases or to valuable traits. The analysis usually involves solving one of two very computation-intense problems. We analyze the complexity of these two problems.

View Article and Find Full Text PDF

A full haplotype map of the human genome will prove extremely valuable as it will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. A haplotype map project has been announced by NIH. The biological key to that project is the surprising fact that some human genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected (Helmuth, 2001; Daly et al.

View Article and Find Full Text PDF

We have developed an efficient program, the Perfect Phylogeny Haplotyper (PPH) that takes in unphased population genotype data, and determines if that data can be explained by haplotype pairs that could have evolved on a perfect phylogeny.

View Article and Find Full Text PDF