Assessing the fit of data to model is fundamentally important in phylogenetic and evolutionary studies. Phylogenetic methods using molecular sequences typically start with a multiple alignment. If every site in every sequence has an unambiguous residue, the fit of the data to the model's expectations can be measured with, for example, the likelihood-ratio (G) test or the X² test. However, nearly all alignments of interest contain sites (columns of the alignment) with missing data, that is, ambiguous nucleotides, gaps, or unsequenced regions, and these columns must presently be removed before the above tests can be used. Unfortunately, this is often either undesirable or impractical, as it discards much of the data. Here, we show how iterative ML estimators may directly estimate the site-pattern probabilities for columns with missing data, given only standard i.i.d. assumptions. The optimization may use an EM or Newton algorithm, or any other hill-climbing approach. The resulting optimal likelihood under the unconstrained, or multinomial, model may be compared directly with the likelihood of the data under the evolutionary model (a G statistic). Alternatively, the modified observed and the expected frequencies of site patterns may be compared using an X² test. The distribution of such statistics is best assessed using appropriate simulations. The new method is applicable to models using codons or paired sites. The methods are also useful with Hadamard conjugations (spectral analysis) and are illustrated with these and with ML evolutionary models that allow site-rate variability.
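The estimation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy alignment given as a list of column strings, a `?` symbol for any missing residue, and missing data that simply hide the true residue, so that a partial column's probability is the sum over all complete site patterns compatible with it. The function names (`em_pattern_probs`, `log_likelihood`, `g_statistic`, `pearson_x2`) and the uniform `model` used for comparison are hypothetical placeholders; in practice the expected pattern probabilities would come from a fitted evolutionary model.

```python
from collections import Counter
from itertools import product
from math import log


def compatible(partial, full, missing="?"):
    """True if the complete pattern `full` agrees with every observed residue."""
    return all(a == missing or a == b for a, b in zip(partial, full))


def log_likelihood(columns, probs, missing="?"):
    """Log-likelihood of the columns; a partial column sums over compatible patterns."""
    total = 0.0
    for col, n in Counter(columns).items():
        mass = sum(p for pat, p in probs.items() if compatible(col, pat, missing))
        total += n * log(mass)
    return total


def em_pattern_probs(columns, alphabet="ACGT", missing="?", iters=500, tol=1e-12):
    """EM estimate of unconstrained (multinomial) site-pattern probabilities."""
    n_taxa = len(columns[0])
    patterns = ["".join(p) for p in product(alphabet, repeat=n_taxa)]
    counts = Counter(columns)
    n_sites = len(columns)
    probs = {pat: 1.0 / len(patterns) for pat in patterns}  # uniform start
    for _ in range(iters):
        # E-step: share each column's count among its compatible patterns.
        expected = dict.fromkeys(patterns, 0.0)
        for col, n in counts.items():
            comp = [pat for pat in patterns if compatible(col, pat, missing)]
            mass = sum(probs[pat] for pat in comp)
            for pat in comp:
                expected[pat] += n * probs[pat] / mass
        # M-step: renormalise the expected counts.
        new = {pat: expected[pat] / n_sites for pat in patterns}
        change = max(abs(new[pat] - probs[pat]) for pat in patterns)
        probs = new
        if change < tol:
            break
    return probs


def g_statistic(columns, p_unconstrained, p_model, missing="?"):
    """G = 2 * (lnL under the unconstrained model - lnL under the model)."""
    return 2.0 * (log_likelihood(columns, p_unconstrained, missing)
                  - log_likelihood(columns, p_model, missing))


def pearson_x2(p_unconstrained, p_model, n_sites):
    """X^2 comparing modified observed counts with model-expected counts."""
    return sum((n_sites * p_unconstrained[pat] - n_sites * e) ** 2 / (n_sites * e)
               for pat, e in p_model.items() if e > 0)


# Toy usage: two taxa, a two-letter alphabet, two columns with a missing residue.
columns = ["AA"] * 6 + ["CC"] * 2 + ["A?"] * 2
p_hat = em_pattern_probs(columns, alphabet="AC")
model = {pat: 0.25 for pat in p_hat}  # hypothetical stand-in for model expectations
print(g_statistic(columns, p_hat, model), pearson_x2(p_hat, model, len(columns)))
```

Enumerating every complete pattern is only workable for a handful of taxa, since the number of patterns grows exponentially; the sketch is meant to show the logic, not to scale. As the abstract notes, the null distribution of either statistic is best obtained by simulation rather than assumed.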
DOI: http://dx.doi.org/10.1093/molbev/msi002