This paper presents a tree-based method for grouping multinomial data according to their classification probability vectors. An initial tree is built by binary recursive partitioning: the multinomials are successively split into two subsets, with each split chosen to maximize the likelihood function. When the number of multinomials k is too large to consider every possible split, we propose ordering the multinomials first and then building the initial tree from the much smaller set of k-1 possible splits. The tree is then pruned from the bottom up. Pruning involves a sequence of hypothesis tests of a single homogeneous group against the alternative of two distinct, internally homogeneous groups. The Bayesian information criterion and the Wilcoxon rank-sum test are proposed as pruning criteria. The tree-based model is illustrated on genetic sequence data, where homogeneous groupings of sequences present new opportunities to understand and align them.
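The ordered-split step and a BIC-style pruning check can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the default ordering rule (sorting by the observed proportion in the first category), and the use of the total count as the BIC sample size are all assumptions made for illustration, since the abstract does not specify them. Each multinomial is a row of category counts. Once the rows are ordered, only the k-1 contiguous cut points are scanned, rather than all 2^(k-1) - 1 ways of dividing k items into two non-empty subsets.

```python
import math


def pooled_loglik(rows):
    """Maximized multinomial log-likelihood of rows that share one
    probability vector. The multinomial coefficients are omitted because
    they are identical under every grouping and cancel in comparisons."""
    totals = [sum(col) for col in zip(*rows)]
    n = sum(totals)
    return sum(t * math.log(t / n) for t in totals if t > 0)


def best_ordered_split(counts, order_key=None):
    """Scan the k-1 contiguous splits of the ordered rows and return
    (loglik, left_indices, right_indices) for the likelihood-maximizing one.
    The default ordering is a hypothetical choice, not taken from the paper."""
    if order_key is None:
        order_key = lambda row: row[0] / sum(row)
    idx = sorted(range(len(counts)), key=lambda i: order_key(counts[i]))
    best = None
    for cut in range(1, len(idx)):
        left = [counts[i] for i in idx[:cut]]
        right = [counts[i] for i in idx[cut:]]
        ll = pooled_loglik(left) + pooled_loglik(right)
        if best is None or ll > best[0]:
            best = (ll, idx[:cut], idx[cut:])
    return best


def bic_prefers_split(counts, split_loglik):
    """Compare one homogeneous group (c-1 free parameters) with two groups
    (2(c-1) free parameters). Using the total count as the BIC sample size
    is an assumption of this sketch."""
    c = len(counts[0])
    n = sum(sum(row) for row in counts)
    bic_one = -2 * pooled_loglik(counts) + (c - 1) * math.log(n)
    bic_two = -2 * split_loglik + 2 * (c - 1) * math.log(n)
    return bic_two < bic_one


# Four two-category multinomials forming two visibly different groups.
counts = [[90, 10], [85, 15], [10, 90], [12, 88]]
loglik, left, right = best_ordered_split(counts)
keep_split = bic_prefers_split(counts, loglik)
```

Applying `best_ordered_split` recursively to each resulting subset would grow the initial tree, and a check such as `bic_prefers_split` at each internal node, applied from the leaves upward, would play the role of the pruning step.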
DOI: http://dx.doi.org/10.1002/sim.2182