For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common interpretation and validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate.
Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means.
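The inflation described above can be reproduced with a small simulation. The sketch below is only a toy illustration under stated assumptions, not the method of the paper: the data are one-dimensional draws from a single standard normal distribution (so there are no true clusters), the clustering is a plain two-means procedure, and the test is a Welch-type test using a normal approximation for the p-value. The function names `two_means_1d`, `welch_p_value`, and `rejection_rates` are hypothetical helpers introduced for this example.

```python
import math
import random
import statistics


def two_means_1d(xs, iters=20):
    """Lloyd's algorithm with k=2 in one dimension; returns 0/1 labels."""
    c0, c1 = min(xs), max(xs)  # start the two centers at the extremes
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        c0, c1 = statistics.fmean(g0), statistics.fmean(g1)
    return labels


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0  # assumption: too few points for a variance, so no rejection
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rates(n_sims=500, n=40, alpha=0.05, seed=0):
    """Rejection rates under the null: clustered groups vs. groups fixed in advance."""
    rng = random.Random(seed)
    rej_clustered = 0
    rej_fixed = 0
    for _ in range(n_sims):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true group structure
        labels = two_means_1d(xs)
        a = [x for x, lab in zip(xs, labels) if lab == 0]
        b = [x for x, lab in zip(xs, labels) if lab == 1]
        rej_clustered += welch_p_value(a, b) < alpha
        # Groups defined before seeing the data: first half vs. second half.
        rej_fixed += welch_p_value(xs[: n // 2], xs[n // 2:]) < alpha
    return rej_clustered / n_sims, rej_fixed / n_sims


if __name__ == "__main__":
    clustered, fixed = rejection_rates()
    print(f"rejection rate, groups from clustering: {clustered:.3f}")
    print(f"rejection rate, groups fixed in advance: {fixed:.3f}")
```

In this setup the groups fixed in advance should be rejected at a rate close to the nominal level, while the groups produced by clustering should be rejected in nearly every simulation, because the clustering step itself separates the observations into a low group and a high group.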
For many applications, it is critical to interpret and validate groups of observations obtained via clustering. A common validation approach involves testing differences in feature means between observations in two estimated clusters. In this setting, classical hypothesis tests lead to an inflated Type I error rate.
Whole-chromosome aneuploidy and large segmental amplifications can have devastating effects in multicellular organisms, from developmental disorders and miscarriage to cancer. Aneuploidy in single-celled organisms such as yeast also results in proliferative defects and reduced viability. Yet, paradoxically, CNVs are routinely observed in laboratory evolution experiments with microbes grown in stressful conditions.
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control.
We consider conducting inference on the output of the Classification and Regression Tree (CART) (Breiman et al., 1984) algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage.
In this paper, we consider data consisting of multiple networks, each composed of a different edge set on a common set of nodes. Many models have been proposed for the analysis of such multiview network data under the assumption that the data views are closely related. We provide tools for evaluating this assumption.
In the Pioneer 100 (P100) Wellness Project, multiple types of data are collected on a single set of healthy participants at multiple timepoints in order to characterize and optimize wellness. One way to do this is to identify clusters, or subgroups, among the participants, and then to tailor personalized health recommendations to each subgroup. It is tempting to cluster the participants using all of the data types and timepoints, in order to fully exploit the available information.