Unsupervised clustering of single-cell RNA-sequencing data enables the identification of distinct cell populations. However, the most widely used clustering algorithms are heuristic and do not formally account for statistical uncertainty. We find that not addressing known sources of variability in a statistically rigorous manner can lead to overconfidence in the discovery of novel cell types. Here we extend a previous method, significance of hierarchical clustering, to propose a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations. We also adapt this approach to permit statistical assessment on the clusters reported by any algorithm. Finally, we extend these approaches to account for batch structure. We benchmarked our approach against popular clustering workflows, demonstrating improved performance. To show practical utility, we applied our approach to the Human Lung Cell Atlas and an atlas of the mouse cerebellar cortex, identifying several cases of over-clustering and recapitulating experimentally validated cell type definitions.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11282907PMC
http://dx.doi.org/10.1038/s41592-023-01933-9DOI Listing

Publication Analysis

Top Keywords

significance analysis
8
analysis clustering
8
clustering single-cell
8
single-cell rna-sequencing
8
rna-sequencing data
8
distinct cell
8
cell populations
8
clustering
6
cell
5
data unsupervised
4

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!