Semi-supervised adaptive-height snipping of the hierarchical clustering tree.

BMC Bioinformatics

Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.

Published: January 2015

Background: In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters; they compute a hierarchical representation of the data set. Usually, the HC tree is cut at a fixed height, and each contiguous branch of samples below that height is considered a separate cluster. Because the cut height is fixed, those clusters may miss functionally coherent subgroups hidden deeper in the tree. In addition, most existing approaches do not use available clinical information to guide cluster extraction from the HC tree, so the identified subgroups may be difficult to interpret in relation to that information.
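The fixed-height cut described above can be illustrated with a minimal sketch. The `Node` class, the helper names, and the toy merge heights are all hypothetical and chosen only for illustration; they are not taken from the paper or from the HCsnip package.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One node of a toy HC tree (hypothetical structure for illustration)."""
    height: float                    # merge height; 0.0 for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None      # sample name, set only on leaves


def leaves(node: Node) -> List[str]:
    """Return the sample names under a node, left to right."""
    if node.label is not None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)


def cut_at_height(node: Node, h: float) -> List[List[str]]:
    """Fixed-height cut: each maximal branch merged at or below h is a cluster."""
    if node.height <= h:
        return [leaves(node)]
    return cut_at_height(node.left, h) + cut_at_height(node.right, h)


# Toy dendrogram with made-up merge heights.
a, b, c, d, e = (Node(0.0, label=s) for s in "abcde")
ab = Node(1.0, a, b)
cd = Node(3.0, c, d)
cde = Node(4.0, cd, e)
root = Node(6.0, ab, cde)

clusters_high = cut_at_height(root, 5.0)
clusters_low = cut_at_height(root, 2.0)
print(clusters_high)
print(clusters_low)
```

In this toy tree the cut at height 5.0 returns `c`, `d` and `e` as one cluster and hides the tighter `c`-`d` pair, while the cut at 2.0 breaks that branch into singletons. A single global height applies the same threshold to every branch, which is the limitation the abstract points to.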

Results: We develop a novel framework, called guided piecewise snipping, for decomposing the HC tree into clusters in a semi-supervised way, using both molecular data and clinical information. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) that not only represents the structure deemed to underlie the data from which the HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated data and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches.
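The idea of snipping at variable heights under clinical guidance can be sketched as follows. This is an illustrative stand-in, not the HCsnip algorithm: the abstract does not specify the scoring criterion or the search strategy, so the sketch assumes a brute-force enumeration of all partitions the tree allows and scores each by the Rand index against a hypothetical binary clinical label. The `Node` class, the toy tree, and the `clinical` labels are all made up for illustration.

```python
from dataclasses import dataclass
from itertools import combinations, product
from typing import Dict, List, Optional


@dataclass
class Node:
    """One node of a toy HC tree (hypothetical structure for illustration)."""
    height: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None      # sample name, set only on leaves


def leaves(node: Node) -> List[str]:
    if node.label is not None:
        return [node.label]
    return leaves(node.left) + leaves(node.right)


def tree_partitions(node: Node) -> List[List[List[str]]]:
    """All partitions obtainable by snipping branches below this node.

    Either the whole branch is kept as one cluster, or it is split and each
    child is partitioned independently. The count grows exponentially with
    tree size, so this enumeration is only feasible for toy examples.
    """
    whole = [[leaves(node)]]
    if node.label is not None:
        return whole
    split = [lp + rp for lp, rp in product(tree_partitions(node.left),
                                           tree_partitions(node.right))]
    return whole + split


def rand_index(partition: List[List[str]], labels: Dict[str, str]) -> float:
    """Fraction of sample pairs on which the partition and the labels agree."""
    cluster_of = {s: i for i, cl in enumerate(partition) for s in cl}
    pairs = list(combinations(sorted(cluster_of), 2))
    agree = sum((cluster_of[x] == cluster_of[y]) == (labels[x] == labels[y])
                for x, y in pairs)
    return agree / len(pairs)


# Toy tree: the left branch merges low, the right branch merges high.
a, b, c, d, e, f = (Node(0.0, label=s) for s in "abcdef")
left = Node(2.0, Node(1.0, a, b), c)
right = Node(5.0, Node(3.0, d, e), f)
root = Node(6.0, left, right)

# Hypothetical clinical labels (assumption, e.g. a binary outcome).
clinical = {"a": "good", "b": "good", "c": "poor",
            "d": "poor", "e": "poor", "f": "good"}

candidates = tree_partitions(root)
best = max(candidates, key=lambda p: rand_index(p, clinical))
best_score = rand_index(best, clinical)
print(best, round(best_score, 3))
```

In this toy tree the selected partition separates `a`, `b` from `c` (a snip below height 2.0 on the left) while keeping `d`, `e` together (a snip at or above 3.0 on the right), so no single fixed height reproduces it. The number of clusters also falls out of the search rather than being set in advance.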

Conclusions: The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic and can be combined with other algorithms that operate on detected clusters. The approach represents an advancement in two regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree, possibly at variable heights, while preserving the HC tree structure; (2) a flexible implementation that allows a variety of data types for both building and snipping the HC tree, including patient follow-up data such as survival as auxiliary information. The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R package HCsnip.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302100
DOI: http://dx.doi.org/10.1186/s12859-014-0448-1

Similar Publications

HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree.

Cancer Inform

April 2015

Department of Epidemiology and Biostatistics, Vrije Universiteit Medical Center, Amsterdam, The Netherlands; Department of Mathematics, Vrije Universiteit, Amsterdam, The Netherlands.

Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster.
