CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data.

J King Saud Univ Comput Inf Sci

School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK.

Published: October 2023

AI Article Synopsis

  • The study addresses three challenges in gene expression data analysis (high dimensionality, limited sample sizes, and feature redundancy) by proposing a new algorithm called Clustering-Guided Unsupervised Feature Selection (CGUFS).
  • CGUFS offers three key improvements: an adaptive strategy for assigning cluster pseudo-labels, a feature grouping method to handle redundancy, and an adaptive filtering strategy to retain the most relevant features.
  • Experimental results show that classifiers trained on the features selected by CGUFS outperform those trained on features selected by existing algorithms: the C4.5 classifier reaches an average accuracy of 74.37%, and the AdaBoost classifier also improves significantly.

Article Abstract

Aim: Gene expression data are typically high-dimensional, with a limited number of samples, and contain many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms focus primarily on how much each feature contributes to preserving the data structure and do not take the redundancy among features into account. Determining the appropriate number of significant features is another challenge.

Method: In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive strategy that determines this value and assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy that groups highly redundant features together. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines which feature combinations to retain by calculating the potentially effective features and the potentially redundant features of each feature group.
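The abstract does not specify how redundancy is measured or how groups are formed, so the following is only a minimal sketch of the general idea of grouping redundant features and keeping one per group. It assumes absolute Pearson correlation above a fixed threshold as the redundancy criterion, a greedy seed-based grouping, and externally supplied relevance scores. The function names, the threshold of 0.9, and the toy data are all hypothetical and are not taken from the CGUFS paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def group_redundant_features(columns, threshold=0.9):
    """Greedy grouping (an assumption, not the paper's rule): each ungrouped
    feature seeds a group and absorbs every later ungrouped feature whose
    absolute correlation with the seed is at least `threshold`."""
    groups = []
    assigned = set()
    for i in range(len(columns)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(columns)):
            if j not in assigned and abs(pearson(columns[i], columns[j])) >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups

def pick_representatives(groups, relevance):
    """Keep the single feature with the highest relevance score from each group."""
    return [max(group, key=lambda f: relevance[f]) for group in groups]

# Toy data: 4 genes, each measured on the same 5 samples.
genes = [
    [1.0, 2.0, 3.0, 4.0, 5.0],       # gene 0
    [2.1, 4.0, 6.2, 8.1, 9.9],       # gene 1: close to a multiple of gene 0
    [5.0, 1.0, 4.0, 2.0, 3.0],       # gene 2: weakly related to gene 0
    [-1.0, -2.0, -3.0, -4.0, -5.0],  # gene 3: negated gene 0
]
relevance = [0.4, 0.7, 0.5, 0.1]     # hypothetical relevance scores
groups = group_redundant_features(genes)
selected = pick_representatives(groups, relevance)
print(groups)    # [[0, 1, 3], [2]]
print(selected)  # [1, 2]
```

In this toy example, genes 0, 1, and 3 carry nearly the same information and fall into one group, so only the most relevant of them is kept. CGUFS itself goes further by adaptively deciding how many features to retain from each group.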

Result: Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly better than the values obtained with existing algorithms.
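For readers unfamiliar with the two metrics, here is a minimal sketch of how ACC and MCC can be computed for a binary classification task. The abstract does not say whether the datasets are binary or multi-class, nor how the MCC was converted to a percentage, so the binary formula, the function names, and the toy labels are all assumptions made for illustration.

```python
from math import sqrt

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcc_binary(y_true, y_pred, positive=1):
    """Binary Matthews correlation coefficient, in the range [-1, 1].
    Returns 0.0 when any marginal count is zero (a common convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy labels: 3 true positives, 3 true negatives, 1 false negative, 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
mcc = mcc_binary(y_true, y_pred)
print(acc)  # 0.75
print(mcc)  # 0.5
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative on imbalanced class distributions.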

Conclusion: The average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are likewise significantly better than those obtained with existing algorithms. In addition, statistical tests show significant differences between the CGUFS algorithm and the existing algorithms.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7615789
DOI: http://dx.doi.org/10.1016/j.jksuci.2023.101731

