Optimal clustering with missing values.

BMC Bioinformatics

Department of Electrical and Computer Engineering, Texas A&M University, MS3128 TAMU, College Station, 77843, TX, USA.

Published: June 2019

Background: Missing values frequently arise in modern biomedical studies for various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goal is to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values and then apply the clustering algorithm to the completed data.
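The impute-then-cluster practice described above can be illustrated with a minimal sketch. It assumes missing entries are encoded as NaN and uses column-mean imputation, which is only one of many possible imputation schemes; the function name `impute_column_means` is hypothetical and not taken from the article.

```python
import numpy as np

def impute_column_means(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)        # per-feature mean over observed entries
    filled = X.copy()
    rows, cols = np.where(np.isnan(filled))  # positions of the missing entries
    filled[rows, cols] = col_means[cols]
    return filled

# Toy data (illustrative only): 3 points, 2 features, NaN marks a missing value.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_completed = impute_column_means(X)
# X_completed can now be handed to any standard clustering algorithm.
```

The clustering step then treats imputed entries as if they had been observed, which is the pre-processing dependence that the article's approach is designed to avoid.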

Results: We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missing-value problem fits neatly into the overall framework of optimal clustering by incorporating the missing value mechanism into the random labeled point process and then marginalizing out the missing-value process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and real-world RNA-seq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches.
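As a rough illustration of why the Gaussian model is convenient here, the sketch below evaluates the density of a partially observed point by marginalizing out its missing coordinates. For a Gaussian, this reduces to using the sub-vector of the mean and the sub-matrix of the covariance that correspond to the observed coordinates. This is a standard property of the multivariate normal, not the article's full clustering operator; the NaN encoding and the function name `observed_gaussian_logpdf` are assumptions made for illustration.

```python
import numpy as np

def observed_gaussian_logpdf(x, mean, cov):
    """Log-density of the observed entries of x under N(mean, cov).

    Missing entries are NaN. Marginalizing them out of a Gaussian leaves
    a Gaussian on the observed coordinates, with the matching sub-vector
    of the mean and sub-matrix of the covariance.
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    obs = ~np.isnan(x)                     # boolean mask of observed features
    d = int(obs.sum())
    if d == 0:
        return 0.0                         # nothing observed: the marginal integrates to 1
    diff = x[obs] - mean[obs]
    cov_oo = cov[np.ix_(obs, obs)]         # observed-by-observed covariance block
    _, logdet = np.linalg.slogdet(cov_oo)
    maha = diff @ np.linalg.solve(cov_oo, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# Illustrative parameters (not from the article): correlated 2-D Gaussian.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.5],
                [0.5, 2.0]])
logp_partial = observed_gaussian_logpdf([0.0, np.nan], mean, cov)
```

Because the covariance block is selected with the observed-feature mask, the same function handles arbitrary covariance structures and any pattern of missing coordinates without an imputation step.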

Conclusion: Optimal clustering with missing values obviates the need for imputation-based pre-processing of the data while also achieving smaller clustering errors.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6584727
DOI: http://dx.doi.org/10.1186/s12859-019-2832-3
