Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework.

Valentin Voillet, Philippe Besse, Laurence Liaubet, Magali San Cristobal, Ignacio González

BMC Bioinformatics

INRA UR875 Mathématiques et Informatiques Appliquées, F-31326, Castanet-Tolosan, France.

Published: October 2016

Background: In omics data integration studies, it is common, for a variety of reasons, for some individuals to not be present in all data tables. Missing row values are challenging to deal with because most statistical methods cannot be directly applied to incomplete datasets. To overcome this issue, we propose a multiple imputation (MI) approach in a multivariate framework. In this study, we focus on multiple factor analysis (MFA) as a tool to compare and integrate multiple layers of information. MI involves filling the missing rows with plausible values, resulting in M completed datasets. MFA is then applied to each completed dataset to produce M different configurations (the matrices of coordinates of individuals). Finally, the M configurations are combined to yield a single consensus solution.
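The three steps described above (impute, analyse, combine) can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' R implementation: the abstract does not say how the plausible values are drawn or how the M configurations are combined, so the random donor-row imputation and the Procrustes-then-average consensus used here are stand-ins chosen for simplicity. All function names (`impute_missing_rows`, `mfa_coordinates`, `procrustes_align`, `mi_mfa`) are hypothetical.

```python
import numpy as np


def impute_missing_rows(X, rng):
    """Fill each fully missing row (all NaN) with a randomly drawn observed row.

    Assumption: a simple donor-based draw, used only for illustration.
    """
    X = X.copy()
    missing = np.isnan(X).all(axis=1)
    donors = np.flatnonzero(~missing)
    for i in np.flatnonzero(missing):
        X[i] = X[rng.choice(donors)]
    return X


def mfa_coordinates(tables, n_components=2):
    """MFA computed as a weighted PCA.

    Each table is standardized, divided by its first singular value so that
    no single table dominates, and the tables are then concatenated.
    """
    blocks = []
    for X in tables:
        Xc = X - X.mean(axis=0)
        sd = Xc.std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant columns
        Xs = Xc / sd
        first_sv = np.linalg.svd(Xs, compute_uv=False)[0]
        blocks.append(Xs / first_sv)
    Z = np.hstack(blocks)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * s[:n_components]  # coordinates of individuals


def procrustes_align(target, config):
    """Rotate/reflect `config` onto `target` (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(config.T @ target)
    return config @ (U @ Vt)


def mi_mfa(tables, M=20, n_components=2, seed=0):
    """Return a consensus configuration and the M aligned configurations."""
    rng = np.random.default_rng(seed)
    configs = []
    for _ in range(M):
        completed = [impute_missing_rows(X, rng) for X in tables]
        configs.append(mfa_coordinates(completed, n_components))
    # Assumption: consensus = mean of configurations aligned to the first one.
    aligned = [configs[0]] + [procrustes_align(configs[0], F) for F in configs[1:]]
    return np.mean(aligned, axis=0), aligned


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    table_a = rng.normal(size=(15, 6))
    table_b = rng.normal(size=(15, 4))
    table_b[[3, 9]] = np.nan  # two individuals absent from the second table
    consensus, aligned = mi_mfa([table_a, table_b], M=10)
    print(consensus.shape)
```

The alignment step is needed because factor coordinates are only defined up to rotation and reflection, so the M configurations cannot be averaged directly.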

Results: We assessed the performance of our method, named MI-MFA, on two real omics datasets. Incomplete artificial datasets with different patterns of missingness were created from these data. The MI-MFA results were compared with those of two other approaches: regularized iterative MFA (RI-MFA) and mean variable imputation (MVI-MFA). For each configuration resulting from these three strategies, the suitability of the solution was determined against the true MFA configuration obtained from the original data. A comprehensive graphical comparison was also produced, showing how the MI-, RI- and MVI-MFA configurations diverge from the true configuration. Two approaches to visualizing and assessing the uncertainty due to missing values, confidence ellipses and convex hulls, were also described. We showed how the areas of the ellipses and convex hulls increased with the number of missing individuals. Free, easy-to-use code was provided to implement the MI-MFA method in the R statistical environment.
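The convex-hull measure of uncertainty can be illustrated with a short standard-library Python sketch. It assumes that, for one individual, the M imputed configurations give M points in the plane of the first two components, and that the area of the convex hull of those points summarizes how much the position varies across imputations. How the authors compute these areas is not stated in the abstract; the function names `convex_hull` and `polygon_area` are hypothetical.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


if __name__ == "__main__":
    # Hypothetical positions of one individual across five imputations.
    positions = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
    print(polygon_area(convex_hull(positions)))
```

Under this reading, a larger hull area for an individual indicates that its coordinates are less well determined by the observed data.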

Conclusions: We believe that MI-MFA provides a useful and attractive method for estimating the coordinates of individuals on the first MFA components despite missing rows. MI-MFA configurations were close to the true configuration even when many individuals were missing in several data tables. This method takes into account the uncertainty of MI-MFA configurations induced by the missing rows, thereby allowing the reliability of the results to be evaluated.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5048483
DOI: http://dx.doi.org/10.1186/s12859-016-1273-5

Similar Publications

Accurate predictions on small data with a tabular foundation model.

Nature

January 2025

Machine Learning Lab, University of Freiburg, Freiburg, Germany.

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer

Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories, gradient-boosted decision trees have dominated tabular data for the past 20 years.

Empirical Bayes Linked Matrix Decomposition.

Mach Learn

October 2024

Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, 55455, MN, USA.

Eric F Lock

Data for several applications in diverse fields can be represented as multiple matrices that are linked across rows or columns. This is particularly common in molecular biomedical research, in which multiple molecular "omics" technologies may capture different feature sets (e.g.

A new species of green pitviper of the Trimeresurus macrops complex (Reptilia: Serpentes: Viperidae) from South Central Coastal Region of Vietnam.

Zootaxa

July 2024

Department of Vertebrate Zoology; Lomonosov Moscow State University; Leninskiye Gory; GSP-1; Moscow 119991; Russia; Joint Vietnam - Russia Tropical Science and Technology Research Center; Nghia Do; Cau Giay; Hanoi; Vietnam.

Sabira S Idiiatullina, Tan VAN Nguyen, Andrey M Bragin, Parinya Pawangkhanant, Dac Xuan LE

Article Synopsis

A new species of green pitviper named Trimeresurus cyanolabris sp. nov. has been identified in southern and central coastal Vietnam, based on both physical features and DNA analysis.
This species is characterized by its small size (maximum 638 mm), distinct scale patterns, and unique coloration, including bright yellow eyes and blue shades on its throat and chin.
The study highlights the phylogenetic separation of this species from similar ones, particularly T. rubeus, and emphasizes the ecological significance of its habitat in tropical forests, which is threatened by deforestation.

Fast matrix completion in epigenetic methylation studies with informative covariates.

Biostatistics

October 2024

Department of Mathematics, Université du Québec à Montreal, 201, Ave Président-Kennedy Montreal (QC), H2X 3Y7 Montreal, Canada.

Mélina Ribaud, Aurélie Labbe, Khaled Fouda, Karim Oualkacha

DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, the issue of missing values is an important one, and appropriate imputation techniques are important in avoiding an unnecessary sample size reduction as well as to optimally leverage the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples is processed using more affordable low-density, array-based technologies.

A new genus and four new species of Darnini (Hemiptera: Membracidae) from South America.

Zootaxa

February 2024

USDA Agricultural Research Service; Systematic Entomology Laboratory; c/o NMNH; MRC-168; Smithsonian Institution; P.O. Box 37012; Washington; DC; 20013-7012; U.S.A.

Laura González-Mozo, Stuart H McKamey

The new genus Polyodontotrochus is described and illustrated with four new species: P. auriculatus from French Guiana, P. elevatus (type species) from Ecuador, P.
