CMF-Impute: an accurate imputation tool for single-cell RNA-seq data.

Junlin Xu, Lijun Cai, Bo Liao, Wen Zhu, JiaLiang Yang

Bioinformatics

School of Mathematics and Statistics, Hainan Normal University, Haikou 570100, P.R. China.

Published: May 2020

Motivation: Single-cell RNA-sequencing (scRNA-seq) technology provides a powerful tool for investigating cell heterogeneity and cell subpopulations by allowing the quantification of gene expression at the single-cell level. However, scRNA-seq data analysis remains challenging because of various sources of technical noise, such as dropout events (i.e. excessive zero counts in the expression matrix).

Results: By taking into consideration the associations among cells and genes, we propose a novel collaborative matrix factorization-based method called CMF-Impute to impute the dropout entries of a given scRNA-seq expression matrix. We test CMF-Impute and compare it with five other state-of-the-art methods on six popular real scRNA-seq datasets of various sizes and three simulated datasets. For simulated datasets, CMF-Impute outperforms the other methods, imputing dropout values closest to the original expression values as evaluated by both the sum of squared error and the Pearson correlation coefficient. For real datasets, CMF-Impute achieves the most accurate cell classification results regardless of the choice of clustering method, such as SC3 or t-SNE followed by K-means, as evaluated by both the adjusted Rand index and normalized mutual information. Finally, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation, and in inferring cell lineage trajectories.
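The two simulation metrics named above can be illustrated with a minimal Python sketch. It is not taken from the CMF-Impute package (which is written in Matlab); the function names, the boolean dropout mask and the toy matrices are all assumptions made for illustration. The sketch assumes the metrics are computed only over entries that were artificially dropped out, which the abstract does not specify.

```python
import math


def masked_values(matrix, mask):
    """Flatten the entries of `matrix` where `mask` is True (hypothetical helper)."""
    return [
        value
        for row, mask_row in zip(matrix, mask)
        for value, flagged in zip(row, mask_row)
        if flagged
    ]


def sum_squared_error(original, imputed, mask):
    """Sum of squared differences over the masked (dropout) entries."""
    truth = masked_values(original, mask)
    guess = masked_values(imputed, mask)
    return sum((t - g) ** 2 for t, g in zip(truth, guess))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data (assumed, not from the paper): rows are genes, columns are cells.
original = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
imputed = [[1.0, 2.5, 3.0], [3.0, 5.0, 6.5]]
dropout_mask = [[False, True, False], [True, False, True]]

sse = sum_squared_error(original, imputed, dropout_mask)
r = pearson(
    masked_values(original, dropout_mask),
    masked_values(imputed, dropout_mask),
)
print(f"SSE = {sse:.3f}, Pearson r = {r:.3f}")
```

A lower sum of squared error and a Pearson coefficient closer to 1 both indicate imputed values that track the original expression more closely.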

Availability and Implementation: CMF-Impute is written as a MATLAB package, which is available at https://github.com/xujunlin123/CMFImpute.git.

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btaa109


