Integrating single-cell RNA-seq datasets with substantial batch effects.

Karin Hrovatin, Amir Ali Moinfar, Luke Zappia, Alejandro Tejada Lapuerta, Ben Lengerich, Manolis Kellis, Fabian J Theis

bioRxiv

Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany.

Published: February 2024

Integration of single-cell RNA-sequencing (scRNA-seq) datasets has become a standard part of analysis, with conditional variational autoencoders (cVAEs) among the most popular approaches. Increasingly, researchers want to map cells across challenging cases such as different organs, species, or organoids and primary tissue, as well as across scRNA-seq protocols, including single-cell and single-nuclei. Current computational methods struggle to harmonize datasets with such substantial differences, whether driven by technical or biological variation. Here, we propose to address these challenges for the popular cVAE-based approaches by introducing and comparing a series of regularization constraints. The two commonly used strategies for increasing batch correction in cVAEs, namely tuning the Kullback-Leibler divergence (KL) regularization strength and adversarial learning, suffer from substantial loss of biological information. Therefore, we adapt, implement, and assess alternative regularization strategies for cVAEs and investigate how they improve batch effect removal or better preserve biological variation, enabling us to propose an optimal cVAE-based integration strategy for complex systems. We show that using a VampPrior instead of the commonly used Gaussian prior improves not only the preservation of biological variation but also, unexpectedly, batch correction. Moreover, we show that our implementation of cycle-consistency loss leads to significantly better biological preservation than adversarial learning as implemented in the previously proposed GLUE model. Additionally, we do not recommend relying only on KL regularization strength tuning to increase batch correction, as it removes both biological and batch information without discriminating between the two. Based on our findings, we propose a new model that combines VampPrior and cycle-consistency loss. We show that using it for datasets with substantial batch effects improves downstream interpretation of cell states and biological conditions. To ease the use of the newly proposed model, we make it available in the scvi-tools package as an external model named sysVI. Moreover, in the future, these regularization techniques could be added to other established cVAE-based models to improve the integration of datasets with substantial batch effects.
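To make the contrast between a standard Gaussian prior and a VampPrior more concrete, the following is a minimal illustrative sketch, not the authors' implementation and not the sysVI API. It assumes the VampPrior can be treated as an equally weighted mixture of diagonal Gaussians; in an actual model the component means and standard deviations would come from encoding learnable pseudo-inputs, whereas here they are hypothetical fixed values. The function names (`gaussian_log_density`, `mixture_log_density`, `monte_carlo_kl`) are invented for this example. Because the KL term against a mixture has no closed form, it is estimated by Monte Carlo sampling.

```python
import math
import random


def gaussian_log_density(z, mean, std):
    """Log density of a diagonal Gaussian evaluated at the point z."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(z, mean, std)
    )


def mixture_log_density(z, means, stds):
    """Log density of an equally weighted Gaussian mixture (VampPrior-style).

    Assumption: means/stds stand in for encoder outputs on pseudo-inputs.
    """
    logs = [gaussian_log_density(z, m, s) for m, s in zip(means, stds)]
    top = max(logs)
    # Log-sum-exp for numerical stability, then average over the components.
    return top + math.log(sum(math.exp(v - top) for v in logs)) - math.log(len(logs))


def monte_carlo_kl(post_mean, post_std, means, stds, n_samples=2000, seed=0):
    """Monte Carlo estimate of KL(q(z|x) || p(z)) for a mixture prior p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Sample z from the approximate posterior q(z|x).
        z = [rng.gauss(m, s) for m, s in zip(post_mean, post_std)]
        total += gaussian_log_density(z, post_mean, post_std)
        total -= mixture_log_density(z, means, stds)
    return total / n_samples


# Hypothetical 2-D posterior for one cell, compared against two priors.
posterior_mean, posterior_std = [2.0, -1.0], [0.5, 0.5]
kl_standard = monte_carlo_kl(posterior_mean, posterior_std, [[0.0, 0.0]], [[1.0, 1.0]])
kl_mixture = monte_carlo_kl(
    posterior_mean, posterior_std,
    [[0.0, 0.0], [2.0, -1.0]], [[1.0, 1.0], [0.5, 0.5]],
)
print(f"KL vs standard normal: {kl_standard:.3f}")
print(f"KL vs two-component mixture: {kl_mixture:.3f}")
```

In this toy setting a single standard normal corresponds to the usual Gaussian prior, while adding components lets the prior place mass near where cells actually fall in latent space, so the KL penalty need not pull every cell toward the origin.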

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635119
DOI: http://dx.doi.org/10.1101/2023.11.03.565463
