Over the past few decades, there has been an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining data sets to achieve unprecedented sample sizes, spatial coverage or temporal replication in population genomic studies. However, a common concern is that nonbiological differences between data sets may generate patterns of variation in the data that can confound real biological patterns, a problem known as batch effects. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a "batch-effect-naive" bioinformatic pipeline, batch effects systematically biased our genetic diversity estimates, population structure inference and selection scans. We then demonstrate that these batch effects resulted from multiple technical differences between our data sets, including the sequencing chemistry (four-channel vs. two-channel), sequencing run, read type (single-end vs. paired-end), read length (125 vs. 150 bp), DNA degradation level (degraded vs. well preserved) and sequencing depth (0.8× vs. 0.3× on average). Lastly, we illustrate that a set of simple bioinformatic strategies (such as different read trimming and single nucleotide polymorphism filtering) can be used to detect batch effects in our data and substantially mitigate their impact. We conclude that combining data sets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.

Source: http://dx.doi.org/10.1111/1755-0998.13559
