Comparison of sequencing data processing pipelines and application to underrepresented African human populations.

BMC Bioinformatics

Human Evolution, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18C, 752 36, Uppsala, Sweden.

Published: October 2021

Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. Sequencing technologies and bioinformatic tools abound, and the number of available genomes keeps increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, such studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges arising from human diversity and stratification.

Results: We surveyed 29 studies using high-throughput sequencing data and compared their strategies for data pre-processing and variant calling. We found that data processing varies widely across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline that differed in the inclusion of an indel realignment step and in a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals and compared them in terms of the number of called variants and the overlap of the callsets. The pipelines produced similar callsets, particularly after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. Including more individuals at the joint genotyping step changed the variant counts. At the individual level, average genome coverage was correlated with the number of variants called.
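The two comparisons described in the Results (overlap between callsets, and the relationship between coverage and variant count) can be illustrated with a minimal sketch. This is not the authors' code: the function names `callset_overlap` and `pearson`, the pipeline labels, and all toy values are hypothetical, and variants are assumed to be keyed by chromosome, position, reference allele and alternate allele.

```python
from itertools import combinations
from math import sqrt

# Assumption: a variant is identified by (chromosome, position, ref, alt).
Variant = tuple[str, int, str, str]


def callset_overlap(callsets: dict[str, set[Variant]]) -> dict:
    """Return shared-variant count and Jaccard index for each pair of callsets."""
    result = {}
    for a, b in combinations(sorted(callsets), 2):
        shared = len(callsets[a] & callsets[b])
        union = len(callsets[a] | callsets[b])
        # Two empty callsets are treated as identical.
        jaccard = shared / union if union else 1.0
        result[(a, b)] = {"shared": shared, "jaccard": jaccard}
    return result


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical toy callsets from two pipeline versions.
toy_callsets = {
    "with_realign": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")},
    "no_realign": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 75, "T", "C")},
}
overlap = callset_overlap(toy_callsets)

# Hypothetical per-individual average coverage and variant counts.
toy_coverage = [20.0, 30.0, 40.0]
toy_variant_counts = [4.0e6, 4.2e6, 4.4e6]
r = pearson(toy_coverage, toy_variant_counts)
```

In practice the variant keys would be read from VCF files, and multi-allelic sites would need to be split or normalized before comparison so that the same variant receives the same key in every callset.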

Conclusions: We conclude that applying the GATK "Best Practices" pipeline, including the recommended reference datasets, to underrepresented populations does not decrease the number of called variants compared to alternative pipelines. We recommend aiming for coverage above 30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, including for underrepresented individuals and populations.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8502359
DOI: http://dx.doi.org/10.1186/s12859-021-04407-x

Similar Publications

Basic Science and Pathogenesis.

Alzheimers Dement

December 2024

National Centralized Repository for Alzheimer's Disease and Related Dementias (NCRAD), Indianapolis, IN, USA.

Background: The National Centralized Repository for Alzheimer's Disease and Related Dementias (NCRAD) is continuing to develop a bank of induced pluripotent stem cells (iPSCs) that are available by request to the Alzheimer's disease (AD) research community.

Methods: As part of the pipeline for quality control of received cell lines, DNA was extracted for all lines and was submitted for whole genome sequencing (WGS). Paired-end WGS data was generated using the Illumina NovaSeq 6000 and processed following GATK best practices using the Sentieon pipeline.


Detection of germline CNVs from gene panel data: benchmarking the state of the art.

Brief Bioinform

November 2024

Hereditary Cancer Program, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge - IDIBELL-ONCOBELL, Avinguda de la Granvia de l'Hospitalet, 199, 08908 L'Hospitalet de Llobregat, Spain.

Germline copy number variants (CNVs) play a significant role in hereditary diseases. However, the accurate detection of CNVs from targeted next-generation sequencing (NGS) gene panel data remains a challenging task. Several tools for calling CNVs within this context have been published to date, but the available benchmarks suffer from limitations, including testing on simulated data, testing on small datasets, and testing a small subset of published tools.

Article Synopsis
  • The study investigates population-stratifying and ancestry-informative genetic markers in Indian, Chinese, and wild yak using whole genome resequencing to enhance our understanding of their genetics and ancestry.
  • It analyzes data from 105 yak individuals, identifying over a million high-quality SNP markers, and compares different selection strategies and marker densities to determine the most effective for clustering these populations.
  • The results indicate that a specific marker density (10K) yields the highest genomic breed clustering accuracy, significantly improving estimates of genetic differentiation among the three yak populations.
Article Synopsis
  • Intracranial Epidermoid Cysts (IECs) are rare tumors typically treated with surgery, but their complex adherence makes complete removal challenging and often leads to tumor regrowth.
  • Whole Exome Sequencing (WES) was used to analyze IECs and revealed high mutation rates in genes related to immune responses and key signaling pathways, indicating potential mechanisms for immune evasion.
  • Notably, alterations in genes like NOTCH2 and USP8 were frequently found, suggesting they could be new targets for therapy in IEC treatment while also implicating the PI3K-Akt-mTOR pathway in immune escape.

Thai pharmacogenomics database -2 (TPGxD-2) sequel to TPGxD-1, analyzing genetic variants in 26 non-VIPGx genes within the Thai population.

Clin Transl Sci

October 2024

Division of Pharmacogenomics and Personalized Medicine, Department of Pathology, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand.

Next-generation sequencing (NGS) has transformed pharmacogenomics (PGx), enabling thorough profiling of pharmacogenes using computational methods and advancing personalized medicine. The Thai Pharmacogenomic Database-2 (TPGxD-2) analyzed 948 whole genome sequences, primarily from the Electricity Generating Authority of Thailand (EGAT) cohort. This study extends the previous Thai Pharmacogenomic Database (TPGxD-1) and focuses specifically on 26 genes that are not very important pharmacogenes (non-VIPGx).

