Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. Sequencing technologies and bioinformatic tools are abundant, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, these studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges arising from human diversity and stratification.
Results: We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that data processing varies widely across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline, which differed in whether an indel realignment step was included and in a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals. We compared the pipelines in terms of the count of called variants and the overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called.
Conclusions: We conclude that applying the GATK "Best Practices" pipeline, including its recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage above 30X when identifying most variants is important, and working with large sample sizes at the variant calling stage, including for underrepresented individuals and populations.
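The Results compare pipelines by the overlap of their callsets and relate per-individual coverage to variant counts. The abstract does not say how either was computed, so the following is only a minimal sketch of one plausible way to do both. It assumes each callset has already been reduced to a set of `(chrom, pos, ref, alt)` tuples. The function names `callset_overlap` and `pearson_r`, the choice of the Jaccard index and Pearson's r, and all toy values are assumptions made for illustration, not the authors' method or data.

```python
from math import sqrt


def callset_overlap(callset_a, callset_b):
    """Summarise how two callsets overlap.

    Each callset is an iterable of hashable variant keys, here assumed
    to be (chrom, pos, ref, alt) tuples (a hypothetical representation).
    """
    a, b = set(callset_a), set(callset_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared variants divided by all distinct variants.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy, made-up callsets from two hypothetical pipeline versions.
    pipeline_a = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
                  ("chr2", 500, "G", "GA")}
    pipeline_b = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"),
                  ("chr2", 501, "G", "GA")}
    print(callset_overlap(pipeline_a, pipeline_b))

    # Toy, made-up per-individual values: mean coverage and variant count.
    coverage = [22.0, 31.5, 38.0, 45.2]
    n_variants = [4_010_000, 4_150_000, 4_190_000, 4_230_000]
    print(pearson_r(coverage, n_variants))
```

In practice the variant keys would be parsed from VCF files and normalised first (for example, left-aligning indels), since otherwise identical indels can be represented differently across pipelines and counted as non-overlapping.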
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8502359 | PMC |
http://dx.doi.org/10.1186/s12859-021-04407-x | DOI Listing |
Alzheimers Dement
December 2024
National Centralized Repository for Alzheimer's Disease and Related Dementias (NCRAD), Indianapolis, IN, USA.
Background: The National Centralized Repository for Alzheimer's Disease and Related Dementias (NCRAD) is continuing to develop a bank of induced pluripotent stem cells (iPSCs) that are available by request to the Alzheimer's disease (AD) research community.
Methods: As part of the pipeline for quality control of received cell lines, DNA was extracted from all lines and submitted for whole genome sequencing (WGS). Paired-end WGS data were generated using the Illumina NovaSeq 6000 and processed following GATK best practices using the Sentieon pipeline.
Brief Bioinform
November 2024
Hereditary Cancer Program, Catalan Institute of Oncology, Institut d'Investigació Biomèdica de Bellvitge - IDIBELL-ONCOBELL, Avinguda de la Granvia de l'Hospitalet, 199, 08908 L'Hospitalet de Llobregat, Spain.
Germline copy number variants (CNVs) play a significant role in hereditary diseases. However, the accurate detection of CNVs from targeted next-generation sequencing (NGS) gene panel data remains a challenging task. Several tools for calling CNVs within this context have been published to date, but the available benchmarks suffer from limitations, including testing on simulated data, testing on small datasets, and testing a small subset of published tools.
BMC Genomics
November 2024
ICAR-Indian Veterinary Research Institute, Izatnagar, Bareilly, 243122, India.
Cancers (Basel)
October 2024
Department of Neurological Surgery, University of Washington Medical Center 1, Seattle, WA 98195, USA.
Clin Transl Sci
October 2024
Division of Pharmacogenomics and Personalized Medicine, Department of Pathology, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand.
Next-generation sequencing (NGS) has transformed pharmacogenomics (PGx), enabling thorough profiling of pharmacogenes using computational methods and advancing personalized medicine. The Thai Pharmacogenomic Database-2 (TPGxD-2) analyzed 948 whole genome sequences, primarily from the Electricity Generating Authority of Thailand (EGAT) cohort. This study is an extension of the previous Thai Pharmacogenomic Database (TPGxD-1) and specifically focused on 26 non-very important pharmacogenes (VIPGx) genes.