Population-scale sequencing of whole human genomes is becoming economically feasible; however, data management and analysis remain a formidable challenge for many research groups. Large sequencing studies, like the 1000 Genomes Project, have improved our understanding of human demography and the effect of rare genetic variation in disease. Variant calling on datasets of hundreds or thousands of genomes is time-consuming, expensive, and not easily reproducible given the myriad components of a variant calling pipeline. Here, we describe a cloud-based pipeline for joint variant calling in large samples using the Real Time Genomics population caller. We deployed the population caller on the Amazon cloud with the DNAnexus platform in order to achieve low-cost variant calling. Using our pipeline, we were able to identify 68.3 million variants in 2,535 samples from Phase 3 of the 1000 Genomes Project. By performing the variant calling in parallel, the data were processed within 5 days at a compute cost of $7.33 per sample (a total cost of $18,590 for completed jobs and $21,805 for all jobs). Analysis of how cost and running time depend on data size suggests that, given near-linear scalability, cloud computing can be a cheap and efficient platform for analyzing even larger sequencing studies in the future.
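The cost figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the authors' pipeline: the function names are hypothetical, and the projection assumes strictly linear scaling, which the abstract describes only as "near linear". The reading of "all jobs" as including jobs that did not complete is also an assumption.

```python
# Figures quoted in the abstract.
SAMPLES = 2535
COST_COMPLETED_USD = 18590  # completed jobs only
COST_ALL_JOBS_USD = 21805   # all jobs (assumed to include jobs that did not complete)


def per_sample_cost(total_usd: float, n_samples: int) -> float:
    """Average compute cost per sample."""
    return total_usd / n_samples


def linear_projection(n_samples: int, unit_cost_usd: float) -> float:
    """Projected total cost, ASSUMING cost grows strictly linearly with sample count."""
    return n_samples * unit_cost_usd


completed = per_sample_cost(COST_COMPLETED_USD, SAMPLES)
all_jobs = per_sample_cost(COST_ALL_JOBS_USD, SAMPLES)
overhead_usd = COST_ALL_JOBS_USD - COST_COMPLETED_USD

print(f"Per sample, completed jobs: ${completed:.2f}")
print(f"Per sample, all jobs:       ${all_jobs:.2f}")
print(f"Spend outside completed jobs: ${overhead_usd}")

# Hypothetical study size, chosen only for illustration.
print(f"Projected cost for 10,000 samples: ${linear_projection(10_000, completed):,.0f}")
```

The first figure reproduces the $7.33 per sample stated in the abstract; the other values follow from the two quoted totals and are not reported by the authors.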

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4482534
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0129277

Similar Publications

Background: Pacific Biosciences (PacBio) circular consensus sequencing (CCS), also known as high fidelity (HiFi) technology, has revolutionized modern genomics by producing long (10+ kb) and highly accurate reads. This is achieved by sequencing circularized DNA molecules multiple times and combining them into a consensus sequence. Currently, the accuracy and quality value estimation provided by HiFi technology are more than sufficient for applications such as genome assembly and germline variant calling.

Genomic and phenotypic correlates of mosaic loss of chromosome Y in blood.

Am J Hum Genet

January 2025

Division of Biostatistics, Data Science Institute, Medical College of Wisconsin, Milwaukee, WI, USA; Cancer Center, Medical College of Wisconsin, Milwaukee, WI, USA.

Mosaic loss of Y (mLOY) is the most common somatic chromosomal alteration detected in human blood. The presence of mLOY is associated with altered blood cell counts and increased risk of Alzheimer disease, solid tumors, and other age-related diseases. We sought to gain a better understanding of genetic drivers and associated phenotypes of mLOY through analyses of whole-genome sequencing (WGS) of a large set of genetically diverse males from the Trans-Omics for Precision Medicine (TOPMed) program.

Primary ciliary dyskinesia (PCD, OMIM 244400) is a rare genetic disorder that affects motile cilia and is characterised by impaired mucociliary clearance of the airway epithelium, which results in chronic upper and lower airway infections. While short-read next-generation sequencing technology has been used for the genetic testing of PCD, its effectiveness is limited in identifying variants in the gene because of the nearly identical pseudogene. As we confirmed that the gene was not expressed in airway cells, we obtained nasal mucosa biopsy specimens for total RNA sequencing (RNA-seq) with library enrichment using exome oligos. Among the 34 nasal samples from patients suspected of having PCD, three aberrant splicing patterns in were identified in two samples.

Clair3-RNA: A deep learning-based small variant caller for long-read RNA sequencing data.

bioRxiv

January 2025

Department of Computer Science, School of Computing and Data Science, University of Hong Kong, Hong Kong, China.

Variant calling using long-read RNA sequencing (lrRNA-seq) can be applied to diverse tasks, such as capturing full-length isoforms and gene expression profiling. It poses challenges, however, due to higher error rates than DNA data, the complexities of transcript diversity, RNA editing events, etc. In this paper, we propose Clair3-RNA, the first deep learning-based variant caller tailored for lrRNA-seq data.

Background And Aims: Familial hypercholesterolemia (FH) and other disorders with similar features are common genetic disorders that remain underdiagnosed and undertreated, due in part to the cost of screening. The aim of this study was to design and implement a whole gene targeted NGS panel for the molecular diagnosis of FH and statin intolerance with an emphasis on high quality variant calling, including copy number analysis.

Methods: A whole gene panel for hybridisation-based short read NGS was designed for the dominant FH-genes low density lipoprotein receptor (), apolipoprotein B (APOB), proprotein convertase subtilisin/kexin type 9 (PCSK9), apolipoprotein E (APOE) and the recessive FH-genes low density lipoprotein receptor adaptor protein 1 (), ATP binding cassette subfamily member 5/8 (ABCG5/8) and lipase A, lysosomal acid type (), as well as solute carrier organic anion transporter family member 1B1 (), not an FH gene but linked to statin intolerance.
