Whole-genome sequencing (WGS), whole-exome sequencing (WES) and array genotyping with imputation (IMP) are common strategies for assessing genetic variation and its association with medically relevant phenotypes. To date, there has been no systematic empirical assessment of the yield of these approaches when applied to hundreds of thousands of samples to enable the discovery of complex trait genetic signals. Using data for 100 complex traits from 149,195 individuals in the UK Biobank, we systematically compare the relative yield of these strategies in genetic association studies.
View Article and Find Full Text PDFRapa Nui (also known as Easter Island) is one of the most isolated inhabited places in the world. It has captured the imagination of many owing to its archaeological record, which includes iconic megalithic statues called moai. Two prominent contentions have arisen from the extensive study of Rapa Nui.
View Article and Find Full Text PDFHaplotype estimation, or phasing, has gained significant traction in large-scale projects due to its valuable contributions to population genetics, variant analysis, and the creation of reference panels for imputation and phasing of new samples. To scale with the growing number of samples, haplotype estimation methods designed for population scale rely on highly optimized statistical models to phase genotype data, and usually ignore read-level information. Statistical methods excel in resolving common variants, however, they still struggle at rare variants due to the lack of statistical information.
View Article and Find Full Text PDFFor many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation.
View Article and Find Full Text PDFLow-coverage imputation is becoming ever more present in ancient DNA (aDNA) studies. Imputation pipelines commonly used for present-day genomes have been shown to yield accurate results when applied to ancient genomes. However, post-mortem damage (PMD), in the form of C-to-T substitutions at the reads termini, and contamination with DNA from closely related species can potentially affect imputation performance in aDNA.
View Article and Find Full Text PDFMajor migration events in Holocene Eurasia have been characterized genetically at broad regional scales. However, insights into the population dynamics in the contact zones are hampered by a lack of ancient genomic data sampled at high spatiotemporal resolution. Here, to address this, we analysed shotgun-sequenced genomes from 100 skeletons spanning 7,300 years of the Mesolithic period, Neolithic period and Early Bronze Age in Denmark and integrated these with proxies for diet (C and N content), mobility (Sr/Sr ratio) and vegetation cover (pollen).
View Article and Find Full Text PDFIntroduction: We have reanalyzed the genomic data of the International Collaboration for the Genomics of HIV (ICGH), centering on HIV-1 Elite Controllers.
Methods: We performed a genome-wide Association Study comparing 543 HIV Elite Controllers with 3,272 uninfected controls of European descent. Using the latest database for imputation, we analyzed 35,552 Single Nucleotide Polymorphisms (SNPs) within the Major Histocompatibility Complex () region.
The release of 150,119 UK Biobank sequences represents an unprecedented opportunity as a reference panel to impute low-coverage whole-genome sequencing data with high accuracy but current methods cannot cope with the size of the data. Here we introduce GLIMPSE2, a low-coverage whole-genome sequencing imputation method that scales sublinearly in both the number of samples and markers, achieving efficient whole-genome imputation from the UK Biobank reference panel while retaining high accuracy for ancient and modern genomes, particularly at rare variants and for very low-coverage samples.
View Article and Find Full Text PDFPhasing involves distinguishing the two parentally inherited copies of each chromosome into haplotypes. Here, we introduce SHAPEIT5, a new phasing method that quickly and accurately processes large sequencing datasets and applied it to UK Biobank (UKB) whole-genome and whole-exome sequencing data. We demonstrate that SHAPEIT5 phases rare variants with low switch error rates of below 5% for variants present in just 1 sample out of 100,000.
View Article and Find Full Text PDFDue to postmortem DNA degradation and microbial colonization, most ancient genomes have low depth of coverage, hindering genotype calling. Genotype imputation can improve genotyping accuracy for low-coverage genomes. However, it is unknown how accurate ancient DNA imputation is and whether imputation introduces bias to downstream analyses.
View Article and Find Full Text PDFNon-coding regulatory elements such as enhancers are key in controlling the cell-type specificity and spatio-temporal expression of genes. To drive stable and precise gene transcription robust to genetic variation and environmental stress, genes are often targeted by multiple enhancers with redundant action. However, it is unknown whether enhancers targeting the same gene display simultaneous activity or whether some enhancer combinations are more often co-active than others.
View Article and Find Full Text PDFStudying the interplay between genetic variation, epigenetic changes, and regulation of gene expression is crucial to understand the modification of cellular states in various conditions, including immune diseases. In this study, we characterize the cell-specificity in three key cells of the human immune system by building cis maps of regulatory regions with coordinated activity (CRDs) from ChIP-seq peaks and methylation data. We find that only 33% of CRD-gene associations are shared between cell types, revealing how similarly located regulatory regions provide cell-specific modulation of gene activity.
View Article and Find Full Text PDFSummary: The positional Burrows-Wheeler transform (PBWT) data structure allows for efficient haplotype data matching and compression. Its performance makes it a powerful tool for bioinformatics. However, existing algorithms do not exploit parallelism due to inner dependencies.
View Article and Find Full Text PDFSample recruitment for research consortia, biobanks, and personal genomics companies span years, necessitating genotyping in batches, using different technologies. As marker content on genotyping arrays varies, integrating such datasets is non-trivial and its impact on haplotype estimation (phasing) and whole genome imputation, necessary steps for complex trait analysis, remains under-evaluated. Using the iPSYCH dataset, comprising 130,438 individuals, genotyped in two stages, on different arrays, we evaluated phasing and imputation performance across multiple phasing methods and data integration protocols.
View Article and Find Full Text PDFIdentical genetic variations can have different phenotypic effects depending on their parent of origin. Yet, studies focusing on parent-of-origin effects have been limited in terms of sample size due to the lack of parental genomes or known genealogies. We propose a probabilistic approach to infer the parent-of-origin of individual alleles that does not require parental genomes nor prior knowledge of genealogy.
View Article and Find Full Text PDFThe SARS-CoV-2 pandemic has differentially impacted populations across race and ethnicity. A multi-omic approach represents a powerful tool to examine risk across multi-ancestry genomes. We leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from nasopharyngeal swabs of 1049 individuals (736 SARS-CoV-2 positive and 313 SARS-CoV-2 negative) and integrate them with digital phenotypes from electronic health records from a diverse catchment area in Northern California.
View Article and Find Full Text PDFMost human genes are co-expressed with a nearby gene. Previous studies have revealed this local gene co-expression to be widespread across chromosomes and across dozens of tissues. Yet, so far these studies used bulk RNA-seq, averaging gene expression measurements across millions of cells, thus being unclear if this co-expression stems from transcription events in single cells.
View Article and Find Full Text PDFMotivation: Generation of genotype data has been growing exponentially over the last decade. With the large size of recent datasets comes a storage and computational burden with ever increasing costs. To reduce this burden, we propose XSI, a file format with reduced storage footprint that also allows computation on the compressed data and we show how this can improve future analyses.
View Article and Find Full Text PDFThe combined analysis of haplotype panels with phenotype clinical cohorts is a common approach to explore the genetic architecture of human diseases. However, genetic studies are mainly based on single nucleotide variants (SNVs) and small insertions and deletions (indels). Here, we contribute to fill this gap by generating a dense haplotype map focused on the identification, characterization, and phasing of structural variants (SVs).
View Article and Find Full Text PDFComparing transcript levels between healthy and diseased individuals allows the identification of differentially expressed genes, which may be causes, consequences or mere correlates of the disease under scrutiny. We propose a method to decompose the observational correlation between gene expression and phenotypes driven by confounders, forward- and reverse causal effects. The bi-directional causal effects between gene expression and complex traits are obtained by Mendelian Randomization integrating summary-level data from GWAS and whole-blood eQTLs.
View Article and Find Full Text PDFNearby genes are often expressed as a group. Yet, the prevalence, molecular mechanisms and genetic control of local gene co-expression are far from being understood. Here, by leveraging gene expression measurements across 49 human tissues and hundreds of individuals, we find that local gene co-expression occurs in 13% to 53% of genes per tissue.
View Article and Find Full Text PDF