Principal component analysis (PCA) is widely used to control for population structure in genome-wide association studies (GWAS). Top principal components (PCs) typically reflect population structure, but challenges arise in deciding how many PCs are needed and ensuring that PCs do not capture other artifacts such as regions with atypical linkage disequilibrium (LD). In response to the latter, many groups suggest performing LD pruning or excluding known high LD regions prior to PCA.
Background: Chronic kidney disease (CKD) is highly prevalent in Central America, and genetic factors may contribute to CKD risk. To understand the influences of genetic admixture on CKD susceptibility, we conducted an admixture mapping screening of CKD traits and risk factors in US Hispanic and Latino individuals from Central America country of origin.
Methods: We analyzed 1023 participants of HCHS/SOL (Hispanic Community Health Study/Study of Latinos) who reported 4 grandparents originating from the same Central America country.
Genotype data include errors that may influence conclusions reached by downstream statistical analyses. Previous studies have estimated genotype error rates from discrepancies in human pedigree data, such as Mendelian inconsistent genotypes or apparent phase violations. However, uncalled deletions, which generally have not been accounted for in these studies, can lead to biased error rate estimates.
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset.
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset.
The effective size of a population (Ne) in the recent past can be estimated through analysis of identity-by-descent (IBD) segments. Several methods have been developed for estimating Ne from autosomal IBD segments, but no such effort has been made with X chromosome IBD segments. In this work, we propose a method to estimate the X chromosome effective population size from X chromosome IBD segments.
Alzheimer disease (AD) is the most common form of senile dementia, with high incidence late in life in many populations including Caribbean Hispanic (CH) populations. Such admixed populations, descended from more than one ancestral population, can present challenges for genetic studies, including limited sample sizes and unique analytical constraints. Therefore, CH populations and other admixed populations have not been well represented in studies of AD, and much of the genetic variation contributing to AD risk in these populations remains unknown.
Local ancestry is the source ancestry at each point in the genome of an admixed individual. Inferred local ancestry is used for admixture mapping and population genetic analyses. We present FLARE (fast local ancestry estimation), a method for local ancestry inference.
The first release of UK Biobank whole-genome sequence data contains 150,119 genomes. We present an open-source pipeline for filtering, phasing, and indexing these genomes on the cloud-based UK Biobank Research Analysis Platform. This pipeline makes it possible to apply haplotype-based methods to UK Biobank whole-genome sequence data.
Am J Hum Genet
December 2022
We provide a method for estimating the genome-wide mutation rate from sequence data on unrelated individuals by using segments of identity by descent (IBD). The length of an IBD segment indicates the time to shared ancestor of the segment, and mutations that have occurred since the shared ancestor result in discordances between the two IBD haplotypes. Previous methods for IBD-based estimation of mutation rate have required the use of family data for accurate phasing of the genotypes.
Haplotypes can be estimated from unphased genotype data via statistical methods. When parent-offspring trios are available for inferring the true phase from Mendelian inheritance rules, the accuracy of statistical phasing is usually measured by the switch error rate, which is the proportion of pairs of consecutive heterozygotes that are incorrectly phased. We present a method for estimating the genotype error rate from parent-offspring trios and a method for estimating the bias that occurs in the observed switch error rate as a result of genotype error.
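The switch error rate defined above can be illustrated with a minimal Python sketch. This is not the authors' software: the function name `switch_error_rate` and the input layout (two lists of 0/1 alleles per individual) are assumptions made for illustration, and the sketch assumes the true and estimated haplotypes carry identical genotypes, i.e. it ignores the genotype error that the abstract is concerned with.

```python
def switch_error_rate(true_haps, est_haps):
    """Proportion of consecutive heterozygote pairs that are phased inconsistently.

    true_haps, est_haps: (hap1, hap2) pairs of equal-length lists of 0/1 alleles
    for one individual. Hypothetical input layout, chosen for illustration.
    Assumes both inputs have the same genotype at every site.
    """
    t1, t2 = true_haps
    e1, _ = est_haps

    # Only heterozygous sites carry phase information.
    het_sites = [i for i in range(len(t1)) if t1[i] != t2[i]]
    if len(het_sites) < 2:
        raise ValueError("need at least two heterozygous sites")

    # For each heterozygous site, does estimated haplotype 1 follow true haplotype 1?
    follows_true_1 = [e1[i] == t1[i] for i in het_sites]

    # A switch error occurs when that relationship flips between consecutive sites.
    switches = sum(a != b for a, b in zip(follows_true_1, follows_true_1[1:]))
    return switches / (len(het_sites) - 1)


# Toy data: the estimated phase flips once, between the 2nd and 3rd heterozygote.
true_pair = ([0, 1, 0, 1, 0], [1, 0, 1, 0, 1])
est_pair = ([0, 1, 1, 0, 1], [1, 0, 0, 1, 0])
rate = switch_error_rate(true_pair, est_pair)  # 1 switch among 4 consecutive pairs
```

In this toy example one of the four consecutive heterozygote pairs is switched, so the rate is 0.25.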
Allele frequency estimates in admixed populations, such as Hispanics and Latinos, rely on the sample's specific admixture composition and thus may differ between two seemingly similar populations. However, ancestry-specific allele frequencies, i.e.
Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time.
The SPrime program detects the variants in current-day populations that were introgressed from an archaic source in the past. It is optimized for detecting introgression from Neanderthals and Denisovans in modern humans. We provide a protocol for detecting Neanderthal and Denisovan introgression in 1000 Genomes Project data, specifically focusing on the CHB (Han Chinese in Beijing) population.
Motivation: Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of related individuals increases quadratically with sample size.
Results: We present IBDkin, a software package written in C for estimating kinship coefficients from identity by descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.
Recombination rates vary significantly across the genome, and estimates of recombination rates are needed for downstream analyses such as haplotype phasing and genotype imputation. Existing methods for recombination rate estimation are limited by insufficient amounts of informative genetic data or by high computational cost. We present a method and software, called IBDrecomb, for using segments of identity by descent to infer recombination rates.
Archaeological studies estimate the initial settlement of Samoa at 2,750 to 2,880 y ago and identify only limited settlement and human modification to the landscape until about 1,000 to 1,500 y ago. At this point, a complex history of migration is thought to have begun with the arrival of people sharing ancestry with Near Oceanic groups (i.e.
Segments of identity by descent (IBD) are used in many genetic analyses. We present a method for detecting identical-by-descent haplotype segments in phased genotype data. Our method, called hap-IBD, combines a compressed representation of haplotype data, the positional Burrows-Wheeler transform, and multi-threaded execution to produce very fast analysis times.
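As background on the positional Burrows-Wheeler transform (PBWT) mentioned above, the following Python sketch shows only its core ordering step: after each marker, haplotypes are kept sorted by their reversed prefixes, so haplotypes that match over a long stretch ending at that marker sit next to each other. This is a generic illustration of the technique, not hap-IBD's implementation; the function name `pbwt_orderings` and the list-of-lists input are assumptions made for this example.

```python
def pbwt_orderings(haps):
    """Return the haplotype ordering before marker 0 and after each marker.

    haps: list of equal-length lists of 0/1 alleles (phased, biallelic).
    Hypothetical input layout, chosen for illustration.
    """
    n_markers = len(haps[0])
    order = list(range(len(haps)))
    orderings = [order[:]]

    for k in range(n_markers):
        # Stable partition by the allele at marker k: haplotypes with allele 0
        # come first, then allele 1, each group keeping its previous relative
        # order. This keeps haplotypes sorted by reversed prefix through marker k.
        zeros = [h for h in order if haps[h][k] == 0]
        ones = [h for h in order if haps[h][k] == 1]
        order = zeros + ones
        orderings.append(order[:])

    return orderings


# Toy data: four haplotypes, three markers.
toy_haps = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
toy_orderings = pbwt_orderings(toy_haps)
```

Each update is a single pass over the haplotypes, so building all orderings takes time proportional to the number of haplotypes times the number of markers. A full IBD detector would additionally track how far back neighbouring haplotypes match, which this sketch omits.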