Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement - yet fast and small - is helpful for research on highly scalable bioinformatics.
Nat Comput Sci
February 2025
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. However, encoding genetic data in existing tabular data structures and file formats has become costly and unsustainable. Here we introduce the genotype representation graph (GRG), a fully connected hierarchical data structure that losslessly encodes phased whole-genome polymorphisms.
Selective sweeps describe the process by which an adaptive mutation arises and rapidly fixes in the population, thereby removing genetic variation in its genomic vicinity. The expected signatures of selective sweeps are relatively well understood in panmictic population models, yet natural populations often extend across larger geographic ranges where individuals are more likely to mate with those born nearby. To investigate how such spatial population structure can affect sweep dynamics and signatures, we simulated selective sweeps in populations inhabiting a two-dimensional continuous landscape.
Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable.
Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gained popularity due to their ability to generate AGs closely resembling empirical data. These models, however, present a tradeoff between expressivity and tractability.
The genetic variants introduced into the ancestors of modern humans from interbreeding with Neanderthals have been suggested to contribute an unexpected extent to complex human traits. However, testing this hypothesis has been challenging due to the idiosyncratic population genetic properties of introgressed variants. We developed rigorous methods to assess the contribution of introgressed Neanderthal variants to heritable trait variation and applied these methods to analyze 235,592 introgressed Neanderthal variants and 96 distinct phenotypes measured in about 300,000 unrelated white British individuals in the UK Biobank.
The ancestral recombination graph is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress toward scalably estimating whole-genome genealogies. In addition to inferring the ancestral recombination graph, some of these methods can also provide ancestral recombination graphs sampled from a defined posterior distribution.
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
We use the genotyping and death register information of 409,693 individuals of British ancestry to investigate fitness effects of the CCR5-∆32 mutation. We estimate a 21% increase in the all-cause mortality rate in individuals who are homozygous for the ∆32 allele. A deleterious effect of the ∆32/∆32 mutation is also independently supported by a significant deviation from the Hardy-Weinberg equilibrium (HWE) due to a deficiency of ∆32/∆32 individuals at the time of recruitment.
Diminishing returns epistasis makes the benefit of the same advantageous mutation smaller in fitter genotypes and is frequently observed in experimental evolution. However, its occurrence in other contexts, environment dependence, and mechanistic basis are unclear. Here, we address these questions using 1,005 sequenced segregants generated from a yeast cross.
Maximum growth rate per individual (r) and carrying capacity (K) are key life-history traits that together characterize density-dependent population growth and are therefore crucial parameters of many ecological and evolutionary theories such as r/K selection. Although r and K are generally thought to correlate inversely, both r/K tradeoffs and trade-ups have been observed. Nonetheless, neither the conditions under which each of these relationships occurs nor the causes of these relationships are fully understood.
Ribosomes are highly abundant in cells and comprise, besides RNAs of varying lengths, 55-80 similarly sized, short proteins. This seemingly unusual composition is thought to have resulted from selection for rapid autocatalytic ribosome production. Here, we demonstrate that ribosomal protein-splitting mutations cannot accelerate ribosome production.
Robustness and evolvability are fundamental characteristics of life whose relationship has intrigued generations of biologists. Studies of several genotype-phenotype maps (GPMs) such as the map between short DNA sequences and their bindings to transcription factors showed that phenotype robustness (PR) promotes phenotype evolvability (PE), but the underlying reason is unclear. Here, we show mathematically that the expected PE is a monotonically increasing function of the expected PR in random GPMs.
Gene-environment interaction (G×E) refers to the phenomenon that the same mutation has different phenotypic effects in different environments. Although quantitative trait loci (QTLs) exhibiting G×E have been reported, little is known about the general properties of G×E and those of its underlying QTLs. Here, we use the genotypes of 1005 segregants from a cross between two Saccharomyces cerevisiae strains, and the growth rates of these segregants in 47 environments, to identify growth rate QTLs (gQTLs) in each environment, and QTLs that have different growth effects in each pair of environments (g×eQTLs).
Genome Biol Evol
December 2014
Overlapping genes, where one DNA sequence codes for two proteins with different reading frames, are not uncommon in viruses and cellular organisms. Estimating the direction and strength of natural selection acting on overlapping genes is important for understanding their functionality, origin, evolution, maintenance, and potential interaction. However, the standard methods for estimating synonymous (dS) and nonsynonymous (dN) nucleotide substitution rates are inapplicable here because a nucleotide change can be simultaneously synonymous and nonsynonymous when both reading frames involved are considered.
The way population size, population structure (with migration), and spatially dependent selection (where there is no globally optimal allele) combine to affect the substitution rate is poorly understood. Here, we consider a two-patch model where mutant alleles are beneficial in one patch and deleterious in the other patch. We assume that the spatial average of selection on mutant alleles is zero.