Publications by authors named "Xinzhu Wei"

Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable bioinformatics.


Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. However, encoding genetic data in existing tabular data structures and file formats has become costly and unsustainable. Here we introduce the genotype representation graph (GRG), a fully connected hierarchical data structure that losslessly encodes phased whole-genome polymorphisms.


Selective sweeps describe the process by which an adaptive mutation arises and rapidly fixes in the population, thereby removing genetic variation in its genomic vicinity. The expected signatures of selective sweeps are relatively well understood in panmictic population models, yet natural populations often extend across larger geographic ranges where individuals are more likely to mate with those born nearby. To investigate how such spatial population structure can affect sweep dynamics and signatures, we simulated selective sweeps in populations inhabiting a two-dimensional continuous landscape.


Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable.


Population genetic studies often rely on artificial genomes (AGs) simulated by generative models of genetic data. In recent years, unsupervised learning models, based on hidden Markov models, deep generative adversarial networks, restricted Boltzmann machines, and variational autoencoders, have gained popularity due to their ability to generate AGs closely resembling empirical data. These models, however, present a tradeoff between expressivity and tractability.


The genetic variants introduced into the ancestors of modern humans from interbreeding with Neanderthals have been suggested to contribute to complex human traits to an unexpected extent. However, testing this hypothesis has been challenging due to the idiosyncratic population genetic properties of introgressed variants. We developed rigorous methods to assess the contribution of introgressed Neanderthal variants to heritable trait variation and applied these methods to analyze 235,592 introgressed Neanderthal variants and 96 distinct phenotypes measured in about 300,000 unrelated white British individuals in the UK Biobank.


The ancestral recombination graph is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress toward scalably estimating whole-genome genealogies. In addition to inferring the ancestral recombination graph, some of these methods can also provide ancestral recombination graphs sampled from a defined posterior distribution.

Article Synopsis
  • Understanding the emergence of new viruses hinges on thoroughly annotating their genomes, particularly focusing on overlapping genes (OLGs) commonly found in viruses, like SARS-CoV-2.
  • Researchers identified a novel OLG in SARS-CoV-2 that appears in Guangxi pangolin-CoVs but not in other related viruses, and they analyzed its translation and protein sequence across different evolutionary contexts.
  • This OLG has been mistakenly classified, leading to confusion in research, but it has been shown to trigger a strong antibody response in COVID-19 patients, emphasizing the critical role of OLGs in viral evolution and pandemics.
Article Synopsis
  • Purifying natural selection helps identify functional biological sequences in protein-coding genes, using a measure called dN/dS (the ratio of nonsynonymous to synonymous substitutions).
  • Overlapping genes (OLGs) complicate this analysis since changes that are synonymous for one gene may not be for the other, making it necessary to develop new methods for evaluating these constraints.
  • The proposed tool, OLGenie, offers an enhanced method for identifying true OLGs with high accuracy and has been successfully tested on viral genomes, including a significant analysis of an HIV-1 gene, highlighting the potential for further studies in genome annotation.

An amendment to this paper has been published and can be accessed via a link at the top of the paper.


We use the genotyping and death register information of 409,693 individuals of British ancestry to investigate fitness effects of the CCR5-∆32 mutation. We estimate a 21% increase in the all-cause mortality rate in individuals who are homozygous for the ∆32 allele. A deleterious effect of the ∆32/∆32 mutation is also independently supported by a significant deviation from the Hardy-Weinberg equilibrium (HWE) due to a deficiency of ∆32/∆32 individuals at the time of recruitment.
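
As a rough illustration of the Hardy-Weinberg comparison described above, the following Python sketch derives expected genotype counts from the observed allele frequencies and computes a chi-square statistic for the deviation. The function names and genotype counts are hypothetical and chosen only for illustration; they are not the study's data, and the study's actual test may differ.

```python
def hwe_expected(n_aa, n_ab, n_bb):
    """Expected genotype counts under Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb are observed counts of the two homozygotes and
    the heterozygote (hypothetical inputs, not data from the study).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    return (p * p * n, 2 * p * q * n, q * q * n)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed counts with HWE expectations."""
    observed = (n_aa, n_ab, n_bb)
    expected = hwe_expected(n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    # Made-up counts with fewer B/B homozygotes than HWE predicts.
    counts = (815, 180, 5)
    print("expected:", hwe_expected(*counts))
    print("chi-square:", hwe_chi_square(*counts))
```

A deficiency of one homozygote class shows up as an observed count below its expected count, which is the kind of deviation the abstract refers to.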


Diminishing returns epistasis causes the benefit of the same advantageous mutation to be smaller in fitter genotypes and is frequently observed in experimental evolution. However, its occurrence in other contexts, environment dependence, and mechanistic basis are unclear. Here, we address these questions using 1,005 sequenced segregants generated from a yeast cross.


Maximum growth rate per individual (r) and carrying capacity (K) are key life-history traits that together characterize density-dependent population growth and are therefore crucial parameters of many ecological and evolutionary theories such as r/K selection. Although r and K are generally thought to correlate inversely, both r/K tradeoffs and trade-ups have been observed. Nonetheless, neither the conditions under which each of these relationships occurs nor the causes of these relationships are fully understood.
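
To make the roles of r and K concrete, here is a minimal Python sketch that assumes the standard logistic growth model, dN/dt = rN(1 - N/K). The abstract does not state which growth model the paper uses, so the model choice, the function name, and the parameter values are all assumptions for illustration.

```python
import math


def logistic_size(t, n0, r, k):
    """Population size at time t under logistic growth (assumed model).

    Closed-form solution of dN/dt = r * N * (1 - N / k), where n0 is the
    initial size, r the maximum per-capita growth rate, and k the
    carrying capacity.
    """
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


if __name__ == "__main__":
    # Hypothetical parameters: r sets the early growth speed,
    # k sets the plateau the population approaches.
    for t in (0, 5, 10, 20, 40):
        print(t, round(logistic_size(t, n0=10, r=0.5, k=1000), 2))
```

In this model r governs growth when the population is small, while K is the size at which growth stops, so the two parameters can in principle vary independently.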

Article Synopsis
  • Theory suggests that individual fitness is highest when the genetic distance between parents is balanced: neither too small nor too large.
  • Research evaluating fungal, plant, and animal hybrids shows that fitness traits follow a humped curve based on mating distance, revealing an optimal mating distance (OMD).
  • OMDs are closer to the species' nucleotide diversity than maximum genetic distances, indicating that heterosis benefits can be diminished by genetic incompatibility, impacting areas like speciation, conservation, and agriculture.

Ribosomes are highly abundant in cells and comprise, besides RNAs of varying lengths, 55-80 similarly sized, short proteins. This seemingly unusual composition is thought to have resulted from selection for rapid autocatalytic ribosome production. Here, we demonstrate that ribosomal protein-splitting mutations cannot accelerate ribosome production.


Robustness and evolvability are fundamental characteristics of life whose relationship has intrigued generations of biologists. Studies of several genotype-phenotype maps (GPMs) such as the map between short DNA sequences and their bindings to transcription factors showed that phenotype robustness (PR) promotes phenotype evolvability (PE), but the underlying reason is unclear. Here, we show mathematically that the expected PE is a monotonically increasing function of the expected PR in random GPMs.


Gene-environment interaction (G×E) refers to the phenomenon that the same mutation has different phenotypic effects in different environments. Although quantitative trait loci (QTLs) exhibiting G×E have been reported, little is known about the general properties of G×E and those of its underlying QTLs. Here, we use the genotypes of 1,005 segregants from a cross between two Saccharomyces cerevisiae strains, and the growth rates of these segregants in 47 environments, to identify growth rate QTLs (gQTLs) in each environment, and QTLs that have different growth effects in each pair of environments (g×eQTLs).


Overlapping genes, where one DNA sequence codes for two proteins with different reading frames, are not uncommon in viruses and cellular organisms. Estimating the direction and strength of natural selection acting on overlapping genes is important for understanding their functionality, origin, evolution, maintenance, and potential interaction. However, the standard methods for estimating synonymous (dS) and nonsynonymous (dN) nucleotide substitution rates are inapplicable here because a nucleotide change can be simultaneously synonymous and nonsynonymous when both reading frames involved are considered.
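
The point that a single nucleotide change can be synonymous in one reading frame and nonsynonymous in the other can be checked with a small Python sketch. It assumes the standard genetic code and two overlapping frames on the same strand; the sequence, the function name `classify_change`, and the chosen mutation are hypothetical and serve only as an illustration, not as the method of the paper.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(seq, pos, new_base, frame):
    """Classify a substitution at 0-based index pos in the given frame.

    frame is the 0-based offset of the first codon on the same strand.
    Returns "synonymous" or "nonsynonymous".
    """
    offset = pos - frame
    if offset < 0:
        raise ValueError("position lies before the first codon of this frame")
    start = frame + 3 * (offset // 3)  # start index of the affected codon
    codon = seq[start:start + 3]
    if len(codon) < 3:
        raise ValueError("position falls in an incomplete codon")
    idx = pos - start
    mutated = codon[:idx] + new_base + codon[idx + 1:]
    same = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same else "nonsynonymous"


if __name__ == "__main__":
    seq = "ATGCTTGCA"  # hypothetical sequence read in frames 0 and 1
    # T -> C at index 5: CTT -> CTC in frame 0, TTG -> TCG in frame 1.
    for frame in (0, 1):
        print(frame, classify_change(seq, 5, "C", frame))
```

Because the same change falls at a different codon position in each frame, its effect on the encoded amino acid has to be evaluated per frame.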


The way population size, population structure (with migration), and spatially dependent selection (where there is no globally optimal allele) combine to affect the substitution rate is poorly understood. Here, we consider a two-patch model where mutant alleles are beneficial in one patch and deleterious in the other patch. We assume that the spatial average of selection on mutant alleles is zero.
