Nucleotide changes in gene regulatory elements are important determinants of neuronal development and diseases. Using massively parallel reporter assays in primary human cells from mid-gestation cortex and cerebral organoids, we interrogated the cis-regulatory activity of 102,767 open chromatin regions, including thousands of sequences with cell type-specific accessibility and variants associated with brain gene regulation. In primary cells, we identified 46,802 active enhancer sequences and 164 variants that alter enhancer activity.
View Article and Find Full Text PDFHuman accelerated regions (HARs) are conserved genomic loci that evolved at an accelerated rate in the human lineage and may underlie human-specific traits. We generated HARs and chimpanzee accelerated regions with an automated pipeline and an alignment of 241 mammalian genomes. Combining deep learning with chromatin capture experiments in human and chimpanzee neural progenitor cells, we discovered a significant enrichment of HARs in topologically associating domains containing human-specific genomic variants that change three-dimensional (3D) genome organization.
View Article and Find Full Text PDFNucleotide changes in gene regulatory elements are important determinants of neuronal development and disease. Using massively parallel reporter assays in primary human cells from mid-gestation cortex and cerebral organoids, we interrogated the -regulatory activity of 102,767 sequences, including differentially accessible cell-type specific regions in the developing cortex and single-nucleotide variants associated with psychiatric disorders. In primary cells, we identified 46,802 active enhancer sequences and 164 disorder-associated variants that significantly alter enhancer activity.
View Article and Find Full Text PDFBackground: Association of chromatin with lamin proteins at the nuclear periphery has emerged as a potential mechanism to coordinate cell type-specific gene expression and maintain cellular identity via gene silencing. Unlike many histone modifications and chromatin-associated proteins, lamina-associated domains (LADs) are mapped genome-wide in relatively few genetically normal human cell types, which limits our understanding of the role peripheral chromatin plays in development and disease.
Results: To address this gap, we map LAMIN B1 occupancy across twelve human cell types encompassing pluripotent stem cells, intermediate progenitors, and differentiated cells from all three germ layers.
Using machine learning (ML), we interrogated the function of all human-chimpanzee variants in 2,645 human accelerated regions (HARs), finding 43% of HARs have variants with large opposing effects on chromatin state and 14% on neurodevelopmental enhancer activity. This pattern, consistent with compensatory evolution, was confirmed using massively parallel reporter assays in chimpanzee and human neural progenitor cells. The species-specific enhancer activity of HARs was accurately predicted from the presence and absence of transcription factor footprints in each species.
View Article and Find Full Text PDFHuman accelerated regions (HARs) are the fastest-evolving sequences in the human genome. When HARs were discovered in 2006, their function was mysterious due to scant annotation of the noncoding genome. Diverse technologies, from transgenic animals to machine learning, have consistently shown that HARs function as gene regulatory enhancers with significant enrichment in neurodevelopment.
View Article and Find Full Text PDFDeleterious genetic variants in POGZ, which encodes the chromatin regulator Pogo Transposable Element with ZNF Domain protein, are strongly associated with autism spectrum disorder (ASD). Although it is a high-confidence ASD risk gene, the neurodevelopmental functions of POGZ remain unclear. Here we reveal the genomic binding of POGZ in the developing forebrain at euchromatic loci and gene regulatory elements (REs).
View Article and Find Full Text PDFThe scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics.
View Article and Find Full Text PDFAbnormal P-wave axis may reflect preclinical atrial dysfunction and has been associated with an increased risk of incident atrial fibrillation (AF) in the general population. Patients with diabetes mellitus (DM) have a higher prevalence of AF, but the association of abnormal P-wave axis and the risk of incident AF in those with diabetes has not been previously explored. For this analysis, we included 8,965 eligible participants from the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial.
View Article and Find Full Text PDFMassively parallel reporter assays (MPRAs) can simultaneously measure the function of thousands of candidate regulatory sequences (CRSs) in a quantitative manner. In this method, CRSs are cloned upstream of a minimal promoter and reporter gene, alongside a unique barcode, and introduced into cells. If the CRS is a functional regulatory element, it will lead to the transcription of the barcode sequence, which is measured via RNA sequencing and normalized for cellular integration via DNA sequencing of the barcode.
View Article and Find Full Text PDFTo discover regulatory elements driving the specificity of gene expression in different cell types and regions of the developing human brain, we generated an atlas of open chromatin from nine dissected regions of the mid-gestation human telencephalon, as well as microdissected upper and deep layers of the prefrontal cortex. We identified a subset of open chromatin regions (OCRs), termed predicted regulatory elements (pREs), that are likely to function as developmental brain enhancers. pREs showed temporal, regional, and laminar differences in chromatin accessibility and were correlated with gene expression differences across regions and gestational ages.
View Article and Find Full Text PDFThe CRISPR/Cas system is a highly specific genome editing tool capable of distinguishing alleles differing by even a single base pair. Target sites might carry genetic variations that are not distinguishable by sgRNA designing tools based on one reference genome. AlleleAnalyzer is an open-source software that incorporates single-nucleotide variants and short insertions and deletions to design sgRNAs for precisely editing 1 or multiple haplotypes of a sequenced genome, currently supporting 11 Cas proteins.
View Article and Find Full Text PDFGlycosylation alterations are indicative of tissue inflammation and neoplasia, but whether these alterations contribute to disease pathogenesis is largely unknown. To study the role of glycan changes in pancreatic disease, we inducibly expressed human fucosyltransferase 3 and β1,3-galactosyltransferase 5 in mice, reconstituting the glycan sialyl-Lewis, also known as carbohydrate antigen 19-9 (CA19-9). Notably, CA19-9 expression in mice resulted in rapid and severe pancreatitis with hyperactivation of epidermal growth factor receptor (EGFR) signaling.
View Article and Find Full Text PDFChromatin interactions and linkage disequilibrium (LD) are both pairwise measurements between genomic loci that show block patterns along mammalian chromosomes. Their values are generally high for sites that are nearby in the linear genome but abruptly drop across block boundaries. One function of chromatin boundaries is to insulate regulatory domains from one another.
View Article and Find Full Text PDFBackground: Cluster heatmaps are commonly used in biology and related fields to reveal hierarchical clusters in data matrices. This visualization technique has high data density and reveal clusters better than unordered heatmaps alone. However, cluster heatmaps have known issues making them both time consuming to use and prone to error.
View Article and Find Full Text PDFVariability in induced pluripotent stem cell (iPSC) lines remains a concern for disease modeling and regenerative medicine. We have used RNA-sequencing analysis and linear mixed models to examine the sources of gene expression variability in 317 human iPSC lines from 101 individuals. We found that ∼50% of genome-wide expression variability is explained by variation across individuals and identified a set of expression quantitative trait loci that contribute to this variation.
View Article and Find Full Text PDFDiscriminating the gene target of a distal regulatory element from other nearby transcribed genes is a challenging problem with the potential to illuminate the causal underpinnings of complex diseases. We present TargetFinder, a computational method that reconstructs regulatory landscapes from diverse features along the genome. The resulting models accurately predict individual enhancer-promoter interactions across multiple cell lines with a false discovery rate up to 15 times smaller than that obtained using the closest gene.
View Article and Find Full Text PDFPrediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets.
View Article and Find Full Text PDFProtein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, these networks face important data quality challenges of noise and incompleteness that adversely affect the results obtained from their analysis. Here, we apply a robust measure of local network structure called common neighborhood similarity (CNS) to address these challenges.
View Article and Find Full Text PDFWe introduce a theory of sequential causal inference in which learners in a chain estimate a structural model from their upstream "teacher" and then pass samples from the model to their downstream "student". It extends the population dynamics of genetic drift, recasting Kimura's selectively neutral theory as a special case of a generalized drift process using structured populations with memory. We examine the diffusion and fixation properties of several drift processes and propose applications to learning, inference, and evolution.
View Article and Find Full Text PDF