Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set.
View Article and Find Full Text PDFThe interplay between transcription factors and chromatin accessibility regulates cell type diversification during vertebrate embryogenesis. To systematically decipher the gene regulatory logic guiding this process, we generated a single-cell multi-omics atlas of RNA expression and chromatin accessibility during early zebrafish embryogenesis. We developed a deep learning model to predict chromatin accessibility based on DNA sequence and found that a small number of transcription factors underlie cell-type-specific chromatin landscapes.
View Article and Find Full Text PDFThe RNA exosome plays critical roles in eukaryotic RNA degradation, but it remains unclear how the exosome specifically recognizes its targets. The PAXT connection is an adaptor that recruits the exosome to polyadenylated RNAs in the nucleus, especially transcripts polyadenylated at intronic poly(A) sites. Here we show that PAXT-mediated RNA degradation is induced by the combination of a 5' splice site and a poly(A) junction, but not by either sequence alone.
View Article and Find Full Text PDFAn important and largely unsolved problem in synthetic biology is how to target gene expression to specific cell types. Here, we apply iterative deep learning to design synthetic enhancers with strong differential activity between two human cell lines. We initially train models on published datasets of enhancer activity and chromatin accessibility and use them to guide the design of synthetic enhancers that maximize predicted specificity.
View Article and Find Full Text PDFMicrobial split-pool ligation transcriptomics (microSPLiT) is a high-throughput single-cell RNA sequencing method for bacteria. With four combinatorial barcoding rounds, microSPLiT can profile transcriptional states in hundreds of thousands of Gram-negative and Gram-positive bacteria in a single experiment without specialized equipment. As bacterial samples are fixed and permeabilized before barcoding, they can be collected and stored ahead of time.
View Article and Find Full Text PDFJTE-607 is an anticancer and anti-inflammatory compound and its active form, compound 2, directly binds to and inhibits CPSF73, the endonuclease for the cleavage step in pre-messenger RNA (pre-mRNA) 3' processing. Surprisingly, compound 2-mediated inhibition of pre-mRNA cleavage is sequence specific and the drug sensitivity is predominantly determined by sequences flanking the cleavage site (CS). Using massively parallel in vitro assays, we identified key sequence features that determine drug sensitivity.
View Article and Find Full Text PDFThe 5' UTRs of mRNAs are critical for translation regulation, but their regulatory features are poorly characterized. Here, we report the regulatory landscape of 5' UTRs during early zebrafish embryogenesis using a massively parallel reporter assay of 18,154 sequences coupled to polysome profiling. We found that the 5' UTR is sufficient to confer temporal dynamics to translation initiation, and identified 86 motifs enriched in 5' UTRs with distinct ribosome recruitment capabilities.
View Article and Find Full Text PDFDNA has emerged as an attractive medium for archival data storage due to its durability and high information density. Scalable parallel random access to information is a desirable property of any storage system. For DNA-based storage systems, however, this still needs to be robustly established.
View Article and Find Full Text PDFJTE-607 is a small molecule compound with anti-inflammation and anti-cancer activities. Upon entering the cell, it is hydrolyzed to Compound 2, which directly binds to and inhibits CPSF73, the endonuclease for the cleavage step in pre-mRNA 3' processing. Although CPSF73 is universally required for mRNA 3' end formation, we have unexpectedly found that Compound 2- mediated inhibition of pre-mRNA 3' processing is sequence-specific and that the sequences flanking the cleavage site (CS) are a major determinant for drug sensitivity.
View Article and Find Full Text PDFProtein-protein interactions (PPIs) regulate many cellular processes, and engineered PPIs have cell and gene therapy applications. Here we introduce massively parallel protein-protein interaction measurement by sequencing (MP3-seq), an easy-to-use and highly scalable yeast-two-hybrid approach for measuring PPIs. In MP3-seq, DNA barcodes are associated with specific protein pairs, and barcode enrichment can be read by sequencing to provide a direct measure of interaction strength.
View Article and Find Full Text PDFBackground: 3'-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging.
View Article and Find Full Text PDFDNA has emerged as a powerful substrate for programming information processing machines at the nanoscale. Among the DNA computing primitives used today, DNA strand displacement (DSD) is arguably the most popular, with DSD-based circuit applications ranging from disease diagnostics to molecular artificial neural networks. The outputs of DSD circuits are generally read using fluorescence spectroscopy.
View Article and Find Full Text PDFSequence-based neural networks can learn to make accurate predictions from large biological datasets, but model interpretation remains challenging. Many existing feature attribution methods are optimized for continuous rather than discrete input patterns and assess individual feature importance in isolation, making them ill-suited for interpreting non-linear interactions in molecular sequences. Building on work in computer vision and natural language processing, we developed an approach based on deep learning - Scrambler networks - wherein the most salient sequence positions are identified with learned input masks.
View Article and Find Full Text PDFDivision of labor between cells is ubiquitous in biology but the use of multicellular consortia for engineering applications is only beginning to be explored. A significant advantage of multicellular circuits is their potential to be modular with respect to composition but this claim has not yet been extensively tested using experiments and quantitative modeling. Here, we construct a library of 24 yeast strains capable of sending, receiving or responding to three molecular signals, characterize them experimentally and build quantitative models of their input-output relationships.
View Article and Find Full Text PDFOver just the last 2 years, mRNA therapeutics and vaccines have undergone a rapid transition from an intriguing concept to real-world impact. However, whereas some aspects of mRNA therapeutics, such as the use of chemical modifications to increase stability and reduce immunogenicity, have been extensively optimized for over two decades, other aspects, particularly the selection and design of the noncoding leader and trailer sequences which control translation efficiency and stability, have received comparably less attention. In practice, such 5' and 3' untranslated regions (UTRs) are often borrowed from highly expressed human genes with few or no modifications, as in the case for the Pfizer/BioNTech Covid vaccine.
View Article and Find Full Text PDFBioinformatics
February 2022
Motivation: Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns.
View Article and Find Full Text PDFBackground: Optimization of DNA and protein sequences based on Machine Learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies leading to poor convergence.
View Article and Find Full Text PDFAs global demand for digital storage capacity grows, storage technologies based on synthetic DNA have emerged as a dense and durable alternative to traditional media. Existing approaches leverage robust error correcting codes and precise molecular mechanisms to reliably retrieve specific files from large databases. Typically, files are retrieved using a pre-specified key, analogous to a filename.
View Article and Find Full Text PDFWe performed a comprehensive analysis of the transcriptional changes occurring during human induced pluripotent stem cell (hiPSC) differentiation to cardiomyocytes. Using single cell RNA-seq, we sequenced > 20,000 single cells from 55 independent samples representing two differentiation protocols and multiple hiPSC lines. Samples included experimental replicates ranging from undifferentiated hiPSCs to mixed populations of cells at D90 post-differentiation.
View Article and Find Full Text PDFThe human neonatal cerebellum is one-fourth of its adult size yet contains the blueprint required to integrate environmental cues with developing motor, cognitive and emotional skills into adulthood. Although mature cerebellar neuroanatomy is well studied, understanding of its developmental origins is limited. In this study, we systematically mapped the molecular, cellular and spatial composition of human fetal cerebellum by combining laser capture microscopy and SPLiT-seq single-nucleus transcriptomics.
View Article and Find Full Text PDFSingle-cell RNA sequencing (scRNA-seq) has become an essential tool for characterizing gene expression in eukaryotes, but current methods are incompatible with bacteria. Here, we introduce microSPLiT (microbial split-pool ligation transcriptomics), a high-throughput scRNA-seq method for Gram-negative and Gram-positive bacteria that can resolve heterogeneous transcriptional states. We applied microSPLiT to >25,000 cells sampled at different growth stages, creating an atlas of changes in metabolism and lifestyle.
View Article and Find Full Text PDF