Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases.
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures.
Motivation: Gene regulatory networks define regulatory relationships between transcription factors and target genes within a biological system, and reconstructing them is essential for understanding cellular growth and function. Methods for inferring and reconstructing networks from genomics data have evolved rapidly over the last decade in response to advances in sequencing technology and machine learning. The scale of data collection has increased dramatically; the largest genome-wide gene expression datasets have grown from thousands of measurements to millions of single cells, and new technologies are on the horizon to increase to tens of millions of cells and above.
Understanding the changes in diverse molecular pathways underlying the development of breast tumors is critical for improving diagnosis, treatment, and drug development. Here, we used RNA-profiling of canine mammary tumors (CMTs) coupled with a robust analysis framework to model molecular changes in human breast cancer. Our study leveraged a key advantage of the canine model, the frequent presence of multiple naturally occurring tumors at diagnosis, thus providing samples spanning normal tissue and benign and malignant tumors from each patient.
ATAC-seq has become a leading technology for probing the chromatin landscape of single and aggregated cells. Distilling functional regions from ATAC-seq presents diverse analysis challenges. Methods commonly used to analyze chromatin accessibility datasets are adapted from algorithms designed to process different experimental technologies, disregarding the statistical and biological differences intrinsic to the ATAC-seq technology.
Innate lymphoid cells (ILCs) promote tissue homeostasis and immune defense but also contribute to inflammatory diseases. ILCs exhibit phenotypic and functional plasticity in response to environmental stimuli, yet the transcriptional regulatory networks (TRNs) that control ILC function are largely unknown. Here, we integrate gene expression and chromatin accessibility data to infer regulatory interactions between transcription factors (TFs) and genes within intestinal type 1, 2, and 3 ILC subsets.
Transcriptional regulatory networks (TRNs) provide insight into cellular behavior by describing interactions between transcription factors (TFs) and their gene targets. The assay for transposase-accessible chromatin (ATAC)-seq, coupled with TF motif analysis, provides indirect evidence of chromatin binding for hundreds of TFs genome-wide. Here, we propose methods for TRN inference in a mammalian setting, using ATAC-seq data to improve gene expression modeling.
Decisions to continue or suspend therapy with immune checkpoint inhibitors are commonly guided by tumor dynamics seen on serial imaging. However, immunotherapy responses are uniquely challenging to interpret because tumors often shrink slowly or can appear transiently enlarged due to inflammation. We hypothesized that monitoring tumor cell death in real time by quantifying changes in circulating tumor DNA (ctDNA) levels could enable early assessment of immunotherapy efficacy.
Differential binding of transcription factors (TFs) at cis-regulatory loci drives the differentiation and function of diverse cellular lineages. Understanding the regulatory interactions that underlie cell fate decisions requires characterizing TF binding sites (TFBS) across multiple cell types and conditions. Techniques, e.
Type 1 regulatory T cells (Tr1 cells) are induced by interleukin-27 (IL-27) and have critical roles in the control of autoimmunity and resolution of inflammation. We found that the transcription factors IRF1 and BATF were induced early on after treatment with IL-27 and were required for the differentiation and function of Tr1 cells in vitro and in vivo. Epigenetic and transcriptional analyses revealed that both transcription factors influenced chromatin accessibility and expression of the genes required for Tr1 cell function.
Genomics Proteomics Bioinformatics
February 2015
We report a significantly enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of high-throughput mass spectrometry-based proteomics research, ranging from a single laboratory, to a group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography-tandem mass spectrometry (LC-MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization.
Efficient DNA double-strand break (DSB) repair is a critical determinant of cell survival in response to DNA damaging agents, and it plays a key role in the maintenance of genomic integrity. Homologous recombination (HR) and non-homologous end-joining (NHEJ) represent the two major pathways by which DSBs are repaired in mammalian cells. We now understand that HR and NHEJ repair are composed of multiple sub-pathways, some of which still remain poorly understood.
Whole-exome sequencing (WES) studies have demonstrated the contribution of de novo loss-of-function single-nucleotide variants (SNVs) to autism spectrum disorder (ASD). However, challenges in the reliable detection of de novo insertions and deletions (indels) have limited inclusion of these variants in prior analyses. By applying a robust indel detection method to WES data from 787 ASD families (2,963 individuals), we demonstrate that de novo frameshift indels contribute to ASD risk (OR = 1.
Parasitic protozoa of the flagellate order Kinetoplastida represent one of the deepest branches of the eukaryotic tree. Among this group of organisms, the mechanism of RNA interference (RNAi) has been investigated in Trypanosoma brucei and to a lesser degree in Leishmania (Viannia) spp. The pathway is triggered by long double-stranded RNA (dsRNA) and in T.
Congenital heart disease (CHD) is the most frequent birth defect, affecting 0.8% of live births. Many cases occur sporadically and impair reproductive fitness, suggesting a role for de novo mutations.
Proc Natl Acad Sci U S A
April 2013
As the Pacific-Farallon spreading center approached North America, the Farallon plate fragmented into a number of small plates. Some of the microplate fragments ceased subducting before the spreading center reached the trench. Most tectonic models have assumed that the subducting oceanic slab detached from these microplates close to the trench, but recent seismic tomography studies have revealed a high-velocity anomaly beneath Baja California that appears to be a fossil slab still attached to the Guadalupe and Magdalena microplates.
Among trypanosomatid protozoa, the mechanism of RNA interference (RNAi) has been investigated in Trypanosoma brucei and to a lesser extent in Leishmania braziliensis. Although these two parasitic organisms belong to the same family, they are evolutionarily distantly related, raising questions about the conservation of the RNAi pathway. Here we carried out an in-depth analysis of small interfering RNAs (siRNAs) associated with L.
Detection of cell-free tumor DNA in the blood has offered promise as a cancer biomarker, but practical clinical implementations have been impeded by the lack of a sensitive and accurate method for quantitation that is also simple, inexpensive, and readily scalable. Here we present an approach that uses next-generation sequencing to quantify the small fraction of DNA molecules that contain tumor-specific mutations within a background of normal DNA in plasma. Using layers of sequence redundancy designed to distinguish true mutations from sequencer misreads and PCR misincorporations, we achieved a detection sensitivity of approximately 1 variant in 5,000 molecules.
Multiple studies have confirmed the contribution of rare de novo copy number variations to the risk for autism spectrum disorders. But whereas de novo single nucleotide variants have been identified in affected individuals, their contribution to risk has yet to be clarified. Specifically, the frequency and distribution of these mutations have not been well characterized in matched unaffected controls, and such data are vital to the interpretation of de novo coding mutations observed in probands.
Calculating the longitudinal extension of the average attributable fraction (LE-AAF) for many risk factors (RFs) requires a two-stage computational process using only those combinations of RFs observed in the dataset. We first screen candidate RFs in a Cox model and, assuming piecewise constant hazards, use pooled logistic regression to model the probability of death as a function of combinations of selected RFs. We average the iterative differencing of the attributable fractions calculated for all overlapping subsets of co-occurring RFs to obtain an LE-AAF for each RF that is additive and symmetrical.
Background: Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions.
Duplicated pseudogenes in the human genome are disabled copies of functioning parent genes. They result from block duplication events occurring throughout evolutionary history. Relatively recent duplications (with sequence similarity ≥ 90% and length ≥ 1 kb) are termed segmental duplications (SDs); here, we analyze the interrelationship of SDs and pseudogenes.
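The two thresholds quoted in this abstract can be read as a simple filter. The following is a minimal sketch of that reading only, not the study's pipeline: the `Duplication` record, its field names, and both function names are hypothetical and introduced purely for illustration, and 1 kb is taken to mean 1,000 base pairs.

```python
from typing import List, NamedTuple


class Duplication(NamedTuple):
    # Hypothetical record type; field names are assumptions, not from the study.
    name: str
    identity: float  # sequence similarity as a fraction between 0.0 and 1.0
    length_bp: int   # length of the duplicated block in base pairs


def is_segmental_duplication(
    dup: Duplication,
    min_identity: float = 0.90,  # ">= 90% sequence similarity" from the abstract
    min_length_bp: int = 1000,   # ">= 1 kb" from the abstract, assuming 1 kb = 1,000 bp
) -> bool:
    """Return True when a duplication meets both thresholds."""
    return dup.identity >= min_identity and dup.length_bp >= min_length_bp


def filter_segmental_duplications(dups: List[Duplication]) -> List[Duplication]:
    """Keep only the records that satisfy both thresholds, preserving order."""
    return [d for d in dups if is_segmental_duplication(d)]
```

Both thresholds are inclusive here, matching the "≥" in the text; a record that passes on similarity but is shorter than 1 kb is excluded.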
Background: Pseudogenes provide a record of the molecular evolution of genes. As glycolysis is such a highly conserved and fundamental metabolic pathway, the pseudogenes of glycolytic enzymes comprise a standardized genomic measuring stick and an ideal platform for studying molecular evolution. One of the glycolytic enzymes, glyceraldehyde-3-phosphate dehydrogenase (GAPDH), has already been noted to have one of the largest numbers of associated pseudogenes, among all proteins.
Personal-genomics endeavors, such as the 1000 Genomes project, are generating maps of genomic structural variants by analyzing ends of massively sequenced genome fragments. To process these we developed Paired-End Mapper (PEMer; http://sv.gersteinlab.
Background: The availability of genome sequences of numerous organisms allows comparative study of pseudogenes in syntenic regions. Conservation of pseudogenes suggests that they might have a functional role in some instances.
Results: We report the first large-scale comparative analysis of ribosomal protein pseudogenes in four mammalian genomes (human, chimpanzee, mouse and rat).