DNA breathing dynamics-transient base-pair opening and closing due to thermal fluctuations-are vital for processes like transcription, replication, and repair. Traditional models, such as the Extended Peyrard-Bishop-Dauxois (EPBD), provide insights into these dynamics but are computationally limited for long sequences. We present , a high-throughput Langevin molecular dynamics framework leveraging JAX for GPU-accelerated simulations, achieving up to 30x speedup and superior scalability compared to the original C-based EPBD implementation.
View Article and Find Full Text PDFIt was previously shown that DNA breathing, thermodynamic stability, as well as transcriptional activity and transcription factor (TF) bindings are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train our EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-Seq) data comprising 690 ChIP-seq experimental results encompassing 161 distinct TFs and 91 human cell types.
View Article and Find Full Text PDFMotivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.
View Article and Find Full Text PDFThe fibroblast growth factor receptor 2 (2) gene is one of the most extensively studied genes with many known mutations implicated in several human disorders, including oncogenic ones. Most FGFR2 disease-associated gene mutations are missense mutations that result in constitutive activation of the FGFR2 protein and downstream molecular pathways. Many tertiary structures of the FGFR2 kinase domain are publicly available in the wildtype and mutated forms and in the inactive and activated state of the receptor.
View Article and Find Full Text PDFOver the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein "sentences"? Our analysis suggests that some unsupervised methods perform as well or better than existing supervised methods.
View Article and Find Full Text PDFUnderstanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of research, with implications for unraveling the complex mechanisms underlying various functional effects. Our study delves into the role of DNA's biophysical properties, including thermodynamic stability, shape, and flexibility in transcription factor (TF) binding. We developed a multi-modal deep learning model integrating these properties with DNA sequence data.
View Article and Find Full Text PDFMotivation: The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion. This dynamics results in transient openings in the double helix and is referred to as "DNA breathing" or "DNA bubbles." The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factors binding.
View Article and Find Full Text PDFMotivation: The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion.This dynamics results in transient openings in the double helix and is referred to as "DNA breathing" or "DNA bubbles." The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factors binding.
View Article and Find Full Text PDFProtein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open.
View Article and Find Full Text PDFOver the past decade, Markov State Models (MSM) have emerged as powerful methodologies to build discrete models of dynamics over structures obtained from Molecular Dynamics trajectories. The identification of macrostates for the MSM is a central decision that impacts the quality of the MSM but depends on both the selected representation of a structure and the clustering algorithm utilized over the featurized structures. Motivated by a large molecular system in its free and bound state, this paper investigates two directions of research, further reducing the representation dimensionality in a non-parametric, data-driven manner and including more structures in the computation.
View Article and Find Full Text PDFWith the debut of AlphaFold2, we now can get a highly-accurate view of a reasonable equilibrium tertiary structure of a protein molecule. Yet, a single-structure view is insufficient and does not account for the high structural plasticity of protein molecules. Obtaining a multi-structure view of a protein molecule continues to be an outstanding challenge in computational structural biology.
View Article and Find Full Text PDFIEEE/ACM Trans Comput Biol Bioinform
October 2023
We have long known that characterizing protein structures structure is key to understanding protein function. Computational approaches have largely addressed a narrow formulation of the problem, seeking to compute one native structure from an amino-acid sequence. Now AlphaFold2 is shown to be able to reveal a high-quality native structure for many proteins.
View Article and Find Full Text PDFMany biological and biotechnological processes are controlled by protein-protein and protein-solvent interactions. In order to understand, predict, and optimize such processes, it is important to understand how solvents affect protein structure during protein-solvent interactions. In this study, all-atom molecular dynamics are used to investigate the structural dynamics and energetic properties of a C-terminal domain of the Rift Valley Fever Virus L protein solvated in glycerol and aqueous glycerol solutions in different concentrations by molecular weight.
View Article and Find Full Text PDFJ Bioinform Comput Biol
February 2021
Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular.
View Article and Find Full Text PDFIEEE/ACM Trans Comput Biol Bioinform
June 2022
A central challenge in protein modeling research and protein structure prediction in particular is known as decoy selection. The problem refers to selecting biologically-active/native tertiary structures among a multitude of physically-realistic structures generated by template-free protein structure prediction methods. Research on decoy selection is active.
View Article and Find Full Text PDFCurr Opin Struct Biol
April 2021
Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where characterizing macromolecular structure and dynamics is central to a detailed, molecular-level understanding of biological processes in the living cell. The current computational paradigm utilizes optimization as the generative process for modeling both structure and structural dynamics.
View Article and Find Full Text PDFBackground: Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance in positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity.
View Article and Find Full Text PDFControlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de-novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically-active structure. A major drawback with this approach is that computing a large number of structures imposes time and space costs.
View Article and Find Full Text PDFIEEE Trans Nanobioscience
July 2020
The three-dimensional structures populated by a protein molecule determine to a great extent its biological activities. The rich information encoded by protein structure on protein function continues to motivate the development of computational approaches for determining functionally-relevant structures. The majority of structures generated in silico are not relevant.
View Article and Find Full Text PDFRapid growth in molecular structure data is renewing interest in featurizing structure. Featurizations that retain information on biological activity are particularly sought for protein molecules, where decades of research have shown that indeed structure encodes function. Research on featurization of protein structure is active, but here we assess the promise of autoencoders.
View Article and Find Full Text PDFJ Bioinform Comput Biol
December 2019
Molecular dynamics (MD) simulation software allows probing the equilibrium structural dynamics of a molecule of interest, revealing how a molecule navigates its structure space one structure at a time. To obtain a broader view of dynamics, typically one needs to launch many such simulations, obtaining many trajectories. A summarization of the equilibrium dynamics requires integrating the information in the various trajectories, and Markov State Models (MSM) are increasingly being used for this task.
View Article and Find Full Text PDF