We propose a sampling algorithm that relies on a collective variable (CV) of midsize dimension modeled by a normalizing flow, and that uses nonequilibrium dynamics to turn a refreshed CV value proposed by the flow into a full configurational move. The algorithm takes the form of a Markov chain with nonlocal updates, allowing jumps through energy barriers across metastable states. The flow is trained throughout the algorithm to reproduce the free energy landscape of the CV.
Complex networks are powerful mathematical tools for modelling and understanding the behaviour of highly interconnected systems. However, existing methods for analyzing these networks focus on local properties (e.g.
Clinical databases typically include, for each patient, many heterogeneous features, for example blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. We here propose to exploit a recently developed statistical approach, the Information Imbalance, to compare different subsets of patient features and automatically select the set of features that is maximally informative for a given clinical purpose, especially in minority classes. We adapt the Information Imbalance approach to work in a clinical framework, where patient features are often categorical and are generally available only for a fraction of the patients.
We introduce an approach which allows detecting causal relationships between variables for which the time evolution is available. Causality is assessed by a variational scheme based on the Information Imbalance of distance ranks, a statistical test capable of inferring the relative information content of different distance measures. We test whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without explicitly modeling the underlying dynamics and without the need to compute probability densities of the dynamic variables.
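As a rough illustration of the rank-based quantity mentioned above, the Information Imbalance from a distance A to a distance B is commonly written as Δ(A→B) = (2/N)·⟨r^B | r^A = 1⟩: the average rank, according to B, of each point's nearest neighbour according to A. The sketch below is not the authors' implementation. It assumes continuous features, plain Euclidean distances, and no tied distances, and the function names `distance_ranks` and `information_imbalance` are hypothetical.

```python
import numpy as np


def distance_ranks(X):
    """Rank matrix: ranks[i, j] = rank of point j among the neighbours of i.

    Rank 1 is the nearest neighbour. Euclidean distance is an assumption
    made for this sketch; each point is excluded from its own neighbours.
    """
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)  # push self-distance to the last rank
    order = np.argsort(d, axis=1)
    n = len(X)
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)
    return ranks


def information_imbalance(XA, XB):
    """Delta(A -> B): close to 0 if A predicts B, close to 1 if it does not."""
    rA = distance_ranks(XA)
    rB = distance_ranks(XB)
    n = len(XA)
    nn_in_A = np.argmin(rA, axis=1)  # nearest neighbour of each point in A
    return 2.0 / n * rB[np.arange(n), nn_in_A].mean()


# Toy data: two independent Gaussian coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
print("full -> first coordinate:", information_imbalance(X, X[:, :1]))
print("first coordinate -> full:", information_imbalance(X[:, :1], X))
```

On this toy data the full two-dimensional space should predict the first coordinate much better than the reverse, which is the asymmetry the measure is designed to expose.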
Background And Objectives: ASPECTs is a widely used marker to identify early stroke signs on non-enhanced computed tomography (NECT), yet it presents interindividual variability and it may be hard to use for non-experts. We introduce an algorithm capable of automatically estimating the NECT volumetric extension of early acute ischemic changes in the 3D space. We compared the power of this marker with ASPECTs evaluated by experienced practitioners in predicting the clinical outcome.
According to common physical chemistry wisdom, the solvent cavities hosting a solute are tightly sewn around it, practically coinciding with its van der Waals surface. Solvation entropy is primarily determined by the surface and the volume of the cavity while enthalpy is determined by the solute-solvent interaction. In this work, we challenge this picture, demonstrating by molecular dynamics simulations that the cavities surrounding the 20 amino acids deviate significantly from the molecular surface.
Machine-learning (ML) has become a key workhorse in molecular simulations. Building an ML model in this context involves encoding the information on chemical environments using local atomic descriptors. In this work, we focus on the Smooth Overlap of Atomic Positions (SOAP) and their application in studying the properties of liquid water both in the bulk and at the hydrophobic air-water interface.
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use for discrete spaces can lead to errors and biases. In this Letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches.
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset.
DADApy is a Python software package for analyzing and characterizing high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering, and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage in a synthetic dataset and in a real-world application.
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified.
Single-molecule force spectroscopy (SMFS) uses the cantilever tip of an atomic force microscope (AFM) to apply a force able to unfold a single protein. The obtained force-distance curve encodes the unfolding pathway, and from its analysis it is possible to characterize the folded domains. SMFS has been mostly used to study the unfolding of purified proteins, in solution or reconstituted in a lipid bilayer.
Epitopes that bind simultaneously to all human alleles of Major Histocompatibility Complex class II (MHC II) are considered one of the key factors for the development of improved vaccines and cancer immunotherapies. To engineer MHC II multiple-allele binders, we developed a protocol called PanMHC-PARCE, based on the unsupervised optimization of the epitope sequence by single-point mutations, parallel explicit-solvent molecular dynamics simulations and scoring of the MHC II-epitope complexes. The key idea is accepting mutations that not only improve the affinity but also reduce the affinity gap between the alleles.
Computational peptide design is useful for therapeutics, diagnostics, and vaccine development. To select the most promising peptide candidates, the key is describing accurately the peptide-target interactions at the molecular level. We here review a computational peptide design protocol whose key feature is the use of all-atom explicit solvent molecular dynamics for describing the different peptide-target complexes explored during the optimization.
We apply two independent data analysis methodologies to locate stable climate states in an intermediate complexity climate model and analyse their interplay. First, drawing from the theory of quasi-potentials, and viewing the state space as an energy landscape with valleys and mountain ridges, we infer the relative likelihood of the identified multistable climate states and investigate the most likely transition trajectories as well as the expected transition times between them. Second, harnessing techniques from data science, and specifically manifold learning, we characterize the data landscape of the simulation output to find climate states and basin boundaries within a fully agnostic and unsupervised framework.
By using advanced data analysis techniques, we characterize the shape of the voids surrounding model polymers of different sizes in water, observed in molecular dynamics simulations. We find that even when the model polymer is folded, the voids are extremely rough, with branches that can extend to over 1 nm away from the polymer. Water molecules in contact with the void retain close-to-bulk properties in terms of local structure.
Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations, in material science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss of molecular systems and present state-of-the-art algorithms of , , and , and .
Background: The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the best-known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time consuming and it may suffer from a bias generated by the hand-curation itself, which is often guided by the available experimental evidence.
Computational protein design has emerged as a powerful tool capable of identifying sequences compatible with pre-defined protein structures. The sequence design protocols, implemented in the Rosetta suite, have become widely used in the protein engineering community. To understand the strengths and limitations of the Rosetta design framework, we tested several design protocols on two distinct folds (SH3-1 and Ubiquitin).
We analyzed a 100 μs MD trajectory of the SARS-CoV-2 main protease by a non-parametric data analysis approach which allows characterizing a free energy landscape as a simultaneous function of hundreds of variables. We identified several conformations that, when visited by the dynamics, are stable for several hundred nanoseconds. We explicitly characterize and describe these metastable states.
One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set.
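To make the notion of intrinsic dimension concrete, here is a minimal sketch of one standard global estimator, the two-nearest-neighbour (TwoNN) approach in its maximum-likelihood form: for each point, μ = r₂/r₁ is the ratio of the distances to its second and first nearest neighbours, and the estimate is N / Σ log μ. This is a swapped-in illustration, not the method of the abstract above: it returns a single ID for the whole data set and therefore does not address the case where the ID varies. The function name `twonn_id` is hypothetical, and the sketch assumes Euclidean distances and no duplicate points.

```python
import numpy as np


def twonn_id(X):
    """Global intrinsic-dimension estimate from nearest-neighbour ratios.

    Assumes Euclidean distances and no duplicated points (r1 > 0).
    """
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    d_sorted = np.sort(d, axis=1)
    r1, r2 = d_sorted[:, 0], d_sorted[:, 1]
    mu = r2 / r1                          # ratio of 2nd to 1st neighbour distance
    return len(X) / np.log(mu).sum()      # maximum-likelihood estimate


# Toy data: a flat 2D sheet embedded in a 5-dimensional space.
rng = np.random.default_rng(0)
sheet = np.zeros((800, 5))
sheet[:, :2] = rng.uniform(size=(800, 2))
print("estimated ID:", twonn_id(sheet))
```

The toy data have five coordinates but only two of them vary, so the estimate should land near 2 rather than 5.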
Motivation: Single-molecule force spectroscopy (SMFS) experiments pose the challenge of analysing protein unfolding data (traces) coming from preparations with heterogeneous composition (e.g. where different proteins are present in the sample).