One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set.
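The abstracts below compute the ID from nearest-neighbor distances. As an illustration, here is a minimal sketch of one common estimator of this kind, the two-nearest-neighbor (two-NN) ratio estimator; the function name, the toy dataset, and the choice of this particular estimator are assumptions for illustration, not necessarily the method used in these papers.

```python
import numpy as np

def two_nn_id(points):
    """Estimate the intrinsic dimension via the two-nearest-neighbor ratio.

    For each point, mu = r2 / r1 is the ratio of the distances to its
    second and first nearest neighbors; under mild assumptions mu follows
    a Pareto law whose exponent is the intrinsic dimension, giving the
    maximum-likelihood estimate returned below.
    """
    X = np.asarray(points, dtype=float)
    # Full pairwise distance matrix (fine for small toy datasets).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # exclude each point as its own neighbor
    d.sort(axis=1)
    r1, r2 = d[:, 0], d[:, 1]
    mu = r2 / r1
    return len(X) / np.sum(np.log(mu))

# Toy example: points on a 2-D linear subspace embedded in 10 dimensions.
rng = np.random.default_rng(0)
plane = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
est = two_nn_id(plane)  # typically close to 2
```

Because the estimator uses only the two nearest neighbors of each point, it is insensitive to the curvature of the manifold at larger scales, which is what makes this family of methods suitable for twisted, non-uniformly sampled data.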
It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. Here we investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence.
We introduce an approach for computing the free energy and the probability density in high-dimensional spaces, such as those explored in molecular dynamics simulations of biomolecules. The approach exploits the presence of correlations between the coordinates, induced, in molecular dynamics, by the chemical nature of the molecules. Due to these correlations, the data points lie on a manifold that can be highly curved and twisted, but whose dimension is normally small.
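The connection between probability density and free energy used in such approaches is F(x) = -kT ln p(x). A minimal one-dimensional sketch follows; the histogram density estimate, the kT value, and the function name are illustrative assumptions, not the paper's high-dimensional method.

```python
import numpy as np

KT = 2.494  # k_B * T in kJ/mol at 300 K (assumed unit choice)

def free_energy_profile(samples, bins=50):
    """Free energy along one coordinate from sampled configurations.

    Estimates p(x) with a normalized histogram, then applies
    F(x) = -kT * ln p(x), shifted so the minimum of F is zero.
    """
    p, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mask = p > 0               # drop empty bins (log undefined there)
    F = -KT * np.log(p[mask])
    F -= F.min()               # set the lowest free energy to zero
    return centers[mask], F

# Toy example: Gaussian sampling gives a roughly quadratic profile.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100_000)
centers, F = free_energy_profile(x)  # minimum of F lies near x = 0
```

In high dimension a plain histogram fails, which is precisely why the abstract's approach leans on the low dimension of the manifold the points lie on.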
Analyzing large volumes of high-dimensional data is an issue of fundamental importance in data science, molecular simulations and beyond. Several approaches work on the assumption that the important content of a dataset lies on a manifold whose intrinsic dimension (ID) is much lower than the raw number of coordinates. Such a manifold is generally twisted and curved; in addition, points on it are non-uniformly distributed: two factors that make identifying the ID and exploiting it genuinely hard.