An essential step in engineering proteins and understanding disease-causing missense mutations is to accurately model the protein stability changes that such mutations cause. Here, we developed a new sequence-based predictor of the protein stability (PROST) change (Gibbs free energy change, ΔΔG) upon a single-point missense mutation. PROST extracts multiple descriptors from the most promising sequence-based predictors, such as BoostDDG, SAAFEC-SEQ, and DDGun. PROST also extracts descriptors from iFeature and AlphaFold2. The extracted descriptors include sequence-based features, physicochemical properties, evolutionary information, evolutionary-based physicochemical properties, and predicted structural features. The PROST predictor is a weighted average ensemble model based on extreme gradient boosting (XGBoost) decision trees and an extra-trees regressor; PROST is trained on both direct and hypothetical reverse mutations using the S5294 data set (S2647 direct mutations + S2647 inverse mutations). The parameters of the PROST model are optimized using grid search with 5-fold cross-validation, and feature importance analysis reveals the most relevant features. The performance of PROST is evaluated in a blinded manner, employing nine distinct data sets and existing state-of-the-art sequence-based and structure-based predictors. This method consistently performs well on the frataxin, S217, S349, Ssym, S669, myoglobin, and CAGI5 data sets in blind tests and performs similarly to the state-of-the-art predictors on the p53 and S276 data sets. When compared with the latest predictors such as BoostDDG, SAAFEC-SEQ, ACDC-NN-seq, and DDGun, PROST outperforms them. A case study of mutation scanning of the frataxin protein for nine wild-type residues demonstrates the utility of PROST. Taken together, these findings indicate that PROST is a well-suited predictor when no protein structural information is available.
The source code of PROST, data sets, examples, and pretrained models along with how to use PROST are available at https://github.com/ShahidIqb/PROST and https://prost.erc.monash.edu/seq.
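Two ideas in the abstract can be illustrated with a minimal sketch: adding hypothetical reverse mutations to the training data, and combining two regressors by weighted averaging. This is not the PROST implementation. The `Mutation` record, the function names, and the example sequence are hypothetical. The sketch assumes the usual antisymmetry convention, in which the reverse mutation carries the negated ΔΔG of the direct mutation, and it uses plain lists of numbers in place of the predictions of the XGBoost and extra-trees models.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record for one single-point mutation (not PROST's format)."""
    sequence: str    # sequence carrying the wild-type residue
    position: int    # 1-based position of the substituted residue
    wild_type: str   # residue before the substitution
    mutant: str      # residue after the substitution
    ddg: float       # measured stability change for wild_type -> mutant


def reverse_mutation(m: Mutation) -> Mutation:
    """Build the hypothetical reverse mutation (mutant -> wild type).

    Assumption: ddG is antisymmetric, so the reverse label is -ddg.
    """
    idx = m.position - 1
    if m.sequence[idx] != m.wild_type:
        raise ValueError("wild-type residue does not match the sequence")
    # The reverse mutation starts from the mutated sequence.
    mutated = m.sequence[:idx] + m.mutant + m.sequence[idx + 1:]
    return Mutation(mutated, m.position, m.mutant, m.wild_type, -m.ddg)


def augment_with_reverse(direct: Sequence[Mutation]) -> List[Mutation]:
    """Return the direct mutations followed by their reverse counterparts."""
    return list(direct) + [reverse_mutation(m) for m in direct]


def weighted_average(pred_a: Sequence[float],
                     pred_b: Sequence[float],
                     weight_a: float) -> List[float]:
    """Combine two models' predictions; the second model gets 1 - weight_a."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(pred_a, pred_b)]
```

Applied to a data set of N direct mutations, `augment_with_reverse` yields 2N training examples, which matches how S5294 is described as S2647 direct plus S2647 inverse mutations. In practice the weight passed to `weighted_average` would be chosen on validation data; the abstract does not state the value used.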

Source
http://dx.doi.org/10.1021/acs.jcim.2c00799
