Motivation: Genome sequencing is producing an ever-increasing number of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains the prediction of the amino acids in a given protein sequence that are involved in protein-protein interactions. Such predictions are typically based on machine learning methods that take advantage of the properties and sequence positions of amino acids that are known to be involved in interactions. In this paper, we evaluate the importance of various features using Random Forest (RF), and include backbone flexibility predicted from sequence as a novel feature to further optimise protein interface prediction.

Results: We observe that no single sequence feature enables pinpointing interacting sites in our Random Forest models. However, combining different properties does increase the performance of interface prediction. Our homomeric-trained RF interface predictor is able to distinguish interface from non-interface residues with an area under the ROC curve of 0.72 on a homomeric test set. The heteromeric-trained RF interface predictor performs better than existing predictors on an independent heteromeric test set. We trained a more general predictor on the combined homomeric and heteromeric dataset, and show that in addition to predicting homomeric interfaces, it is also able to pinpoint interface residues in heterodimers. This suggests that our Random Forest model and the features included capture common properties of both homodimer and heterodimer interfaces.
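
To make the reported metric concrete, the following is a minimal sketch of how an area under the ROC curve can be computed for per-residue predictions. It is not the authors' evaluation code: the function name `roc_auc`, the toy labels and the toy scores are all assumptions made for illustration. The sketch uses the standard identity that the AUC equals the probability that a randomly chosen interface residue receives a higher score than a randomly chosen non-interface residue, with ties counted as one half.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (Mann-Whitney identity).

    labels: 1 for interface residues, 0 for non-interface residues.
    scores: predicted interface score per residue (higher = more interface-like).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    # Split the scores by true class.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute an AUC")

    # Count interface/non-interface pairs that are ranked correctly;
    # a tie contributes half a correctly ranked pair.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: eight residues of one protein chain.
toy_labels = [1, 1, 0, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1, 0.7, 0.3, 0.6]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

Under this reading, an AUC of 0.72 means that a randomly drawn interface residue outscores a randomly drawn non-interface residue about 72% of the time, while 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of residues, which is acceptable for a sketch; a rank-based implementation would be preferable for full datasets.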

Availability and Implementation: The predictors and test datasets used in our analyses are freely available (http://www.ibi.vu.nl/downloads/RF_PPI/).

Contact: k.a.feenstra@vu.nl.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source: http://dx.doi.org/10.1093/bioinformatics/btx005
