AI Article Synopsis

  • Accurate identification of protein-DNA binding sites is crucial for understanding protein function and for drug design, but machine-learning methods are limited by data imbalance: nonbinding residues vastly outnumber binding residues.
  • This study introduces a two-stage machine-learning algorithm called E-HDSVM, which combines a new under-sampling method (HD-US) with an enhanced AdaBoost algorithm (EAdaBoost) to address the data imbalance.
  • E-HDSVM was validated by cross-validation and independent tests on benchmark data sets. The resulting sequence-based predictor, DNAPred, achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets, outperformed several existing predictors, and is freely available for academic use.

Article Abstract

Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. 
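The distance-based selection idea described above can be illustrated with a small sketch. This is not the authors' HD-US implementation: the abstract does not say which samples are retained, how the iterations proceed, or which SVM kernel is used. The sketch assumes a linear hyperplane given by hypothetical parameters w and b, and assumes (purely for illustration) that the negative samples closest to the hyperplane are the ones kept.

```python
import math

def hyperplane_distance(x, w, b):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def undersample_by_distance(negatives, w, b, keep):
    """Return the `keep` negative samples nearest the hyperplane.

    ASSUMPTION: keeping the nearest samples is an illustrative rule only;
    the abstract does not specify the selection rule used by HD-US.
    """
    ranked = sorted(negatives, key=lambda x: hyperplane_distance(x, w, b))
    return ranked[:keep]

# Hypothetical linear hyperplane and toy negative-class samples.
w, b = (1.0, 1.0), -1.0
negatives = [(0.0, 0.0), (0.4, 0.4), (-2.0, -2.0), (0.5, 0.2), (-1.0, 0.0)]

selected = undersample_by_distance(negatives, w, b, keep=2)
for x in selected:
    print(x, round(hyperplane_distance(x, w, b), 3))
```

In a full pipeline, the hyperplane would presumably come from an SVM trained on the current subset, and the selection would be repeated to produce several subsets, one per SVM in the ensemble.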
The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.
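Overall accuracy and the Matthews correlation coefficient (MCC) can diverge sharply on imbalanced data, which is why both are reported. The sketch below computes both from a confusion matrix. The counts are hypothetical and chosen only to show the effect; they are not taken from the paper's benchmark data.

```python
import math

def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced test set: 100 binding, 900 nonbinding residues.
# Case 1: a classifier that labels every residue as nonbinding.
trivial = dict(tp=0, fn=100, fp=0, tn=900)
# Case 2: a classifier that recovers some of the binding residues.
useful = dict(tp=40, fn=60, fp=30, tn=870)

for name, cm in (("trivial", trivial), ("useful", useful)):
    print(name, round(accuracy(**cm), 3), round(mcc(**cm), 3))
```

The trivial classifier reaches 90% accuracy yet has an MCC of zero, so a high accuracy alone says little about how well the minority (binding) class is predicted.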


Source
http://dx.doi.org/10.1021/acs.jcim.8b00749

Publication Analysis

Top Keywords

protein-dna binding: 20
support vector: 12
binding sites: 12
accurate identification: 8
ensembled hyperplane-distance-based: 8
hyperplane-distance-based support: 8
vector machines: 8
imbalanced learning: 8
algorithm called: 8
stage e-hdsvm: 8
