SOLpro: accurate sequence-based prediction of protein solubility.

Christophe N Magnan, Arlo Randall, Pierre Baldi

Bioinformatics

Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California, Irvine, CA, USA.

Published: September 2009

Motivation: Protein insolubility is a major obstacle for many experimental studies. A sequence-based prediction method able to accurately predict the propensity of a protein to be soluble on overexpression could be used, for instance, to prioritize targets in large-scale proteomics projects and to identify mutations likely to increase the solubility of insoluble proteins.

Results: Here, we first curate a large, non-redundant and balanced training set of more than 17 000 proteins. Next, we extract and study 23 groups of features computed directly or predicted (e.g. secondary structure) from the primary sequence. The data and the features are used to train a two-stage support vector machine (SVM) architecture. The resulting predictor, SOLpro, is compared directly with existing methods and shows significant improvement according to standard evaluation metrics, with an overall accuracy of over 74% estimated using multiple runs of 10-fold cross-validation.
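The evaluation protocol described above (sequence-derived features, a balanced data set, accuracy estimated by 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the SOLpro implementation. Amino-acid composition is assumed here as one plausible feature group, the nearest-centroid classifier is a simple stand-in for the two-stage SVM, and all function names are hypothetical.

```python
from collections import Counter
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the 20 standard amino-acid frequencies of a protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]

def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[(j + offset) % k].append(i)
        offset += len(indices)
    return folds

def train_nearest_centroid(X, y):
    """Stand-in classifier (NOT an SVM): predict the class of the closest mean vector."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )
    return predict

def cross_validated_accuracy(features, labels, train_fn, k=10, seed=0):
    """Estimate accuracy by training on k-1 folds and testing on the held-out fold."""
    correct = 0
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predict = train_fn(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
        )
        correct += sum(predict(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)
```

To use it, compute `composition(seq)` for every sequence, pair the vectors with 0/1 solubility labels, and pass them to `cross_validated_accuracy`. Repeating the call with different `seed` values approximates the "multiple runs" of cross-validation mentioned in the abstract.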

DOI: http://dx.doi.org/10.1093/bioinformatics/btp386
