Motivation: Transcription factors (TFs) are DNA-binding proteins that regulate gene expression. Traditional methods predict a protein as a TF if the protein contains any DNA-binding domains (DBDs) of known TFs. However, this approach fails to identify a novel TF that does not contain any known DBDs. Recently proposed TF prediction methods do not rely on DBDs. Such methods use features of protein sequences to train a machine learning model, and then use the trained model to predict whether a protein is a TF or not. Because the 3-dimensional (3D) structure of a protein captures more information than its sequence, using 3D protein structures will likely allow for more accurate prediction of novel TFs.
Results: We propose a deep learning-based TF prediction method (StrucTFactor), which is the first method to utilize 3D secondary structural information of proteins. We compare StrucTFactor with recent state-of-the-art TF prediction methods based on ∼525 000 proteins across 12 datasets, capturing different aspects of data bias (including sequence redundancy) possibly influencing a method's performance. We find that StrucTFactor significantly (p-value<0.001) outperforms the existing TF prediction methods, improving the performance over its closest competitor by up to 17% based on Matthews correlation coefficient.
Availability: Data and source code are available at https://github.com/lieboldj/StrucTFactor and on our website at https://apps.cosy.bio/StrucTFactor/.
Supplementary Information: Included.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1093/bioinformatics/btae762 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!