Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks.

PLoS One

Softberry Inc., Mount Kisco, United States of America.

Published: August 2017

AI Article Synopsis

  • The study addresses the challenge of accurately identifying promoters—key DNA regions that initiate transcription—using Convolutional Neural Networks (CNN) to analyze sequence features across different organisms, including humans, mice, plants, and bacteria.
  • CNN models classified both TATA and non-TATA promoters with high accuracy: Sn/Sp/CC reached 0.95/0.98/0.90 (TATA) and 0.90/0.98/0.89 (non-TATA) for human sequences, and 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA) for Arabidopsis, indicating that the deep learning approach captures complex promoter characteristics.
  • A new program, CNNProm, has been developed to utilize these CNN models for promoter prediction, which can be broadly applied to various genomes, and includes a random substitution method to identify conserved functional elements without needing specific promoter features.
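
The random substitution idea in the last point can be illustrated with a small sketch. This is one plausible reading of the procedure, not the CNNProm implementation: a window of the sequence is overwritten with random nucleotides, and the average drop in the model score marks positions the model depends on. The scoring function `toy_score`, the motif position, the window size, and the trial count are all hypothetical stand-ins; in practice the score would come from a trained CNN.

```python
import random

ALPHABET = "ACGT"
MOTIF, MOTIF_POS = "TATAAA", 10  # hypothetical motif and position, for the toy scorer only


def toy_score(seq):
    """Hypothetical stand-in for a trained model: fraction of motif bases present."""
    hits = sum(seq[MOTIF_POS + i] == base for i, base in enumerate(MOTIF))
    return hits / len(MOTIF)


def substitution_profile(seq, score_fn, window=6, trials=200, seed=0):
    """Average score drop when each window is replaced by random bases."""
    rng = random.Random(seed)
    baseline = score_fn(seq)
    profile = []
    for start in range(len(seq) - window + 1):
        total_drop = 0.0
        for _ in range(trials):
            rand = "".join(rng.choice(ALPHABET) for _ in range(window))
            mutated = seq[:start] + rand + seq[start + window:]
            total_drop += baseline - score_fn(mutated)
        profile.append(total_drop / trials)
    return profile


# Made-up 30-base sequence with the motif planted at index 10.
seq = "GCGCGCGCGC" + "TATAAA" + "GCGCGCGCGCGCGC"
profile = substitution_profile(seq, toy_score)
peak = max(range(len(profile)), key=profile.__getitem__)
print("window start with the largest average score drop:", peak)
```

Windows that do not overlap the planted motif leave the toy score unchanged, so their profile value is zero; the largest drop occurs where the window covers the motif.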

Article Abstract

Accurate computational identification of promoters remains a challenge because these key DNA regulatory regions have variable structures composed of functional motifs that provide gene-specific initiation of transcription. In this paper we use Convolutional Neural Networks (CNN) to analyze sequence characteristics of prokaryotic and eukaryotic promoters and build predictive models for them. We trained a similar CNN architecture on promoters of five distant organisms: human, mouse, plant (Arabidopsis), and two bacteria (Escherichia coli and Bacillus subtilis). We found that a CNN trained on the sigma70 subclass of Escherichia coli promoters gives an excellent classification of promoter and non-promoter sequences (Sn = 0.90, Sp = 0.96, CC = 0.84). The CNN model for Bacillus subtilis promoter identification achieves Sn = 0.91, Sp = 0.95, and CC = 0.86. For human, mouse and Arabidopsis promoters we employed CNNs to identify two well-known promoter classes (TATA and non-TATA promoters). The CNN models recognize these complex functional regions well. For human promoters, the Sn/Sp/CC accuracy of prediction reached 0.95/0.98/0.90 on TATA and 0.90/0.98/0.89 on non-TATA promoter sequences. For Arabidopsis we observed Sn/Sp/CC of 0.95/0.97/0.91 (TATA) and 0.94/0.94/0.86 (non-TATA). Thus, the developed CNN models, implemented in the CNNProm program, demonstrate the ability of the deep learning approach to grasp complex promoter sequence characteristics and achieve significantly higher accuracy than previously developed promoter prediction programs. We also propose a random substitution procedure to discover positionally conserved promoter functional elements. As the suggested approach does not require knowledge of any specific promoter features, it can be easily extended to identify promoters and other complex functional regions in sequences of many other, and especially newly sequenced, genomes.
The CNNProm program is available to run on the web server at http://www.softberry.com.
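
The abstract reports Sn, Sp, and CC without defining them. Assuming they follow the usual convention (sensitivity, specificity, and the Matthews correlation coefficient), they can be computed from confusion-matrix counts as in the minimal sketch below. The function name and the counts are illustrative only and are not taken from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from confusion-matrix counts.

    Assumed definitions (not stated in the abstract):
      Sn = TP / (TP + FN)            sensitivity
      Sp = TN / (TN + FP)            specificity
      CC = Matthews correlation coefficient
    """
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Made-up counts for illustration: 100 promoters and 100 non-promoters.
sn, sp, cc = classification_metrics(tp=90, fp=4, tn=96, fn=10)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

Because CC depends on all four counts, two test sets with the same Sn and Sp can yield different CC values when the ratio of promoters to non-promoters differs.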

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5291440 (PMC)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0171410 (PLOS)

Publication Analysis

Top Keywords (with occurrence counts)

promoters: 11
prokaryotic eukaryotic: 8
eukaryotic promoters: 8
deep learning: 8
neural networks: 8
sequence characteristics: 8
human mouse: 8
escherichia coli: 8
bacillus subtilis: 8
non-tata promoters: 8
