Identification of protein coding regions in RNA transcripts.

Nucleic Acids Res

Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA; Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Moscow, Russia

Published: July 2015

Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. The algorithm parameters are estimated by unsupervised training, which makes manual preparation of curated training sets unnecessary. We demonstrate that (i) the unsupervised training is robust with respect to the presence of transcript assembly errors and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting translation initiation sites in modelled as well as in assembled transcripts compares favourably to other existing methods.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4499116
DOI: http://dx.doi.org/10.1093/nar/gkv227
