PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions.

Bioinformatics

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar Street 32-D510, Cambridge, MA 02139, USA.

Published: July 2011

Motivation: As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models.
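The idea of scoring a region by comparing a coding model against a non-coding model can be illustrated with a deliberately simplified sketch. This is not the PhyloCSF algorithm: it uses only two aligned sequences, ignores the phylogenetic tree, and replaces the phylogenetic codon models with two toy substitution models. All names and parameter values (`MATCH`, `OMEGA`, `STOP`, `toy_score`) are hypothetical and chosen only for illustration, as is the choice to report the log-likelihood ratio in decibans.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
GENETIC_CODE = dict(zip(CODONS, AMINO))

# Hypothetical toy parameters, not taken from the paper.
MATCH = 0.8   # per-nucleotide probability of no change
OMEGA = 0.1   # down-weighting of amino-acid-changing substitutions
STOP = 1e-3   # down-weighting of substitutions that yield a stop codon


def noncoding_prob(x, y):
    """P(y | x) when each nucleotide evolves independently."""
    p = 1.0
    for a, b in zip(x, y):
        p *= MATCH if a == b else (1.0 - MATCH) / 3.0
    return p


def coding_weight(x, y):
    """Unnormalized coding-model weight: penalize stops and amino acid changes."""
    w = noncoding_prob(x, y)
    if GENETIC_CODE[y] == "*":
        return w * STOP
    if GENETIC_CODE[x] != GENETIC_CODE[y]:
        return w * OMEGA
    return w


def coding_prob(x, y):
    """P(y | x) under the toy coding model, normalized over all 64 codons."""
    total = sum(coding_weight(x, z) for z in CODONS)
    return coding_weight(x, y) / total


def split_codons(seq):
    """Split a sequence into complete codons, dropping any trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def toy_score(ref, other):
    """Log-likelihood ratio (coding vs non-coding) in decibans, summed over codons."""
    score = 0.0
    for x, y in zip(split_codons(ref), split_codons(other)):
        if x not in GENETIC_CODE or y not in GENETIC_CODE:
            continue  # skip codons containing gaps or ambiguous bases
        score += 10.0 * math.log10(coding_prob(x, y) / noncoding_prob(x, y))
    return score
```

Under these toy assumptions, an alignment whose differences are all synonymous receives a positive score, while one dominated by amino-acid-changing differences receives a negative score. A real analysis would instead evaluate the likelihood of the full multispecies alignment under each model.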

Results: We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds that of all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, which are often initially recognized by their lack of protein-coding potential rather than by conserved RNA secondary structures.

Availability and Implementation: The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF.

Contact: mlin@mit.edu; manoli@mit.edu

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3117341
DOI: http://dx.doi.org/10.1093/bioinformatics/btr209
