Background: Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI's RefSeq project and subsequently processed by NCBI's eukaryotic annotation pipeline. Genome annotation results are affected by differences in available supporting evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol that integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.
Results: We assessed an ortholog dataset spanning 34 annotated vertebrate RefSeq genomes, including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian "core proteins" for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests designed to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins and thus operates independently of other quality metrics such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].
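As an illustration of how such a metric can operate directly on annotated proteins, the following is a minimal sketch assuming per-protein boolean test outcomes; the test names and the pass criterion are hypothetical stand-ins, not the RefSeq pipeline's actual test suite.

```python
# Minimal sketch of a protein-level annotation quality metric, assuming
# per-protein boolean test results. The three tests are illustrative
# placeholders; the paper's actual test suite is not reproduced here.
from dataclasses import dataclass

@dataclass
class ProteinTests:
    protein_id: str
    splicing_conserved: bool   # splicing conserved vs. reference genomes
    sequence_typical: bool     # sequence similarity within expected range
    domains_consistent: bool   # domain annotation agrees with orthologs

    def passes(self) -> bool:
        # A protein "passes" only if no test flags it as unusual.
        return (self.splicing_conserved
                and self.sequence_typical
                and self.domains_consistent)

def annotation_quality(proteins: list[ProteinTests]) -> float:
    """Fraction of analyzed proteins passing all tests (0.0-1.0)."""
    if not proteins:
        return 0.0
    return sum(p.passes() for p in proteins) / len(proteins)

if __name__ == "__main__":
    demo = [
        ProteinTests("NP_000001", True, True, True),
        ProteinTests("NP_000002", False, True, True),  # flagged: splicing
    ]
    print(f"quality = {annotation_quality(demo):.2f}")  # quality = 0.50
```

Because the metric is a simple pass fraction over protein-level tests, it can be computed for any annotated genome without transcript evidence or assembly statistics, which is the independence the abstract describes.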
Conclusions: Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology score well on these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations. It will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline and by flagging specific genes for manual curation.
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3882889
DOI: http://dx.doi.org/10.1186/1471-2164-14-654
Optom Vis Sci
January 2025
Johnson & Johnson MedTech (Vision), Irvine, California.
Significance: Optimal meibography utilization and interpretation are hindered by poor lid presentation, blurry images, image artifacts, and the challenges of applying clinical grading scales. These results, based on the largest image dataset analyzed to date, demonstrate the development of algorithms that provide standardized, real-time inference addressing all of these limitations.
Purpose: This study aimed to develop and validate an algorithmic pipeline to automate and standardize meibomian gland absence assessment and interpretation.
Int J Comput Assist Radiol Surg
January 2025
Computer Vision and Image Processing Lab., UofL, Louisville, KY, 40292, USA.
Purpose: This article introduces a novel deep learning approach that substantially improves the accuracy of colon segmentation even with limited data annotation, thereby enhancing the overall effectiveness of the CT colonography pipeline in clinical settings.
Methods: The proposed approach integrates 3D contextual information via guided sequential episodic training, in which a query CT slice is segmented by exploiting its previous labeled CT slice (i.e.
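The snippet above is truncated, but the core idea it names, conditioning each slice's segmentation on the previous slice's label, can be sketched as follows. This is a minimal assumed illustration in PyTorch; the tiny network, the propagation loop, and all shapes are placeholders, not the authors' architecture or training procedure.

```python
# Hedged sketch: sequential slice-by-slice segmentation in which each CT
# slice is segmented conditioned on the previous slice's predicted mask.
# The network and loop are illustrative, not the paper's actual method.
import torch
import torch.nn as nn

class GuidedSliceSegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: current slice (1 ch) + previous mask (1 ch) = 2 channels.
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, slice_2d: torch.Tensor, prev_mask: torch.Tensor):
        return self.net(torch.cat([slice_2d, prev_mask], dim=1))

def segment_volume(model: nn.Module, volume: torch.Tensor,
                   first_mask: torch.Tensor) -> torch.Tensor:
    """Propagate segmentation through a (D, 1, H, W) CT volume."""
    masks, prev = [], first_mask
    with torch.no_grad():
        for d in range(volume.shape[0]):
            prev = model(volume[d:d+1], prev)  # condition on previous mask
            masks.append(prev)
    return torch.cat(masks, dim=0)

# Toy usage: random 8-slice volume, empty initial mask.
model = GuidedSliceSegmenter()
vol = torch.rand(8, 1, 64, 64)
init = torch.zeros(1, 1, 64, 64)
print(segment_volume(model, vol, init).shape)  # torch.Size([8, 1, 64, 64])
```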
Comput Methods Programs Biomed
January 2025
Laberit, Avda. de Catalunya, 9, València, 46020, Spain.
Background And Objective: Despite significant investments in the normalization and standardization of Electronic Health Records (EHRs), free text remains the rule rather than the exception in clinical notes. The use of free text has implications for data reuse methods that support clinical research, since the query mechanisms used in cohort definition and patient matching are based mainly on structured data and clinical terminologies. This study aims to develop a method for the secondary use of clinical text by: (a) using Natural Language Processing (NLP) to tag clinical notes with biomedical terminology; and (b) designing an ontology that maps and classifies all the identified tags to various terminologies and allows phenotyping queries to be run.
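The tag-then-map mechanism described in (a) and (b) can be illustrated with a minimal sketch; the term dictionary, tag names, and code mappings below are hypothetical stand-ins, not the study's actual NLP pipeline, ontology, or terminology resources.

```python
# Minimal sketch of the tag-then-map idea: (a) tag free-text notes against
# a biomedical term dictionary, then (b) map tags to terminology codes so
# structured phenotyping queries can run over them. All terms, codes, and
# mappings here are hypothetical examples.
import re

# (a) Hypothetical term dictionary: surface form -> local tag.
TERM_DICT = {
    "type 2 diabetes": "T2D",
    "hypertension": "HTN",
}

# (b) Hypothetical ontology mapping: local tag -> terminology codes.
ONTOLOGY = {
    "T2D": {"SNOMED_CT": "44054006", "ICD10": "E11"},
    "HTN": {"SNOMED_CT": "38341003", "ICD10": "I10"},
}

def tag_note(note: str) -> list[str]:
    """Return the local tags whose surface forms occur in the note."""
    text = note.lower()
    return [tag for term, tag in TERM_DICT.items()
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

def phenotype_query(notes: dict[str, str],
                    code_system: str, code: str) -> list[str]:
    """Find patients whose notes carry a tag mapped to the given code."""
    hits = []
    for patient, note in notes.items():
        if any(ONTOLOGY.get(t, {}).get(code_system) == code
               for t in tag_note(note)):
            hits.append(patient)
    return hits

notes = {
    "p1": "History of Type 2 Diabetes, well controlled.",
    "p2": "Presents with hypertension; no diabetes.",
}
print(phenotype_query(notes, "ICD10", "E11"))  # ['p1']
```

The design point is that once tags are mapped to standard codes, cohort queries written against a terminology work over free text exactly as they would over structured fields.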
NAR Genom Bioinform
March 2025
Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Tomtebodavägen 23A, 17165 Solna, Sweden.
Understanding the role of transcription and transcription factors (TFs) in cellular identity and disease, such as cancer, is essential. However, comprehensive data resources for cell line-specific TF-to-target gene annotations are currently limited. To address this, we employed a straightforward method to define regulons that capture the cell-specific aspects of TF binding and transcript expression levels.
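The abstract does not detail the method, but one straightforward way to define a cell line-specific regulon, intersecting a TF's bound target genes with the genes actually expressed in that cell line, can be sketched as follows; the input tables, gene names, and expression threshold are assumptions for illustration only.

```python
# Hedged sketch: define a cell line-specific regulon as the TF's bound
# target genes that are also expressed in that cell line. Inputs and the
# TPM cutoff are illustrative assumptions, not the study's actual data.
def define_regulon(tf_targets: set[str],
                   expression: dict[str, float],
                   min_tpm: float = 1.0) -> set[str]:
    """Keep bound targets whose expression passes the cutoff."""
    return {g for g in tf_targets if expression.get(g, 0.0) >= min_tpm}

# Toy example: hypothetical binding targets and TPM values for one cell line.
tp53_targets = {"CDKN1A", "MDM2", "BAX", "GADD45A"}
cell_line_tpm = {"CDKN1A": 52.1, "MDM2": 8.4, "BAX": 0.3, "GADD45A": 12.9}

print(sorted(define_regulon(tp53_targets, cell_line_tpm)))
# ['CDKN1A', 'GADD45A', 'MDM2']  (BAX filtered: below 1.0 TPM)
```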
Appl Clin Inform
January 2025
Institute for Artificial Intelligence in Medicine, University Hospital Essen, Essen, Germany.
Objective: Commercially available large language models such as Chat Generative Pre-Trained Transformer (ChatGPT) cannot be applied to real patient data for data protection reasons. At the same time, de-identification of clinical unstructured data is a tedious and time-consuming task when done manually. Since transformer models can efficiently process and analyze large amounts of text data, our study aims to explore the impact of a large training dataset on the performance of this task.
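As an illustration of the de-identification task itself (not the study's model or training setup), the sketch below masks identifier spans flagged by a token-classification model; the model checkpoint name is a placeholder and the emitted label set is an assumption.

```python
# Hedged sketch of transformer-based de-identification: run a token-
# classification (NER) model over a note and mask every detected entity.
# The checkpoint is a hypothetical placeholder, not the study's model;
# any NER checkpoint emitting character offsets would work the same way.
from transformers import pipeline

ner = pipeline("token-classification",
               model="PLACEHOLDER/clinical-deid-model",  # hypothetical
               aggregation_strategy="simple")

def deidentify(note: str) -> str:
    """Replace each detected entity span with a [LABEL] placeholder."""
    entities = ner(note)
    # Replace from the end so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        note = note[:ent["start"]] + f"[{ent['entity_group']}]" + note[ent["end"]:]
    return note

print(deidentify("Pt. John Smith, seen in Essen on 03/05/2024, reports chest pain."))
# e.g. "Pt. [PER], seen in [LOC] on [DATE], reports chest pain."
```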